lesson

SQL Foundations

Window functions, CTEs, subqueries, and query optimisation — the SQL skills every data engineering interview tests.

SQL Foundations

Why SQL is Non-Negotiable

SQL remains the lingua franca of data. Even in a world of Spark, dbt, and Flink, every interview starts with SQL. Data engineers need *fluency*, not just familiarity.


Window Functions

Window functions compute a value across a set of rows related to the current row — without collapsing them like GROUP BY does.

SELECT
  employee_id,
  department,
  salary,
  AVG(salary) OVER (PARTITION BY department)   AS dept_avg,
  RANK()       OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank,
  LAG(salary)  OVER (ORDER BY hire_date)        AS prev_hire_salary
FROM employees;

Key window functions:

FunctionPurpose
ROW_NUMBER()Unique sequential integer per partition
RANK() / DENSE_RANK()Rank with / without gaps on ties
LAG(col, n) / LEAD(col, n)Value n rows before / after
SUM / AVG / COUNT OVERRunning or partitioned aggregates
NTILE(n)Divide rows into n buckets
FIRST_VALUE / LAST_VALUEFirst or last value in window frame

Frame Clauses

SUM(sales) OVER (
  PARTITION BY region
  ORDER BY month
  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW   -- running total
)

ROWS — physical rows. RANGE — logical range (handles ties differently). Default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.


Common Table Expressions (CTEs)

CTEs (WITH clauses) make complex queries readable and reusable within one statement.

WITH daily_revenue AS (
  SELECT
    DATE_TRUNC('day', order_date) AS day,
    SUM(amount)                   AS revenue
  FROM orders
  GROUP BY 1
),
moving_avg AS (
  SELECT
    day,
    revenue,
    AVG(revenue) OVER (ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS ma_7d
  FROM daily_revenue
)
SELECT * FROM moving_avg WHERE day >= CURRENT_DATE - 30;

Recursive CTEs — used for hierarchical data (org charts, paths in a graph):

WITH RECURSIVE org_tree AS (
  SELECT id, manager_id, name, 0 AS depth
  FROM employees WHERE manager_id IS NULL        -- anchor: CEO

UNION ALL

SELECT e.id, e.manager_id, e.name, ot.depth + 1 FROM employees e JOIN org_tree ot ON e.manager_id = ot.id -- recursive step ) SELECT * FROM org_tree ORDER BY depth, name;


Query Optimisation

EXPLAIN / EXPLAIN ANALYZE

Always check the query plan before declaring a query "optimised":

EXPLAIN ANALYZE
SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id;

Look for: Seq Scan (table scan — may need an index), Hash Join vs Nested Loop vs Merge Join, estimated vs actual rows.

Index Types

TypeBest for
B-Tree (default)Equality, range, ORDER BY, LIKE 'prefix%'
HashEquality only
GINArrays, JSONB, full-text search
BRINVery large, naturally ordered tables (timestamps)

Common Patterns

  • Avoid SELECT * — read only what you need; reduces I/O and disables index-only scans.
  • Filter early — push predicates into CTEs or subqueries; don't filter after a large join.
  • Partial indexes — index only the rows you query: CREATE INDEX ON orders(created_at) WHERE status = 'pending';
  • Covering indexes — include all columns referenced in a query to enable index-only scans.
  • NOT IN vs NOT EXISTSNOT IN returns NULL if any element is NULL; prefer NOT EXISTS for safety.

  • Interview Patterns to Memorise

  • Running totalSUM() OVER (ORDER BY date ROWS UNBOUNDED PRECEDING)
  • YoY growthLAG(metric, 12) OVER (PARTITION BY category ORDER BY month)
  • Top-N per group → CTE with ROW_NUMBER(), filter on rn = 1
  • Deduplication → CTE with ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC), keep rn = 1
  • Session gapsLAG(event_time) to find gaps > threshold
  • Pivot → conditional aggregation: SUM(CASE WHEN category = 'A' THEN amount END)
  • Sign in to use the AI study buddy on this lesson.

    Resources