SQL Foundations
Window functions, CTEs, subqueries, and query optimisation — the SQL skills every data engineering interview tests.
SQL Foundations
Why SQL is Non-Negotiable
SQL remains the lingua franca of data. Even in a world of Spark, dbt, and Flink, every interview starts with SQL. Data engineers need *fluency*, not just familiarity.
Window Functions
Window functions compute a value across a set of rows related to the current row — without collapsing them like GROUP BY does.
SELECT
employee_id,
department,
salary,
AVG(salary) OVER (PARTITION BY department) AS dept_avg,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank,
LAG(salary) OVER (ORDER BY hire_date) AS prev_hire_salary
FROM employees;Key window functions:
| Function | Purpose |
|---|---|
ROW_NUMBER() | Unique sequential integer per partition |
RANK() / DENSE_RANK() | Rank with / without gaps on ties |
LAG(col, n) / LEAD(col, n) | Value n rows before / after |
SUM / AVG / COUNT OVER | Running or partitioned aggregates |
NTILE(n) | Divide rows into n buckets |
FIRST_VALUE / LAST_VALUE | First or last value in window frame |
Frame Clauses
SUM(sales) OVER (
PARTITION BY region
ORDER BY month
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW -- running total
)ROWS — physical rows. RANGE — logical range (handles ties differently). Default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
Common Table Expressions (CTEs)
CTEs (WITH clauses) make complex queries readable and reusable within one statement.
WITH daily_revenue AS (
SELECT
DATE_TRUNC('day', order_date) AS day,
SUM(amount) AS revenue
FROM orders
GROUP BY 1
),
moving_avg AS (
SELECT
day,
revenue,
AVG(revenue) OVER (ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS ma_7d
FROM daily_revenue
)
SELECT * FROM moving_avg WHERE day >= CURRENT_DATE - 30;Recursive CTEs — used for hierarchical data (org charts, paths in a graph):
WITH RECURSIVE org_tree AS (
SELECT id, manager_id, name, 0 AS depth
FROM employees WHERE manager_id IS NULL -- anchor: CEO UNION ALL
SELECT e.id, e.manager_id, e.name, ot.depth + 1
FROM employees e
JOIN org_tree ot ON e.manager_id = ot.id -- recursive step
)
SELECT * FROM org_tree ORDER BY depth, name;
Query Optimisation
EXPLAIN / EXPLAIN ANALYZE
Always check the query plan before declaring a query "optimised":
EXPLAIN ANALYZE
SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id;Look for: Seq Scan (table scan — may need an index), Hash Join vs Nested Loop vs Merge Join, estimated vs actual rows.
Index Types
| Type | Best for |
|---|---|
| B-Tree (default) | Equality, range, ORDER BY, LIKE 'prefix%' |
| Hash | Equality only |
| GIN | Arrays, JSONB, full-text search |
| BRIN | Very large, naturally ordered tables (timestamps) |
Common Patterns
SELECT * — read only what you need; reduces I/O and disables index-only scans.CREATE INDEX ON orders(created_at) WHERE status = 'pending';NOT IN vs NOT EXISTS — NOT IN returns NULL if any element is NULL; prefer NOT EXISTS for safety.Interview Patterns to Memorise
SUM() OVER (ORDER BY date ROWS UNBOUNDED PRECEDING)LAG(metric, 12) OVER (PARTITION BY category ORDER BY month)ROW_NUMBER(), filter on rn = 1ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC), keep rn = 1LAG(event_time) to find gaps > thresholdSUM(CASE WHEN category = 'A' THEN amount END)Sign in to use the AI study buddy on this lesson.