System Design: Data Warehouse

Design a scalable data warehouse for a growing e-commerce company. Walk through decisions on storage, processing, modelling, and serving.

The Problem

Design a data warehouse for an e-commerce company with 5M daily orders, 50+ data sources, and a team of 20 analysts. Stakeholders need dashboards updated every hour. The team uses Python and SQL.


Framework: Ask Before You Design

Before jumping to solutions, clarify:

  • Scale: How many orders/day? How many historical years of data?
  • Latency requirement: Real-time, near-real-time (minutes), or batch (hourly/daily)?
  • Query patterns: Aggregates? Ad-hoc exploration? ML feature generation?
  • Team skills: SQL-only? Python? Spark?
  • Budget: Managed cloud vs self-hosted?

    Architecture Decision: Cloud Warehouse

    For 5M orders/day ≈ 150M rows/month, a managed cloud warehouse is appropriate:

    Option      Best for
    ----------  -------------------------------------------------------
    Snowflake   Multi-cloud, fine-grained compute control, data sharing
    BigQuery    GCP ecosystem, serverless, low-maintenance
    Redshift    AWS ecosystem, tight integration with S3/Glue
    Databricks  Spark workloads, ML, open format (Delta Lake)

    For most teams, recommend Snowflake or BigQuery: managed, scalable, and SQL-native.
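The sizing claim above is worth sanity-checking out loud in an interview. A minimal back-of-envelope sketch (the 500-byte row width is an assumption for illustration, not a figure from the problem statement):

```python
# Back-of-envelope sizing. Order volume comes from the problem statement;
# the row width is an assumed average for a wide fact_orders row.
ORDERS_PER_DAY = 5_000_000
DAYS_PER_MONTH = 30
BYTES_PER_ROW = 500  # assumption: typical wide fact row, uncompressed

rows_per_month = ORDERS_PER_DAY * DAYS_PER_MONTH
gb_per_month = rows_per_month * BYTES_PER_ROW / 1e9

print(f"{rows_per_month:,} rows/month, ~{gb_per_month:.0f} GB/month uncompressed")
# 150,000,000 rows/month, ~75 GB/month uncompressed
```

At roughly 75 GB/month before columnar compression, this is squarely in managed-warehouse territory: large enough that a single Postgres instance will strain, small enough that any of the four options handles it comfortably.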


    Ingestion Layer

    Three categories of sources:

  • Operational DB (Postgres/MySQL) → Fivetran/Airbyte CDC or nightly dump → Bronze layer
  • Event streams (Kafka/Kinesis) → Spark Streaming or Flink → Bronze layer
  • Third-party APIs (Stripe, Google Ads, Salesforce) → Fivetran connectors
    All land in a raw/bronze schema: no transformation, append-only.
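The append-only invariant is the part worth demonstrating. A minimal sketch of the bronze-landing pattern, using sqlite3 as a stand-in for the warehouse (table and column names are illustrative; real pipelines would use Fivetran/Airbyte CDC or a streaming sink):

```python
import sqlite3
from datetime import datetime, timezone

# Stand-in warehouse. Bronze rows are stored as-is, with loader metadata.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE bronze_orders (
        raw_payload TEXT,   -- source row stored untouched (JSON here)
        _loaded_at  TEXT,   -- ingestion timestamp added by the loader
        _source     TEXT    -- which upstream system this came from
    )
""")

def land_batch(rows, source):
    """Append a batch of raw rows to bronze. No cleaning, no dedup."""
    loaded_at = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO bronze_orders VALUES (?, ?, ?)",
        [(row, loaded_at, source) for row in rows],
    )

land_batch(['{"order_id": 1}', '{"order_id": 2}'], source="postgres")
land_batch(['{"order_id": 1}'], source="postgres")  # duplicates are fine in bronze

count = conn.execute("SELECT COUNT(*) FROM bronze_orders").fetchone()[0]
print(count)  # 3: every append is kept; dedup happens downstream in staging
```

Keeping bronze dumb makes replays trivial: if a downstream transform has a bug, you rebuild from bronze rather than re-extracting from the source systems.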


    Transformation Layer

    Use dbt for SQL transforms:

  • Staging → clean and type-cast raw tables (1:1 with sources)
  • Intermediate → business joins (orders + customers + products)
  • Marts → dimensional model (fact_orders, dim_customer, dim_product)
  • Aggregates → pre-computed summaries for dashboards
    Run on an hourly schedule (matching the 1-hour SLA) via Airflow or dbt Cloud.
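dbt models are ultimately just SELECT statements, so the staging-to-mart flow can be sketched in plain SQL. Here it runs against sqlite3 as a stand-in warehouse; all table and column names are illustrative:

```python
import sqlite3

# Sketch of the staging -> mart flow. Each CREATE VIEW below corresponds
# to what would be one dbt model in the project.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- bronze: raw, untyped, exactly as landed
    CREATE TABLE raw_orders (order_id TEXT, amount TEXT, customer_id TEXT);
    INSERT INTO raw_orders VALUES ('1', '19.50', 'c1'), ('2', '5.25', 'c1'),
                                  ('3', '42.50', 'c2');

    -- staging: 1:1 with the source table, cleaned and type-cast
    CREATE VIEW stg_orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(amount   AS REAL)    AS amount,
           customer_id
    FROM raw_orders;

    -- aggregate mart: pre-computed summary for dashboards
    CREATE VIEW agg_revenue_by_customer AS
    SELECT customer_id, SUM(amount) AS revenue, COUNT(*) AS orders
    FROM stg_orders
    GROUP BY customer_id;
""")
print(conn.execute(
    "SELECT customer_id, revenue FROM agg_revenue_by_customer ORDER BY customer_id"
).fetchall())  # [('c1', 24.75), ('c2', 42.5)]
```

The layering buys testability: each layer has one job, so a failed dbt test pinpoints whether the problem is a bad cast (staging), a bad join (intermediate), or a bad aggregate (mart).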


    Serving Layer

  • BI dashboards → Tableau/Looker/Metabase connect directly to warehouse
  • Data API → FastAPI + SQLAlchemy for app teams needing structured queries
  • ML features → Feature store (Feast) or pre-computed feature tables in warehouse
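For the ML-features option, "pre-computed feature tables in the warehouse" just means a scheduled aggregate keyed by entity. A minimal sketch, again with sqlite3 standing in for the warehouse and all names illustrative:

```python
import sqlite3

# One row per customer, refreshed on the same hourly schedule as the marts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_orders (order_id INTEGER, customer_id TEXT, amount REAL);
    INSERT INTO fact_orders VALUES (1, 'c1', 10.0), (2, 'c1', 30.0), (3, 'c2', 5.0);

    CREATE TABLE customer_features AS
    SELECT customer_id,
           COUNT(*)    AS order_count,      -- frequency feature
           AVG(amount) AS avg_order_value,  -- monetary feature
           MAX(amount) AS max_order_value
    FROM fact_orders
    GROUP BY customer_id;
""")
print(conn.execute(
    "SELECT * FROM customer_features ORDER BY customer_id").fetchall())
# [('c1', 2, 20.0, 30.0), ('c2', 1, 5.0, 5.0)]
```

A dedicated feature store like Feast earns its keep mainly when you need low-latency online serving or point-in-time-correct training sets; for batch scoring, a warehouse table like this is often enough.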

    Reliability & Governance

  • Data quality: dbt tests on every mart (not_null, unique, row count assertions)
  • Monitoring: dbt source freshness alerts, Airflow task failure alerts
  • Access control: Row-level security in Snowflake; role-based access
  • Schema evolution: backward compatible changes (add columns, never drop); versioned models
  • Disaster recovery: point-in-time restore on warehouse; raw data in S3 is ground truth
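The dbt tests mentioned above compile down to simple SQL assertions. The same checks expressed directly in Python against a stand-in table (table and column names are illustrative), which is also how you would add custom checks dbt does not ship with:

```python
import sqlite3

# Seed a table with two deliberate quality problems: a NULL email and a
# duplicated customer_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id TEXT, email TEXT);
    INSERT INTO dim_customer VALUES ('c1', 'a@x.com'), ('c2', NULL),
                                    ('c2', 'b@x.com');
""")

def check_not_null(table, column):
    """Equivalent of dbt's not_null test: no NULLs allowed in the column."""
    n = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL").fetchone()[0]
    return n == 0

def check_unique(table, column):
    """Equivalent of dbt's unique test: no value appears more than once."""
    n = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT {column} FROM {table} "
        f"GROUP BY {column} HAVING COUNT(*) > 1)").fetchone()[0]
    return n == 0

print(check_not_null("dim_customer", "email"))      # False: c2 has a NULL email
print(check_unique("dim_customer", "customer_id"))  # False: c2 appears twice
```

Wiring checks like these to fail the pipeline run (rather than just log) is what keeps a broken mart from silently feeding stale or wrong numbers into the hourly dashboards.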

    What to Say in an Interview

    Structure your answer:

  • Clarify requirements (don't assume)
  • Sketch the high-level data flow: Source → Ingestion → Storage → Transform → Serve
  • Justify each technology choice with trade-offs
  • Address scale, latency, and failure modes
  • Mention testing and monitoring
    Avoid jumping straight to specific tools without justification.
