System Design: Data Warehouse

Design a scalable data warehouse for a growing e-commerce company. Walk through decisions on storage, processing, modelling, and serving.

The Problem

Design a data warehouse for an e-commerce company with 5M daily orders, 50+ data sources, and a team of 20 analysts. Stakeholders need dashboards updated every hour. The team uses Python and SQL.


Framework: Ask Before You Design

Before jumping to solutions, clarify:

  • Scale: How many orders/day? How many historical years of data?
  • Latency requirement: Real-time, near-real-time (minutes), or batch (hourly/daily)?
  • Query patterns: Aggregates? Ad-hoc exploration? ML feature generation?
  • Team skills: SQL-only? Python? Spark?
  • Budget: Managed cloud vs self-hosted?

    Architecture Decision: Cloud Warehouse

    For 5M orders/day ≈ 150M rows/month, a managed cloud warehouse is appropriate:

    Option      Best for
    ----------  -------------------------------------------------------
    Snowflake   Multi-cloud, fine-grained compute control, data sharing
    BigQuery    GCP ecosystem, serverless, low-maintenance
    Redshift    AWS ecosystem, tight integration with S3/Glue
    Databricks  Spark workloads, ML, open format (Delta Lake)

    For most teams, recommend Snowflake or BigQuery: managed, scalable, and SQL-native.
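The sizing claim above is worth sanity-checking out loud in an interview. A minimal back-of-envelope sketch (the 500-byte row width is an assumption for illustration, not a figure from the problem statement):

```python
# Back-of-envelope sizing. Order volume comes from the problem statement;
# the row width is an assumed average for a wide fact_orders row.
ORDERS_PER_DAY = 5_000_000
DAYS_PER_MONTH = 30
BYTES_PER_ROW = 500  # assumption: typical wide fact row, uncompressed

rows_per_month = ORDERS_PER_DAY * DAYS_PER_MONTH
gb_per_month = rows_per_month * BYTES_PER_ROW / 1e9

print(f"{rows_per_month:,} rows/month, ~{gb_per_month:.0f} GB/month uncompressed")
# 150,000,000 rows/month, ~75 GB/month uncompressed
```

At roughly 75 GB/month before columnar compression, this is squarely in managed-warehouse territory: large enough that a single Postgres instance will strain, small enough that any of the four options handles it comfortably.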


    Ingestion Layer

    Three categories of sources:

  • Operational DB (Postgres/MySQL) → Fivetran/Airbyte CDC or nightly dump → Bronze layer
  • Event streams (Kafka/Kinesis) → Spark Streaming or Flink → Bronze layer
  • Third-party APIs (Stripe, Google Ads, Salesforce) → Fivetran connectors
    All land in a raw/bronze schema: no transformation, append-only.
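The append-only invariant is the part worth demonstrating. A minimal sketch of the bronze-landing pattern, using sqlite3 as a stand-in for the warehouse (table and column names are illustrative; real pipelines would use Fivetran/Airbyte CDC or a streaming sink):

```python
import sqlite3
from datetime import datetime, timezone

# Stand-in warehouse. Bronze rows are stored as-is, with loader metadata.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE bronze_orders (
        raw_payload TEXT,   -- source row stored untouched (JSON here)
        _loaded_at  TEXT,   -- ingestion timestamp added by the loader
        _source     TEXT    -- which upstream system this came from
    )
""")

def land_batch(rows, source):
    """Append a batch of raw rows to bronze. No cleaning, no dedup."""
    loaded_at = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO bronze_orders VALUES (?, ?, ?)",
        [(row, loaded_at, source) for row in rows],
    )

land_batch(['{"order_id": 1}', '{"order_id": 2}'], source="postgres")
land_batch(['{"order_id": 1}'], source="postgres")  # duplicates are fine in bronze

count = conn.execute("SELECT COUNT(*) FROM bronze_orders").fetchone()[0]
print(count)  # 3: every append is kept; dedup happens downstream in staging
```

Keeping bronze dumb makes replays trivial: if a downstream transform has a bug, you rebuild from bronze rather than re-extracting from the source systems.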


    Transformation Layer

    Use dbt for SQL transforms:

  • Staging → clean and type-cast raw tables (1:1 with sources)
  • Intermediate → business joins (orders + customers + products)
  • Marts → dimensional model (fact_orders, dim_customer, dim_product)
  • Aggregates → pre-computed summaries for dashboards
    Run on an hourly schedule (matching the 1-hour SLA) via Airflow or dbt Cloud.
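dbt models are ultimately just SELECT statements, so the staging-to-mart flow can be sketched in plain SQL. Here it runs against sqlite3 as a stand-in warehouse; all table and column names are illustrative:

```python
import sqlite3

# Sketch of the staging -> mart flow. Each CREATE VIEW below corresponds
# to what would be one dbt model in the project.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- bronze: raw, untyped, exactly as landed
    CREATE TABLE raw_orders (order_id TEXT, amount TEXT, customer_id TEXT);
    INSERT INTO raw_orders VALUES ('1', '19.50', 'c1'), ('2', '5.25', 'c1'),
                                  ('3', '42.50', 'c2');

    -- staging: 1:1 with the source table, cleaned and type-cast
    CREATE VIEW stg_orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(amount   AS REAL)    AS amount,
           customer_id
    FROM raw_orders;

    -- aggregate mart: pre-computed summary for dashboards
    CREATE VIEW agg_revenue_by_customer AS
    SELECT customer_id, SUM(amount) AS revenue, COUNT(*) AS orders
    FROM stg_orders
    GROUP BY customer_id;
""")
print(conn.execute(
    "SELECT customer_id, revenue FROM agg_revenue_by_customer ORDER BY customer_id"
).fetchall())  # [('c1', 24.75), ('c2', 42.5)]
```

The layering buys testability: each layer has one job, so a failed dbt test pinpoints whether the problem is a bad cast (staging), a bad join (intermediate), or a bad aggregate (mart).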


    Serving Layer

  • BI dashboards → Tableau/Looker/Metabase connect directly to warehouse
  • Data API → FastAPI + SQLAlchemy for app teams needing structured queries
  • ML features → Feature store (Feast) or pre-computed feature tables in warehouse
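For the ML-features option, "pre-computed feature tables in the warehouse" just means a scheduled aggregate keyed by entity. A minimal sketch, again with sqlite3 standing in for the warehouse and all names illustrative:

```python
import sqlite3

# One row per customer, refreshed on the same hourly schedule as the marts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_orders (order_id INTEGER, customer_id TEXT, amount REAL);
    INSERT INTO fact_orders VALUES (1, 'c1', 10.0), (2, 'c1', 30.0), (3, 'c2', 5.0);

    CREATE TABLE customer_features AS
    SELECT customer_id,
           COUNT(*)    AS order_count,      -- frequency feature
           AVG(amount) AS avg_order_value,  -- monetary feature
           MAX(amount) AS max_order_value
    FROM fact_orders
    GROUP BY customer_id;
""")
print(conn.execute(
    "SELECT * FROM customer_features ORDER BY customer_id").fetchall())
# [('c1', 2, 20.0, 30.0), ('c2', 1, 5.0, 5.0)]
```

A dedicated feature store like Feast earns its keep mainly when you need low-latency online serving or point-in-time-correct training sets; for batch scoring, a warehouse table like this is often enough.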

    Reliability & Governance

  • Data quality: dbt tests on every mart (not_null, unique, row count assertions)
  • Monitoring: dbt source freshness alerts, Airflow task failure alerts
  • Access control: Row-level security in Snowflake; role-based access
  • Schema evolution: backward compatible changes (add columns, never drop); versioned models
  • Disaster recovery: point-in-time restore on warehouse; raw data in S3 is ground truth
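The dbt tests mentioned above compile down to simple SQL assertions. The same checks expressed directly in Python against a stand-in table (table and column names are illustrative), which is also how you would add custom checks dbt does not ship with:

```python
import sqlite3

# Seed a table with two deliberate quality problems: a NULL email and a
# duplicated customer_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id TEXT, email TEXT);
    INSERT INTO dim_customer VALUES ('c1', 'a@x.com'), ('c2', NULL),
                                    ('c2', 'b@x.com');
""")

def check_not_null(table, column):
    """Equivalent of dbt's not_null test: no NULLs allowed in the column."""
    n = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL").fetchone()[0]
    return n == 0

def check_unique(table, column):
    """Equivalent of dbt's unique test: no value appears more than once."""
    n = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT {column} FROM {table} "
        f"GROUP BY {column} HAVING COUNT(*) > 1)").fetchone()[0]
    return n == 0

print(check_not_null("dim_customer", "email"))      # False: c2 has a NULL email
print(check_unique("dim_customer", "customer_id"))  # False: c2 appears twice
```

Wiring checks like these to fail the pipeline run (rather than just log) is what keeps a broken mart from silently feeding stale or wrong numbers into the hourly dashboards.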

    What to Say in an Interview

    Structure your answer:

  • Clarify requirements (don't assume)
  • Sketch the high-level data flow: Source → Ingestion → Storage → Transform → Serve
  • Justify each technology choice with trade-offs
  • Address scale, latency, and failure modes
  • Mention testing and monitoring
    Avoid jumping straight to specific tools without justification.
