lesson

Pipeline Architecture

Batch vs streaming, orchestration with Airflow, idempotency, and fault tolerance.

Pipeline Architecture

Batch vs Streaming

BatchStreaming
LatencyMinutes to hoursSub-second to seconds
ThroughputVery highMedium
ComplexityLowerHigher
CostLower (scheduled)Higher (always-on infra)
Use casesNightly reports, ETL, ML trainingFraud detection, live dashboards, alerting
Most data platforms run both: streaming for real-time signals, batch for heavy historical computation.


Apache Airflow

Airflow is the most common data orchestration tool. Key concepts:

  • DAG (Directed Acyclic Graph) — a workflow definition; Python code
  • Task — a unit of work (Python function, SQL query, Bash script)
  • Operator — task template (PythonOperator, BashOperator, PostgresOperator, etc.)
  • Scheduler — parses DAGs, triggers runs based on schedule
  • Executor — runs tasks (LocalExecutor, CeleryExecutor, KubernetesExecutor)
  • XCom — cross-task communication (small values only — avoid for large data)
  • from airflow import DAG
    from airflow.operators.python import PythonOperator
    from datetime import datetime

    def extract(): ... def transform(): ... def load(): ...

    with DAG("my_etl", start_date=datetime(2024, 1, 1), schedule="@daily") as dag: t1 = PythonOperator(task_id="extract", python_callable=extract) t2 = PythonOperator(task_id="transform", python_callable=transform) t3 = PythonOperator(task_id="load", python_callable=load) t1 >> t2 >> t3


    Idempotency

    A pipeline is idempotent if running it multiple times produces the same result as running it once. This is critical for safe retries.

    Patterns for idempotency:

  • Overwrite partitions instead of appending: INSERT OVERWRITE PARTITION (dt='2024-01-01')
  • Upsert / MERGE on a natural key
  • Truncate + reload for small dimension tables
  • Deduplication step before insert

  • Common Architecture Patterns

    Lambda Architecture

  • Batch layer: reprocesses all historical data periodically (ground truth)
  • Speed layer: real-time stream processing for low-latency approximations
  • Serving layer: merges batch and speed views
  • Drawback: two codebases to maintain (batch + streaming logic).

    Kappa Architecture

  • Single stream processing system handles both real-time and historical reprocessing
  • Simpler: one codebase, but requires replayable message log (Kafka)
  • Medallion Architecture (common in Databricks/Delta Lake)

  • Bronze: raw ingested data (no transformation)
  • Silver: cleaned, validated, deduplicated
  • Gold: business-ready aggregates, dimensional models

  • Fault Tolerance Checklist

  • [ ] Tasks are idempotent (safe to retry)
  • [ ] Dead letter queue / alerting for failed messages
  • [ ] Exactly-once or at-least-once semantics documented
  • [ ] Backfill strategy defined
  • [ ] Monitoring and alerting on lag, failure rate, data freshness
  • [ ] Schema evolution strategy (backward compatible changes)
  • Sign in to use the AI study buddy on this lesson.

    Resources