lesson

Pipeline Architecture

Batch vs streaming, orchestration with Airflow, idempotency, and fault tolerance.

Pipeline Architecture

Batch vs Streaming

Batch	Streaming
Latency	Minutes to hours	Sub-second to seconds
Throughput	Very high	Medium
Complexity	Lower	Higher
Cost	Lower (scheduled)	Higher (always-on infra)
Use cases	Nightly reports, ETL, ML training	Fraud detection, live dashboards, alerting

Most data platforms run both: streaming for real-time signals, batch for heavy historical computation.

Apache Airflow

Airflow is the most common data orchestration tool. Key concepts:

DAG (Directed Acyclic Graph) — a workflow definition; Python code

Task — a unit of work (Python function, SQL query, Bash script)

Operator — task template (PythonOperator, BashOperator, PostgresOperator, etc.)

Scheduler — parses DAGs, triggers runs based on schedule

Executor — runs tasks (LocalExecutor, CeleryExecutor, KubernetesExecutor)

XCom — cross-task communication (small values only — avoid for large data)

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
def extract(): ...
def transform(): ...
def load(): ...with DAG("my_etl", start_date=datetime(2024, 1, 1), schedule="@daily") as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2 >> t3

Idempotency

A pipeline is idempotent if running it multiple times produces the same result as running it once. This is critical for safe retries.

Patterns for idempotency:

Overwrite partitions instead of appending: INSERT OVERWRITE PARTITION (dt='2024-01-01')

Upsert / MERGE on a natural key

Truncate + reload for small dimension tables

Deduplication step before insert

Common Architecture Patterns

Lambda Architecture

Batch layer: reprocesses all historical data periodically (ground truth)

Speed layer: real-time stream processing for low-latency approximations

Serving layer: merges batch and speed views

Drawback: two codebases to maintain (batch + streaming logic).

Kappa Architecture

Single stream processing system handles both real-time and historical reprocessing

Simpler: one codebase, but requires replayable message log (Kafka)

Medallion Architecture (common in Databricks/Delta Lake)

Bronze: raw ingested data (no transformation)

Silver: cleaned, validated, deduplicated

Gold: business-ready aggregates, dimensional models

Fault Tolerance Checklist

[ ] Tasks are idempotent (safe to retry)

[ ] Dead letter queue / alerting for failed messages

[ ] Exactly-once or at-least-once semantics documented

[ ] Backfill strategy defined

[ ] Monitoring and alerting on lag, failure rate, data freshness

[ ] Schema evolution strategy (backward compatible changes)

Pipeline Architecture

Batch vs Streaming

Apache Airflow

Idempotency

Common Architecture Patterns

Lambda Architecture

Kappa Architecture

Medallion Architecture (common in Databricks/Delta Lake)

Fault Tolerance Checklist

Resources