Pipeline Architecture
Batch vs streaming, orchestration with Airflow, idempotency, and fault tolerance.
Pipeline Architecture
Batch vs Streaming
| Batch | Streaming | |
|---|---|---|
| Latency | Minutes to hours | Sub-second to seconds |
| Throughput | Very high | Medium |
| Complexity | Lower | Higher |
| Cost | Lower (scheduled) | Higher (always-on infra) |
| Use cases | Nightly reports, ETL, ML training | Fraud detection, live dashboards, alerting |
Apache Airflow
Airflow is the most common data orchestration tool. Key concepts:
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetimedef extract(): ...
def transform(): ...
def load(): ...
with DAG("my_etl", start_date=datetime(2024, 1, 1), schedule="@daily") as dag:
t1 = PythonOperator(task_id="extract", python_callable=extract)
t2 = PythonOperator(task_id="transform", python_callable=transform)
t3 = PythonOperator(task_id="load", python_callable=load)
t1 >> t2 >> t3
Idempotency
A pipeline is idempotent if running it multiple times produces the same result as running it once. This is critical for safe retries.
Patterns for idempotency:
INSERT OVERWRITE PARTITION (dt='2024-01-01')Common Architecture Patterns
Lambda Architecture
Drawback: two codebases to maintain (batch + streaming logic).
Kappa Architecture
Medallion Architecture (common in Databricks/Delta Lake)
Fault Tolerance Checklist
Sign in to use the AI study buddy on this lesson.