CI/CD & Observability

GitHub Actions, deployment strategies, metrics/logs/traces, and alerting.

CI/CD Pipeline

Code push → Lint → Test → Build → Deploy to staging → Integration tests → Deploy to prod
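The stages above map naturally onto a GitHub Actions workflow. A minimal sketch (job names, `make` targets, and the deploy script are illustrative assumptions, not from this lesson):

```yaml
# Sketch: the pipeline stages as a GitHub Actions workflow.
name: ci-cd
on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make lint              # Lint
      - run: make test              # Test

  build:
    needs: test                     # only runs if lint + test pass
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make build             # Build

  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - run: ./deploy.sh staging    # Deploy to staging
      - run: make integration-tests # Integration tests

  deploy-prod:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production         # can be configured to require manual approval
    steps:
      - run: ./deploy.sh prod       # Deploy to prod
```

The `needs:` chain is what enforces the ordering: a failure at any stage stops everything downstream.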

Deployment Strategies

Strategy   | How it works                                             | Risk    | Rollback speed
-----------|----------------------------------------------------------|---------|---------------
Rolling    | Replace pods gradually                                   | Medium  | Medium
Blue-Green | Two identical environments, switch traffic               | Low     | Instant
Canary     | Route 5% of traffic to new version, monitor, then scale  | Lowest  | Fast
Recreate   | Kill all old, start all new                              | Highest | Slow
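For the rolling strategy, Kubernetes exposes the pace of replacement directly on the Deployment. A sketch (resource names and image are illustrative):

```yaml
# Sketch: rolling-update knobs on a Kubernetes Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2          # at most 2 pods above the desired count during rollout
      maxUnavailable: 1    # at most 1 pod below the desired count
  selector:
    matchLabels: {app: web}
  template:
    metadata:
      labels: {app: web}
    spec:
      containers:
        - name: web
          image: registry.example.com/web:v2   # illustrative image
```

Tightening `maxUnavailable` to 0 trades rollout speed for zero capacity loss; `Recreate` is the same spec with `strategy.type: Recreate` and none of these guarantees.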

The Three Pillars of Observability

Metrics (Prometheus / Datadog)

Numeric measurements over time. The four golden signals:
  • Latency: how long requests take (p50, p95, p99)
  • Traffic: requests per second
  • Errors: error rate (5xx / total)
  • Saturation: how full your resources are (CPU, memory, disk)
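The latency and error signals above can be computed directly from raw request records. A self-contained sketch (the sample data, field layout, and nearest-rank percentile are illustrative assumptions):

```python
# Sketch: computing latency percentiles and error rate from request records.

def percentile(sorted_vals, p):
    """Nearest-rank percentile over a pre-sorted list."""
    k = max(0, min(len(sorted_vals) - 1, round(p / 100 * len(sorted_vals)) - 1))
    return sorted_vals[k]

# (latency_ms, http_status) -- made-up sample traffic
requests = [
    (12, 200), (18, 200), (25, 200), (40, 200), (55, 200),
    (70, 200), (90, 200), (120, 500), (200, 200), (950, 502),
]

latencies = sorted(ms for ms, _ in requests)
p50 = percentile(latencies, 50)   # typical request
p95 = percentile(latencies, 95)   # tail latency
p99 = percentile(latencies, 99)

errors = sum(1 for _, status in requests if status >= 500)
error_rate = errors / len(requests)   # Errors signal: 5xx / total
```

In production these aggregations are done by the metrics backend (Prometheus histograms, for example), not by hand; the point is only what the signals measure.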
Logs (ELK / Loki)

Structured events with context:

{"level":"error","msg":"payment failed","user_id":"123","error":"card_declined","ts":"2024-01-01T12:00:00Z"}
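Log lines like this one can be emitted with the standard library alone. A minimal sketch, assuming a custom JSON formatter and a `ctx` field for extra context (both illustrative, not a real library convention):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Minimal structured-log formatter (sketch; field names are illustrative)."""
    def format(self, record):
        entry = {
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
        }
        entry.update(getattr(record, "ctx", {}))  # merge per-call context fields
        return json.dumps(entry)

logger = logging.getLogger("payments")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

# `extra` attaches attributes to the record; our formatter flattens `ctx` into the JSON.
logger.error("payment failed", extra={"ctx": {"user_id": "123", "error": "card_declined"}})
```

The payoff is that ELK or Loki can then filter and aggregate on `user_id` or `error` as fields, instead of grepping free text.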

Traces (Jaeger / OpenTelemetry)

Follow a single request across multiple services. Each span has: service name, operation, duration, parent span.
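Those span fields can be made concrete with a toy model. This is not the OpenTelemetry API, just a sketch of the data a tracer records:

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """Toy span; real tracers (Jaeger, OpenTelemetry) carry more, but these are the core fields."""
    service: str                  # which service did the work
    operation: str                # what it was doing
    trace_id: str                 # shared by every span in one request
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])
    start: float = field(default_factory=time.monotonic)
    duration_ms: float = 0.0

    def finish(self):
        self.duration_ms = (time.monotonic() - self.start) * 1000

# One request crossing two services: the spans share a trace_id,
# and the child points at its parent via parent_id.
trace_id = uuid.uuid4().hex
root = Span("api-gateway", "POST /checkout", trace_id)
child = Span("payment-svc", "charge_card", trace_id, parent_id=root.span_id)
child.finish()
root.finish()
```

A trace viewer reassembles the parent/child links into the familiar waterfall, which is how you spot which hop ate the latency.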


SLOs, SLIs, SLAs

  • SLI (Service Level Indicator): the measurement (e.g. "99.2% of requests complete in < 200ms")
  • SLO (Service Level Objective): the target (e.g. "99.9% availability per month")
  • SLA (Service Level Agreement): the contract (e.g. "if we drop below 99.9%, you get credits")
  • Error budget: 100% - SLO = how much downtime you're allowed (99.9% = 43 min/month)
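The error-budget arithmetic above is worth making explicit (a 30-day month is assumed for the calculation):

```python
# Error budget = (1 - SLO) * total time in the window.
slo = 0.999                   # 99.9% availability objective
month_minutes = 30 * 24 * 60  # 43,200 minutes in a 30-day month

budget_minutes = (1 - slo) * month_minutes
print(f"{budget_minutes:.1f} min of allowed downtime per month")  # → 43.2 min
```

Burning the budget faster than the month elapses is the usual trigger to freeze risky deploys until reliability recovers.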