CI/CD & Observability

GitHub Actions, deployment strategies, metrics/logs/traces, and alerting.

CI/CD Pipeline

Code push → Lint → Test → Build → Deploy to staging → Integration tests → Deploy to prod
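The stages above map naturally onto a GitHub Actions workflow. A minimal sketch (job names, `make` targets, and the deploy script are illustrative assumptions, not from this lesson):

```yaml
# Sketch: the pipeline stages as a GitHub Actions workflow.
name: ci-cd
on:
  push:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make lint              # Lint
      - run: make test              # Test

  build:
    needs: test                     # only runs if lint + test pass
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make build             # Build

  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - run: ./deploy.sh staging    # Deploy to staging
      - run: make integration-tests # Integration tests

  deploy-prod:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production         # can be configured to require manual approval
    steps:
      - run: ./deploy.sh prod       # Deploy to prod
```

The `needs:` chain is what enforces the ordering: a failure at any stage stops everything downstream.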

Deployment Strategies

Strategy   | How it works                                             | Risk    | Rollback speed
-----------|----------------------------------------------------------|---------|---------------
Rolling    | Replace pods gradually                                   | Medium  | Medium
Blue-Green | Two identical environments, switch traffic               | Low     | Instant
Canary     | Route 5% of traffic to new version, monitor, then scale  | Lowest  | Fast
Recreate   | Kill all old, start all new                              | Highest | Slow
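For the rolling strategy, Kubernetes exposes the pace of replacement directly on the Deployment. A sketch (resource names and image are illustrative):

```yaml
# Sketch: rolling-update knobs on a Kubernetes Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2          # at most 2 pods above the desired count during rollout
      maxUnavailable: 1    # at most 1 pod below the desired count
  selector:
    matchLabels: {app: web}
  template:
    metadata:
      labels: {app: web}
    spec:
      containers:
        - name: web
          image: registry.example.com/web:v2   # illustrative image
```

Tightening `maxUnavailable` to 0 trades rollout speed for zero capacity loss; `Recreate` is the same spec with `strategy.type: Recreate` and none of these guarantees.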

The Three Pillars of Observability

Metrics (Prometheus / Datadog)

Numeric measurements over time. The four golden signals:
  • Latency: how long requests take (p50, p95, p99)
  • Traffic: requests per second
  • Errors: error rate (5xx / total)
  • Saturation: how full your resources are (CPU, memory, disk)
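The latency and error signals above can be computed directly from raw request records. A self-contained sketch (the sample data, field layout, and nearest-rank percentile are illustrative assumptions):

```python
# Sketch: computing latency percentiles and error rate from request records.

def percentile(sorted_vals, p):
    """Nearest-rank percentile over a pre-sorted list."""
    k = max(0, min(len(sorted_vals) - 1, round(p / 100 * len(sorted_vals)) - 1))
    return sorted_vals[k]

# (latency_ms, http_status) -- made-up sample traffic
requests = [
    (12, 200), (18, 200), (25, 200), (40, 200), (55, 200),
    (70, 200), (90, 200), (120, 500), (200, 200), (950, 502),
]

latencies = sorted(ms for ms, _ in requests)
p50 = percentile(latencies, 50)   # typical request
p95 = percentile(latencies, 95)   # tail latency
p99 = percentile(latencies, 99)

errors = sum(1 for _, status in requests if status >= 500)
error_rate = errors / len(requests)   # Errors signal: 5xx / total
```

In production these aggregations are done by the metrics backend (Prometheus histograms, for example), not by hand; the point is only what the signals measure.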
Logs (ELK / Loki)

Structured events with context:

{"level":"error","msg":"payment failed","user_id":"123","error":"card_declined","ts":"2024-01-01T12:00:00Z"}
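Log lines like this one can be emitted with the standard library alone. A minimal sketch, assuming a custom JSON formatter and a `ctx` field for extra context (both illustrative, not a real library convention):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Minimal structured-log formatter (sketch; field names are illustrative)."""
    def format(self, record):
        entry = {
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
        }
        entry.update(getattr(record, "ctx", {}))  # merge per-call context fields
        return json.dumps(entry)

logger = logging.getLogger("payments")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

# `extra` attaches attributes to the record; our formatter flattens `ctx` into the JSON.
logger.error("payment failed", extra={"ctx": {"user_id": "123", "error": "card_declined"}})
```

The payoff is that ELK or Loki can then filter and aggregate on `user_id` or `error` as fields, instead of grepping free text.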

Traces (Jaeger / OpenTelemetry)

Follow a single request across multiple services. Each span has: service name, operation, duration, parent span.
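Those span fields can be made concrete with a toy model. This is not the OpenTelemetry API, just a sketch of the data a tracer records:

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """Toy span; real tracers (Jaeger, OpenTelemetry) carry more, but these are the core fields."""
    service: str                  # which service did the work
    operation: str                # what it was doing
    trace_id: str                 # shared by every span in one request
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])
    start: float = field(default_factory=time.monotonic)
    duration_ms: float = 0.0

    def finish(self):
        self.duration_ms = (time.monotonic() - self.start) * 1000

# One request crossing two services: the spans share a trace_id,
# and the child points at its parent via parent_id.
trace_id = uuid.uuid4().hex
root = Span("api-gateway", "POST /checkout", trace_id)
child = Span("payment-svc", "charge_card", trace_id, parent_id=root.span_id)
child.finish()
root.finish()
```

A trace viewer reassembles the parent/child links into the familiar waterfall, which is how you spot which hop ate the latency.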


SLOs, SLIs, SLAs

  • SLI (Service Level Indicator): the measurement (e.g. "99.2% of requests complete in < 200ms")
  • SLO (Service Level Objective): the target (e.g. "99.9% availability per month")
  • SLA (Service Level Agreement): the contract (e.g. "if we drop below 99.9%, you get credits")
  • Error budget: 100% - SLO = how much downtime you're allowed (99.9% = 43 min/month)
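The error-budget arithmetic above is worth making explicit (a 30-day month is assumed for the calculation):

```python
# Error budget = (1 - SLO) * total time in the window.
slo = 0.999                   # 99.9% availability objective
month_minutes = 30 * 24 * 60  # 43,200 minutes in a 30-day month

budget_minutes = (1 - slo) * month_minutes
print(f"{budget_minutes:.1f} min of allowed downtime per month")  # → 43.2 min
```

Burning the budget faster than the month elapses is the usual trigger to freeze risky deploys until reliability recovers.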