System Design: Monitoring Platform

Design a metrics collection and alerting system like Datadog.

The Problem

Design a metrics collection and alerting system for a company with 500 microservices. Each service emits 100 metrics at 10-second intervals. Users need dashboards and configurable alerts.

Scale Math

  • 500 services x 100 metrics x 6/min = 300K data points/min = 5K/sec
  • 30 days of retention at 10s granularity = ~13B data points (upper bound, before any downsampling)
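
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the estimate: the constant names are illustrative, and it assumes every service emits all 100 metrics on every 10-second tick.

```python
# Inputs taken from the problem statement.
SERVICES = 500
METRICS_PER_SERVICE = 100
INTERVAL_S = 10
RETENTION_DAYS = 30

# Each metric reports 60 / 10 = 6 times per minute.
points_per_min = SERVICES * METRICS_PER_SERVICE * (60 // INTERVAL_S)
points_per_sec = points_per_min // 60

# Raw points held if nothing is downsampled for the full retention window.
points_retained = points_per_sec * 86_400 * RETENTION_DAYS

print(points_per_min)   # 300000
print(points_per_sec)   # 5000
print(points_retained)  # 12960000000
```

The retained figure of 12.96B is where the ~13B estimate comes from.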

Architecture

    Services → Agent (StatsD/OTel) → Kafka → Ingestion Workers
                                                  ↓
                                         Time-Series DB (InfluxDB / TimescaleDB)
                                                  ↓
                                         Query API → Dashboard UI
                                                  ↓
                                         Alert Evaluator → PagerDuty / Slack
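
To make the Agent → Ingestion Workers hop concrete, here is a minimal sketch of how a worker might turn one StatsD-style line into a labelled data point. The `DataPoint` type and `parse_statsd_line` function are hypothetical names, not part of any library. The `name:value|type` shape is the basic StatsD line format, and the `|#key:value` tag suffix is assumed to follow the DogStatsD-style extension.

```python
from dataclasses import dataclass, field


@dataclass
class DataPoint:
    """One parsed metric sample (hypothetical shape for illustration)."""
    name: str
    value: float
    kind: str
    labels: dict[str, str] = field(default_factory=dict)


def parse_statsd_line(line: str) -> DataPoint:
    """Parse 'name:value|type' with an optional '|#k:v,k:v' tag section."""
    head, *rest = line.strip().split("|")
    name, _, raw_value = head.partition(":")
    if not name or not raw_value or not rest:
        raise ValueError(f"malformed StatsD line: {line!r}")
    kind = rest[0]
    labels: dict[str, str] = {}
    for part in rest[1:]:
        # Assumption: tags use the DogStatsD-style '#key:value,...' section.
        if part.startswith("#"):
            for tag in part[1:].split(","):
                key, _, val = tag.partition(":")
                labels[key] = val
    return DataPoint(name, float(raw_value), kind, labels)


point = parse_statsd_line("api.latency:12.5|ms|#service:checkout,team:payments")
print(point.name, point.value, point.labels)
```

In a real pipeline the worker would read these lines from Kafka and batch the writes to the time-series DB. Both steps are left out here.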

Key Decisions

  • Time-series DB over relational: optimised for append-heavy writes and time-range queries
  • Downsampling: keep 10s data for 7 days, 1min for 30 days, 1hr for 1 year
  • Alert evaluation: pull-based (query DB every 30s; simpler, but adds up to 30s of detection delay) vs push-based (stream processing on ingest; lower latency, more moving parts)
  • Multi-tenancy: separate data by team/service using labels, not separate databases
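
Two of the decisions above, downsampling and pull-based alert evaluation, can be sketched in plain Python. The function names, the `(timestamp, value)` tuple layout, and the choice of averaging are assumptions made for illustration. Averaging suits gauges, while counters would more likely be summed. A real deployment would normally lean on the time-series DB's own rollup and query features.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, bucket_s):
    """Average (timestamp, value) pairs into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_s].append(value)
    return [(start, mean(vals)) for start, vals in sorted(buckets.items())]


def should_fire(points, now, window_s, threshold):
    """Pull-based check: fire if the trailing-window mean exceeds threshold."""
    recent = [v for ts, v in points if now - window_s < ts <= now]
    return bool(recent) and mean(recent) > threshold


# 10s-granularity samples (timestamps in seconds) rolled up to 1-minute points.
raw = [(0, 1.0), (10, 3.0), (50, 5.0), (60, 7.0), (70, 9.0)]
print(downsample(raw, 60))  # [(0, 3.0), (60, 8.0)]

# An evaluator loop would call this every 30s with freshly queried data.
print(should_fire(raw, now=70, window_s=30, threshold=6.5))  # True
```

The evaluator only sees what the query returns, so its detection delay is bounded by the polling interval plus any ingestion lag.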