
System Design: Production RAG Pipeline

Design a production-grade RAG system: ingestion, indexing, retrieval, generation, evaluation, and monitoring.

The Problem

Design a system that lets customer support agents query a 50,000-document internal knowledge base in natural language and get accurate, cited answers. Latency must be < 3 seconds P95. The knowledge base updates daily.


Requirements Clarification

  • Accuracy requirement: Can the system be wrong sometimes, or must it be near-perfect?
  • Latency: < 3s P95 — this constrains which models and retrieval strategies we can use
  • Volume: How many queries/day? (determines infrastructure sizing)
  • Privacy: Can documents leave your infrastructure? (managed API vs self-hosted)
  • Update frequency: Daily — incremental indexing vs full re-index

Architecture

    Ingestion Pipeline (daily batch):
      Raw docs → Parser → Chunker → Embedding model → pgvector (upsert)

    Query Pipeline (real-time):
      Query → Response cache check (Redis); on miss → Embed → Vector search (top-20) → Re-rank (top-5) → Assemble prompt → LLM → Answer + citations → Cache store


    Ingestion Pipeline

    Parser: Extract clean text from PDF/DOCX/HTML. Libraries: Apache Tika, pdfminer, BeautifulSoup.

    Chunker: Recursive chunking with 512-token chunks, 64-token overlap. Preserve document metadata (source, section, date).
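The chunk/overlap arithmetic can be sketched as below. The function name is illustrative, and a plain token list stands in for real tokeniser output (in production you would tokenise with the same family as the embedding model):

```python
def chunk_tokens(tokens: list, size: int = 512, overlap: int = 64) -> list:
    """Split a token sequence into fixed-size chunks with overlap.

    Consecutive chunks share `overlap` tokens, so a sentence cut at a
    chunk boundary still appears whole in at least one chunk.
    """
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]
```

Each resulting chunk would then carry the document metadata (source, section, date) alongside its text.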

    Embedder: text-embedding-3-small (OpenAI) — fast, cheap, 1536-dim. Batch embed 100 chunks/request.
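Batching 100 chunks per embedding request is a simple slicing loop; a minimal helper (name illustrative):

```python
def batches(items: list, size: int = 100):
    """Yield fixed-size slices so each embedding API call carries up to `size` chunks."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```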

    Storage: pgvector extension on Postgres. Stores: chunk_id, doc_id, chunk_text, embedding (vector), metadata (JSONB).

    Update strategy: Daily upsert on doc_id + chunk_index. Delete chunks for removed documents.
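The upsert-plus-delete logic can be modelled with an in-memory store keyed by (doc_id, chunk_index); with pgvector this would be an `INSERT ... ON CONFLICT` plus a `DELETE` for trailing indices. A sketch under that assumption, with illustrative names:

```python
def sync_document(store: dict, doc_id: str, chunks: list) -> None:
    """Upsert a document's chunks and delete any that no longer exist.

    `store` maps (doc_id, chunk_index) -> chunk record, standing in for
    the pgvector table keyed on the same pair.
    """
    for i, chunk in enumerate(chunks):
        store[(doc_id, i)] = chunk          # upsert
    stale = [key for key in store
             if key[0] == doc_id and key[1] >= len(chunks)]
    for key in stale:                       # doc shrank or was re-chunked
        del store[key]
```

Removing a whole document is just the `chunks=[]` case.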


    Query Pipeline

  • Embed query — same model as documents (critical: model mismatch breaks retrieval)
  • Hybrid search — vector similarity (semantic) + BM25 (keyword), merge with RRF
  • Re-rank — cross-encoder (Cohere Rerank / local model) narrows top-20 → top-5
  • Prompt assembly — system prompt + top-5 chunks (with source metadata) + query
  • LLM call — GPT-4o or Claude 3.5 (strong instruction following, low hallucination rate)
  • Citation extraction — structured output: answer + list of source document IDs used
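The RRF merge in the hybrid-search step is a few lines: each result list contributes 1/(k + rank) per document, and the merged order is by summed score (k = 60 is the conventional constant):

```python
def rrf_merge(rankings: list, k: int = 60) -> list:
    """Merge ranked doc-id lists with Reciprocal Rank Fusion."""
    scores: dict = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Here `rrf_merge([vector_ids, bm25_ids])[:20]` would feed the re-ranker.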

    Latency Budget (< 3 s P95)

    Step                  Budget
    Query embedding       50 ms
    Vector search         30 ms
    Re-ranking            200 ms
    LLM generation        1.5–2 s
    Network + overhead    200 ms
    Total                 ~2–2.5 s (headroom under the 3 s P95 target)

    Optimisations: stream the LLM response (user sees tokens immediately), Redis cache on common queries (cache hit → ~50 ms), async embedding.
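A sketch of the cache path, assuming answers are keyed by a hash of the normalised query; the dict stands in for Redis, where you would use `set(key, value, ex=ttl)` to get expiry:

```python
import hashlib

def cache_key(query: str) -> str:
    """Normalise whitespace and case, then hash into a stable key."""
    normalised = " ".join(query.lower().split())
    return "rag:answer:" + hashlib.sha256(normalised.encode()).hexdigest()

def answer_with_cache(query: str, cache: dict, generate) -> str:
    """Serve from cache on hit; otherwise run the full pipeline and store."""
    key = cache_key(query)
    if key not in cache:
        cache[key] = generate(query)  # full embed → retrieve → rerank → LLM path
    return cache[key]
```

Normalising before hashing means trivially different phrasings ("What is RAG?" vs " what is rag ") hit the same entry.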


    Evaluation & Monitoring

    Offline eval: Ragas metrics on a golden test set (100 human-labelled query-answer pairs).

  • Faithfulness > 0.9 (answers must be grounded in retrieved context)
  • Context Recall > 0.8 (retrieving relevant chunks)
  • Answer Relevance > 0.85
    Online monitoring:

  • P50/P95/P99 latency per stage (embed / retrieve / rerank / LLM)
  • Hallucination rate (LLM judge or human sampling)
  • User thumbs up/down signal → feeds back into eval set
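Per-stage P95 over a window of latency samples is a nearest-rank percentile; a minimal helper for the monitoring dashboard (illustrative):

```python
import math

def percentile(samples, p: float) -> float:
    """Nearest-rank percentile, e.g. percentile(latencies_ms, 95) for P95."""
    ordered = sorted(samples)
    idx = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[idx]
```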

    Failure Modes & Mitigations

    Failure                            Mitigation
    Retrieval misses relevant chunk    Hybrid search + re-ranking; check chunk size
    LLM hallucinates beyond context    Faithfulness check; constrain with structured output
    Latency spike from LLM             Response caching; streaming; fallback to faster model
    Knowledge base stale               Daily freshness check; source metadata with last_updated
    Prompt injection via document      Sanitise document content; don't give the LLM privileged actions
