
System Design: Production RAG Pipeline

Design a production-grade RAG system: ingestion, indexing, retrieval, generation, evaluation, and monitoring.

The Problem

Design a system that lets customer support agents query a 50,000-document internal knowledge base in natural language and get accurate, cited answers. Latency must be < 3 seconds P95. The knowledge base updates daily.


Requirements Clarification

  • Accuracy requirement: Can the system be wrong sometimes, or must it be near-perfect?
  • Latency: < 3s P95 — this constrains which models and retrieval strategies we can use
  • Volume: How many queries/day? (determines infrastructure sizing)
  • Privacy: Can documents leave your infrastructure? (managed API vs self-hosted)
  • Update frequency: Daily — incremental indexing vs full re-index

Architecture

    Ingestion Pipeline (daily batch):
      Raw docs → Parser → Chunker → Embedding model → pgvector (upsert)

    Query Pipeline (real-time):
      Query → Response cache check (Redis); on miss → Embed → Vector search (top-20) → Re-rank (top-5) → Assemble prompt → LLM → Answer + citations → Cache store


    Ingestion Pipeline

    Parser: Extract clean text from PDF/DOCX/HTML. Libraries: Apache Tika, pdfminer, BeautifulSoup.

    Chunker: Recursive chunking with 512-token chunks, 64-token overlap. Preserve document metadata (source, section, date).
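The chunk/overlap arithmetic can be sketched as below. The function name is illustrative, and a plain token list stands in for real tokeniser output (in production you would tokenise with the same family as the embedding model):

```python
def chunk_tokens(tokens: list, size: int = 512, overlap: int = 64) -> list:
    """Split a token sequence into fixed-size chunks with overlap.

    Consecutive chunks share `overlap` tokens, so a sentence cut at a
    chunk boundary still appears whole in at least one chunk.
    """
    step = size - overlap
    return [tokens[i:i + size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]
```

Each resulting chunk would then carry the document metadata (source, section, date) alongside its text.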

    Embedder: text-embedding-3-small (OpenAI) — fast, cheap, 1536-dim. Batch embed 100 chunks/request.
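Batching 100 chunks per embedding request is a simple slicing loop; a minimal helper (name illustrative):

```python
def batches(items: list, size: int = 100):
    """Yield fixed-size slices so each embedding API call carries up to `size` chunks."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```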

    Storage: pgvector extension on Postgres. Stores: chunk_id, doc_id, chunk_text, embedding (vector), metadata (JSONB).

    Update strategy: Daily upsert on doc_id + chunk_index. Delete chunks for removed documents.
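The upsert-plus-delete logic can be modelled with an in-memory store keyed by (doc_id, chunk_index); with pgvector this would be an `INSERT ... ON CONFLICT` plus a `DELETE` for trailing indices. A sketch under that assumption, with illustrative names:

```python
def sync_document(store: dict, doc_id: str, chunks: list) -> None:
    """Upsert a document's chunks and delete any that no longer exist.

    `store` maps (doc_id, chunk_index) -> chunk record, standing in for
    the pgvector table keyed on the same pair.
    """
    for i, chunk in enumerate(chunks):
        store[(doc_id, i)] = chunk          # upsert
    stale = [key for key in store
             if key[0] == doc_id and key[1] >= len(chunks)]
    for key in stale:                       # doc shrank or was re-chunked
        del store[key]
```

Removing a whole document is just the `chunks=[]` case.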


    Query Pipeline

  • Embed query — same model as documents (critical: model mismatch breaks retrieval)
  • Hybrid search — vector similarity (semantic) + BM25 (keyword), merge with RRF
  • Re-rank — cross-encoder (Cohere Rerank / local model) narrows top-20 → top-5
  • Prompt assembly — system prompt + top-5 chunks (with source metadata) + query
  • LLM call — GPT-4o or Claude 3.5 (strong instruction following, low hallucination rate)
  • Citation extraction — structured output: answer + list of source document IDs used
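The RRF merge in the hybrid-search step is a few lines: each result list contributes 1/(k + rank) per document, and the merged order is by summed score (k = 60 is the conventional constant):

```python
def rrf_merge(rankings: list, k: int = 60) -> list:
    """Merge ranked doc-id lists with Reciprocal Rank Fusion."""
    scores: dict = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Here `rrf_merge([vector_ids, bm25_ids])[:20]` would feed the re-ranker.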

    Latency Budget (< 3 s P95)

    Step                  Budget
    Query embedding       50 ms
    Vector search         30 ms
    Re-ranking            200 ms
    LLM generation        1.5–2 s
    Network + overhead    200 ms
    Total                 ~2–2.5 s (headroom under the 3 s P95 target)

    Optimisations: stream the LLM response (user sees tokens immediately), Redis cache on common queries (cache hit → ~50 ms), async embedding.
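A sketch of the cache path, assuming answers are keyed by a hash of the normalised query; the dict stands in for Redis, where you would use `set(key, value, ex=ttl)` to get expiry:

```python
import hashlib

def cache_key(query: str) -> str:
    """Normalise whitespace and case, then hash into a stable key."""
    normalised = " ".join(query.lower().split())
    return "rag:answer:" + hashlib.sha256(normalised.encode()).hexdigest()

def answer_with_cache(query: str, cache: dict, generate) -> str:
    """Serve from cache on hit; otherwise run the full pipeline and store."""
    key = cache_key(query)
    if key not in cache:
        cache[key] = generate(query)  # full embed → retrieve → rerank → LLM path
    return cache[key]
```

Normalising before hashing means trivially different phrasings ("What is RAG?" vs " what is rag ") hit the same entry.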


    Evaluation & Monitoring

    Offline eval: Ragas metrics on a golden test set (100 human-labelled query-answer pairs).

  • Faithfulness > 0.9 (answers must be grounded in retrieved context)
  • Context Recall > 0.8 (retrieving relevant chunks)
  • Answer Relevance > 0.85
    Online monitoring:

  • P50/P95/P99 latency per stage (embed / retrieve / rerank / LLM)
  • Hallucination rate (LLM judge or human sampling)
  • User thumbs up/down signal → feeds back into eval set
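Per-stage P95 over a window of latency samples is a nearest-rank percentile; a minimal helper for the monitoring dashboard (illustrative):

```python
import math

def percentile(samples, p: float) -> float:
    """Nearest-rank percentile, e.g. percentile(latencies_ms, 95) for P95."""
    ordered = sorted(samples)
    idx = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[idx]
```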

    Failure Modes & Mitigations

    Failure                            Mitigation
    Retrieval misses relevant chunk    Hybrid search + re-ranking; check chunk size
    LLM hallucinates beyond context    Faithfulness check; constrain with structured output
    Latency spike from LLM             Response caching; streaming; fallback to faster model
    Knowledge base stale               Daily freshness check; source metadata with last_updated
    Prompt injection via document      Sanitise document content; don't give the LLM privileged actions
