System Design: Production RAG Pipeline
Design a production-grade RAG system: ingestion, indexing, retrieval, generation, evaluation, and monitoring.
System Design: Production RAG Pipeline
The Problem
Design a system that lets customer support agents query a 50,000-document internal knowledge base in natural language and get accurate, cited answers. Latency must be < 3 seconds P95. The knowledge base updates daily.
Requirements Clarification
Architecture
Ingestion Pipeline (daily batch):
Raw docs → Parser → Chunker → Embedding model → pgvector (upsert)Query Pipeline (real-time):
Query → Embed → Vector search (top-20) → Re-rank (top-5)
→ Assemble prompt → LLM → Answer + citations
→ Response cache check (Redis)
Ingestion Pipeline
Parser: Extract clean text from PDF/DOCX/HTML. Libraries: Apache Tika, pdfminer, BeautifulSoup.
Chunker: Recursive chunking with 512-token chunks, 64-token overlap. Preserve document metadata (source, section, date).
Embedder: text-embedding-3-small (OpenAI) — fast, cheap, 1536-dim. Batch embed 100 chunks/request.
Storage: pgvector extension on Postgres. Stores: chunk_id, doc_id, chunk_text, embedding (vector), metadata (JSONB).
Update strategy: Daily upsert on doc_id + chunk_index. Delete chunks for removed documents.
Query Pipeline
Latency Budget (< 3s P95)
| Step | Budget |
|---|---|
| Query embedding | 50ms |
| Vector search | 30ms |
| Re-ranking | 200ms |
| LLM generation | 1.5-2s |
| Network + overhead | 200ms |
| Total | ~2s |
Evaluation & Monitoring
Offline eval: Ragas metrics on a golden test set (100 human-labelled query-answer pairs).
Online monitoring:
Failure Modes & Mitigations
| Failure | Mitigation |
|---|---|
| Retrieval misses relevant chunk | Hybrid search + re-ranking; check chunk size |
| LLM hallucinates beyond context | Faithfulness check; constrain with structured output |
| Latency spike from LLM | Response caching; streaming; fallback to faster model |
| Knowledge base stale | Daily freshness check; source metadata with last_updated |
| Prompt injection via document | Sanitise document content; don't give LLM privileged actions |
Your design notes
Work through this problem yourself before reading the walkthrough above. Your notes are stored locally and not submitted anywhere — only sent to the AI when you click Review.