RAG Systems
Retrieval-Augmented Generation: chunking, embeddings, vector stores, retrieval strategies, and evaluation.
Retrieval-Augmented Generation (RAG)
Why RAG?
LLMs have a knowledge cutoff and don't know your private data. RAG solves this by retrieving relevant documents at query time and injecting them into the model's prompt, so answers are grounded in up-to-date, domain-specific context.
This separates retrieval (what facts to look up) from generation (how to express them).
RAG Architecture
```
Query → [Embed query] → Vector search → Top-k chunks
                          ↓
        [Assemble prompt: system + chunks + query]
                          ↓
                   LLM → Answer
```

Document Processing: Chunking
Documents must be chunked before embedding. Chunking strategy is critical: chunks must be small enough to embed precisely, yet large enough to carry the context needed to answer a question.
| Strategy | Description | Use when |
|---|---|---|
| Fixed-size | Split every N tokens | Simple baseline |
| Sentence | Split on sentence boundaries | Prose text |
| Recursive | Split on \n\n → \n → space, progressively | General purpose |
| Semantic | Split when embedding similarity drops | Dense technical docs |
| Document-aware | Respect headers/sections | Markdown, HTML, PDFs |
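The recursive strategy from the table can be sketched in a few lines. This is a minimal illustration (function name and defaults are our own, not from any particular library): split on the coarsest separator first, recurse into oversized pieces with finer separators, then greedily merge adjacent pieces back up to the size limit.

```python
def recursive_split(text, max_len=200, separators=("\n\n", "\n", " ")):
    """Recursively split text on \n\n → \n → space, then merge pieces up to max_len."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard character split.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    pieces = []
    for piece in text.split(sep):
        if len(piece) > max_len:
            pieces.extend(recursive_split(piece, max_len, rest))
        else:
            pieces.append(piece)
    # Greedily merge adjacent pieces so chunks approach (but never exceed) max_len.
    chunks, current = [], ""
    for piece in pieces:
        candidate = (current + sep + piece) if current else piece
        if len(candidate) <= max_len:
            current = candidate
        else:
            if current.strip():
                chunks.append(current)
            current = piece
    if current.strip():
        chunks.append(current)
    return chunks
```

Production splitters (e.g. in LangChain) follow the same shape but also handle chunk overlap and token-based lengths rather than characters.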
Embeddings
Embeddings are dense vector representations of text. Semantically similar text → similar vectors (small cosine distance).
Popular embedding models include OpenAI's text-embedding-3-small/-large, Cohere's embed models, and open-source options such as E5, BGE, and all-MiniLM-L6-v2 (Sentence-Transformers).
Embed both documents (at index time) and queries (at query time) with the same model.
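The "similar vectors" claim is just cosine similarity over the embedding space. A minimal sketch, with hand-made stand-in vectors (real ones come from an embedding model):

```python
import math

def cosine_similarity(a, b):
    """cos(a, b) = dot(a, b) / (||a|| * ||b||); 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Illustrative 3-d vectors; real embeddings have hundreds or thousands of dimensions.
doc_vec   = [0.8, 0.1, 0.3]
query_vec = [0.7, 0.2, 0.3]   # semantically close to doc_vec
unrelated = [-0.1, 0.9, -0.4]
```

Cosine distance is simply `1 - cosine_similarity`, so "small cosine distance" and "high cosine similarity" are the same statement.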
Vector Stores
Store embeddings + metadata for fast approximate nearest-neighbour (ANN) search.
| Store | Best for |
|---|---|
| pgvector | Already using Postgres; production simplicity |
| Pinecone | Managed, large scale, no infra |
| Weaviate | Hybrid search (vector + keyword) |
| Qdrant | Fast, self-hosted, rich filtering |
| ChromaDB | Local dev and prototyping |
| Faiss | In-memory, CPU/GPU, research |
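All of these stores implement the same core operation. A toy version (class and method names are ours) does it exactly; the real stores approximate it at scale with ANN indexes such as HNSW or IVF:

```python
import heapq
import math

class ToyVectorStore:
    """Exact nearest-neighbour search over (vector, metadata) pairs."""

    def __init__(self):
        self.items = []  # list of (vector, metadata)

    def add(self, vector, metadata):
        self.items.append((vector, metadata))

    def search(self, query, k=3):
        """Return the k stored items with highest cosine similarity to the query."""
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.sqrt(sum(x * x for x in a)) *
                          math.sqrt(sum(x * x for x in b)))
        scored = [(cos(query, v), meta) for v, meta in self.items]
        return heapq.nlargest(k, scored, key=lambda pair: pair[0])

store = ToyVectorStore()
store.add([1.0, 0.0], {"id": "a"})
store.add([0.0, 1.0], {"id": "b"})
store.add([0.9, 0.1], {"id": "c"})
results = store.search([1.0, 0.0], k=2)   # returns "a" and "c", not "b"
```

The brute-force scan is O(n) per query, which is fine for thousands of vectors; ANN indexes trade a little recall for sublinear query time at millions of vectors.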
Retrieval Strategies
- Dense retrieval: embedding similarity (handles semantic matches, synonyms)
- Sparse retrieval: BM25/TF-IDF (handles exact keyword matches, IDs, names)
- Hybrid: combine both, re-rank with a cross-encoder (best of both worlds)
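One common way to combine dense and sparse result lists is reciprocal rank fusion (RRF): each document's fused score is the sum of 1/(k + rank) over the lists it appears in. A minimal sketch, with illustrative document IDs:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each doc scores sum(1 / (k + rank)) across all lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d3", "d1", "d2"]   # ranked by embedding similarity (illustrative)
sparse = ["d1", "d4", "d3"]   # ranked by BM25 (illustrative)
fused = reciprocal_rank_fusion([dense, sparse])   # d1 wins: high in both lists
```

The constant k (60 is the value from the original RRF paper) damps the influence of top ranks so that a document ranked moderately well in both lists can beat one ranked first in only one.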
Re-ranking: after retrieving top-k candidates, use a cross-encoder to re-score chunk-query pairs more accurately. Cross-encoders are slower but more precise than bi-encoders.
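The re-ranking step itself is simple: re-score each (query, chunk) pair and keep the best. In the sketch below the scorer is naive token overlap, standing in for a real cross-encoder model (which would jointly encode each pair):

```python
def rerank(query, chunks, top_n=2):
    """Re-score chunks against the query and keep the top_n best."""
    q_tokens = set(query.lower().split())

    def score(chunk):
        # Jaccard token overlap: a crude stand-in for a cross-encoder's score.
        c_tokens = set(chunk.lower().split())
        return len(q_tokens & c_tokens) / len(q_tokens | c_tokens)

    return sorted(chunks, key=score, reverse=True)[:top_n]

candidates = ["the cat sat", "dogs bark loudly", "a cat and a dog"]
best = rerank("cat sat", candidates, top_n=2)
```

In production the `score` function is a cross-encoder forward pass, which is why re-ranking is applied only to the top-k candidates rather than the whole corpus.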
HyDE (Hypothetical Document Embeddings): generate a hypothetical answer to the query, embed it, then search with that embedding. Often outperforms embedding the raw query for abstract questions.
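The HyDE flow can be sketched end to end. Everything here is a toy stand-in: the "LLM" is a canned function, and the "embedding" is a bag-of-words count vector; only the control flow (generate → embed the answer → search) is the point.

```python
def bag_of_words(text, vocab):
    # Toy embedding: per-word counts; a real system uses a neural embedding model.
    tokens = text.lower().split()
    return [tokens.count(w) for w in vocab]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hyde_search(query, documents, generate_hypothetical, vocab):
    # HyDE: generate a hypothetical answer (normally an LLM call), embed it,
    # and search with that embedding instead of the raw query's.
    hypothetical = generate_hypothetical(query)
    h_vec = bag_of_words(hypothetical, vocab)
    return max(documents, key=lambda d: dot(h_vec, bag_of_words(d, vocab)))

# Canned "LLM" for illustration only.
fake_llm = lambda q: "paris is the capital of france"
vocab = ["paris", "capital", "france", "berlin", "germany"]
docs = ["paris is the capital of france", "berlin is the capital of germany"]
best = hyde_search("which city should I visit in france", docs, fake_llm, vocab)
```

The intuition: a hypothetical answer lives in the same region of embedding space as real answer documents, whereas an abstract question often does not.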
RAG Evaluation
| Metric | What it measures |
|---|---|
| Context Precision | Of retrieved chunks, what fraction were actually relevant? |
| Context Recall | Of all relevant chunks, what fraction did we retrieve? |
| Faithfulness | Is the answer grounded in the retrieved context? |
| Answer Relevance | Does the answer address the query? |
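The two retrieval metrics in the table are ordinary precision and recall over chunk sets. A minimal sketch with illustrative chunk IDs:

```python
def context_precision(retrieved, relevant):
    # Of the retrieved chunks, what fraction were actually relevant?
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(set(retrieved))

def context_recall(retrieved, relevant):
    # Of all relevant chunks, what fraction did we retrieve?
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(set(relevant))

retrieved = ["c1", "c2", "c3", "c4"]   # what the retriever returned (illustrative)
relevant  = ["c1", "c3", "c5"]         # ground-truth relevant chunks
```

Faithfulness and answer relevance have no closed-form definition; frameworks such as Ragas estimate them with an LLM judge.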
Common Failure Modes

- Retrieval miss: the relevant chunk exists in the index but doesn't make the top-k
- Bad chunking: the answer is split across chunk boundaries, so no single chunk contains it
- Unfaithful generation: the model ignores the retrieved context and answers from its parametric memory
- Lost in the middle: relevant context buried mid-prompt gets less attention than context at the edges