
RAG Systems

Retrieval-Augmented Generation: chunking, embeddings, vector stores, retrieval strategies, and evaluation.

Retrieval-Augmented Generation (RAG)

Why RAG?

LLMs have a knowledge cutoff and don't know your private data. RAG solves this by:

  • Retrieving relevant documents from a knowledge base at query time
  • Injecting them into the prompt as context
  • Having the model answer based on the retrieved content

This separates retrieval (what facts to look up) from generation (how to express them).


    RAG Architecture

    Query → [Embed query] → Vector search → Top-k chunks
                                                ↓
                                   [Assemble prompt: system + chunks + query]
                                                ↓
                                        LLM → Answer
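
    The flow above can be sketched end-to-end in a few lines. Here `embed`, `vector_search`, and `llm` are hypothetical stand-ins for your embedding model, vector store, and chat model; this is a minimal sketch, not a production implementation:

```python
def answer_with_rag(query, embed, vector_search, llm, k=4):
    """Minimal RAG loop: embed the query, retrieve top-k chunks, generate."""
    query_vec = query_embedding = embed(query)
    chunks = vector_search(query_embedding, k=k)  # top-k most similar chunks
    context = "\n\n".join(c["text"] for c in chunks)
    prompt = (
        "Answer using ONLY the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return llm(prompt)
```

    The instruction "using ONLY the context below" is what pushes the model toward grounded answers rather than its parametric knowledge.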


    Document Processing: Chunking

    Documents must be chunked before embedding. Chunking strategy is critical.

    Strategy        Description                                    Use when
    Fixed-size      Split every N tokens                           Simple baseline
    Sentence        Split on sentence boundaries                   Prose text
    Recursive       Split on \n\n → \n → space, progressively      General purpose
    Semantic        Split when embedding similarity drops          Dense technical docs
    Document-aware  Respect headers/sections                       Markdown, HTML, PDFs

    Chunk overlap: include 10-20% of the previous chunk at the start of each new chunk to avoid cutting context at boundaries.
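
    To illustrate the overlap idea, here is a minimal fixed-size chunker with fractional overlap (character-based for simplicity; real recursive or semantic splitters are more involved):

```python
def chunk_text(text, chunk_size=500, overlap=0.15):
    """Fixed-size chunking with fractional overlap (a simple baseline).

    Each new chunk starts overlap * chunk_size characters before the
    previous chunk ended, so context at boundaries is not cut off.
    """
    step = int(chunk_size * (1 - overlap))  # how far each chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # last chunk already covers the tail
    return chunks
```

    With `chunk_size=500` and `overlap=0.15`, each chunk repeats the last 75 characters of the previous one.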


    Embeddings

    Embeddings are dense vector representations of text. Semantically similar text → similar vectors (small cosine distance).

    Popular embedding models:

  • text-embedding-3-small / large (OpenAI) — 1536/3072 dimensions
  • text-embedding-004 (Google)
  • all-MiniLM-L6-v2 (Sentence Transformers, free, fast, runs locally)
  • voyage-3 (Voyage AI) — strong for code and technical text

    Embed both documents (at index time) and queries (at query time) with the same model.
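
    Similarity between two embedding vectors is usually measured with cosine similarity. A small sketch; the commented lines show how the vectors would typically be produced with a local model such as all-MiniLM-L6-v2:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# With a real embedding model (sentence-transformers, not imported here):
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   doc_vecs = model.encode(documents)   # index time
#   query_vec = model.encode(query)      # query time, SAME model
```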


    Vector Stores

    Store embeddings + metadata for fast approximate nearest-neighbour (ANN) search.

    Store     Best for
    pgvector  Already using Postgres; production simplicity
    Pinecone  Managed, large scale, no infra
    Weaviate  Hybrid search (vector + keyword)
    Qdrant    Fast, self-hosted, rich filtering
    ChromaDB  Local dev and prototyping
    Faiss     In-memory, CPU/GPU, research

    ANN algorithms: HNSW (Hierarchical Navigable Small World) is the most common — sub-linear search time with high recall.
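
    Before reaching for an ANN index, exact brute-force search is a useful baseline (and fine up to roughly 100k vectors). A NumPy sketch of what the vector store is doing under the hood:

```python
import numpy as np

def top_k_search(query_vec, doc_matrix, k=3):
    """Exact nearest-neighbour search by cosine similarity.

    Normalises rows and the query, then a single matrix-vector product
    gives all similarities. ANN indexes like HNSW approximate this with
    sub-linear search time at the cost of a little recall.
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    sims = d @ q                      # cosine similarity to every document
    idx = np.argsort(-sims)[:k]      # indices of the k highest scores
    return idx, sims[idx]
```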


    Retrieval Strategies

    Dense retrieval: embedding similarity (handles semantic matches, synonyms)
    Sparse retrieval: BM25/TF-IDF (handles exact keyword matches, IDs, names)
    Hybrid: combine both, re-rank with a cross-encoder — best of both worlds

    Re-ranking: after retrieving top-k candidates, use a cross-encoder to re-score chunk-query pairs more accurately. Cross-encoders are slower but more precise than bi-encoders.
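
    One common way to merge the dense and sparse result lists is Reciprocal Rank Fusion (RRF). A minimal sketch (the cross-encoder re-ranking step is omitted, since it needs a model):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked doc-id lists (e.g. dense + BM25 results) with RRF.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the conventional smoothing constant. Documents ranked well by
    several retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

    RRF is popular because it needs no score calibration: it only uses ranks, so dense cosine scores and BM25 scores never have to be put on the same scale.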

    HyDE (Hypothetical Document Embeddings): generate a hypothetical answer to the query, embed it, then search with that embedding. Often outperforms embedding the raw query for abstract questions.
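
    The HyDE step itself is small. In this sketch, `generate`, `embed`, and `search` are hypothetical stand-ins for your LLM call, embedding model, and vector-store query:

```python
def hyde_search(query, generate, embed, search, k=5):
    """HyDE: search with the embedding of a *hypothetical* answer.

    The generated passage may be factually wrong; what matters is that
    its vocabulary and phrasing resemble real answer documents more
    closely than the raw question does.
    """
    hypothetical = generate(f"Write a short passage that answers: {query}")
    return search(embed(hypothetical), k)
```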


    RAG Evaluation

    Metric             What it measures
    Context Precision  Of retrieved chunks, what fraction were actually relevant?
    Context Recall     Of all relevant chunks, what fraction did we retrieve?
    Faithfulness       Is the answer grounded in the retrieved context?
    Answer Relevance   Does the answer address the query?

    Tools: Ragas (Python library for automated RAG evaluation), LangSmith (tracing + eval).
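
    When you have labeled relevant chunks, the two retrieval metrics reduce to set arithmetic (tools like Ragas instead estimate them with LLM judges when no labels exist). A sketch:

```python
def context_precision(retrieved, relevant):
    """Of the retrieved chunks, what fraction were actually relevant?"""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def context_recall(retrieved, relevant):
    """Of all relevant chunks, what fraction did we retrieve?"""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(relevant)
```

    As usual, the two trade off: retrieving more chunks raises recall but tends to lower precision.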


    Common Failure Modes

  • Chunk too large → relevant info diluted by irrelevant context
  • Chunk too small → context split across chunks, retrieval misses
  • Wrong embedding model → semantic mismatch between query and doc vectors
  • Missing metadata filtering → retrieve from wrong data source
  • No re-ranking → top-1 by cosine similarity isn't always the most relevant
  • Not enough retrieved chunks → answer lacks necessary context
