
RAG Systems

Retrieval-Augmented Generation: chunking, embeddings, vector stores, retrieval strategies, and evaluation.

Retrieval-Augmented Generation (RAG)

Why RAG?

LLMs have a knowledge cutoff and don't know your private data. RAG solves this by:

  • Retrieving relevant documents from a knowledge base at query time
  • Injecting them into the prompt as context
  • Having the model answer based on the retrieved content

This separates retrieval (what facts to look up) from generation (how to express them).


    RAG Architecture

    Query → [Embed query] → Vector search → Top-k chunks
                                                ↓
                                   [Assemble prompt: system + chunks + query]
                                                ↓
                                        LLM → Answer
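
    The flow above can be sketched end-to-end in a few lines. Here `embed`, `vector_search`, and `llm` are hypothetical stand-ins for your embedding model, vector store, and chat model; this is a minimal sketch, not a production implementation:

```python
def answer_with_rag(query, embed, vector_search, llm, k=4):
    """Minimal RAG loop: embed the query, retrieve top-k chunks, generate."""
    query_vec = query_embedding = embed(query)
    chunks = vector_search(query_embedding, k=k)  # top-k most similar chunks
    context = "\n\n".join(c["text"] for c in chunks)
    prompt = (
        "Answer using ONLY the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return llm(prompt)
```

    The instruction "using ONLY the context below" is what pushes the model toward grounded answers rather than its parametric knowledge.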


    Document Processing: Chunking

    Documents must be chunked before embedding. Chunking strategy is critical.

    Strategy        Description                                    Use when
    Fixed-size      Split every N tokens                           Simple baseline
    Sentence        Split on sentence boundaries                   Prose text
    Recursive       Split on \n\n → \n → space, progressively      General purpose
    Semantic        Split when embedding similarity drops          Dense technical docs
    Document-aware  Respect headers/sections                       Markdown, HTML, PDFs

    Chunk overlap: include 10-20% of the previous chunk at the start of each new chunk to avoid cutting context at boundaries.
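
    To illustrate the overlap idea, here is a minimal fixed-size chunker with fractional overlap (character-based for simplicity; real recursive or semantic splitters are more involved):

```python
def chunk_text(text, chunk_size=500, overlap=0.15):
    """Fixed-size chunking with fractional overlap (a simple baseline).

    Each new chunk starts overlap * chunk_size characters before the
    previous chunk ended, so context at boundaries is not cut off.
    """
    step = int(chunk_size * (1 - overlap))  # how far each chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # last chunk already covers the tail
    return chunks
```

    With `chunk_size=500` and `overlap=0.15`, each chunk repeats the last 75 characters of the previous one.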


    Embeddings

    Embeddings are dense vector representations of text. Semantically similar text → similar vectors (small cosine distance).

    Popular embedding models:

  • text-embedding-3-small / large (OpenAI) — 1536/3072 dimensions
  • text-embedding-004 (Google)
  • all-MiniLM-L6-v2 (Sentence Transformers, free, fast, runs locally)
  • voyage-3 (Voyage AI) — strong for code and technical text

    Embed both documents (at index time) and queries (at query time) with the same model.
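
    Similarity between two embedding vectors is usually measured with cosine similarity. A small sketch; the commented lines show how the vectors would typically be produced with a local model such as all-MiniLM-L6-v2:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# With a real embedding model (sentence-transformers, not imported here):
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   doc_vecs = model.encode(documents)   # index time
#   query_vec = model.encode(query)      # query time, SAME model
```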


    Vector Stores

    Store embeddings + metadata for fast approximate nearest-neighbour (ANN) search.

    Store     Best for
    pgvector  Already using Postgres; production simplicity
    Pinecone  Managed, large scale, no infra
    Weaviate  Hybrid search (vector + keyword)
    Qdrant    Fast, self-hosted, rich filtering
    ChromaDB  Local dev and prototyping
    Faiss     In-memory, CPU/GPU, research

    ANN algorithms: HNSW (Hierarchical Navigable Small World) is the most common — sub-linear search time with high recall.
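
    Before reaching for an ANN index, exact brute-force search is a useful baseline (and fine up to roughly 100k vectors). A NumPy sketch of what the vector store is doing under the hood:

```python
import numpy as np

def top_k_search(query_vec, doc_matrix, k=3):
    """Exact nearest-neighbour search by cosine similarity.

    Normalises rows and the query, then a single matrix-vector product
    gives all similarities. ANN indexes like HNSW approximate this with
    sub-linear search time at the cost of a little recall.
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    sims = d @ q                      # cosine similarity to every document
    idx = np.argsort(-sims)[:k]      # indices of the k highest scores
    return idx, sims[idx]
```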


    Retrieval Strategies

    Dense retrieval: embedding similarity (handles semantic matches, synonyms)
    Sparse retrieval: BM25/TF-IDF (handles exact keyword matches, IDs, names)
    Hybrid: combine both, re-rank with a cross-encoder — best of both worlds

    Re-ranking: after retrieving top-k candidates, use a cross-encoder to re-score chunk-query pairs more accurately. Cross-encoders are slower but more precise than bi-encoders.
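
    One common way to merge the dense and sparse result lists is Reciprocal Rank Fusion (RRF). A minimal sketch (the cross-encoder re-ranking step is omitted, since it needs a model):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked doc-id lists (e.g. dense + BM25 results) with RRF.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the conventional smoothing constant. Documents ranked well by
    several retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

    RRF is popular because it needs no score calibration: it only uses ranks, so dense cosine scores and BM25 scores never have to be put on the same scale.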

    HyDE (Hypothetical Document Embeddings): generate a hypothetical answer to the query, embed it, then search with that embedding. Often outperforms embedding the raw query for abstract questions.
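
    The HyDE step itself is small. In this sketch, `generate`, `embed`, and `search` are hypothetical stand-ins for your LLM call, embedding model, and vector-store query:

```python
def hyde_search(query, generate, embed, search, k=5):
    """HyDE: search with the embedding of a *hypothetical* answer.

    The generated passage may be factually wrong; what matters is that
    its vocabulary and phrasing resemble real answer documents more
    closely than the raw question does.
    """
    hypothetical = generate(f"Write a short passage that answers: {query}")
    return search(embed(hypothetical), k)
```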


    RAG Evaluation

    Metric             What it measures
    Context Precision  Of retrieved chunks, what fraction were actually relevant?
    Context Recall     Of all relevant chunks, what fraction did we retrieve?
    Faithfulness       Is the answer grounded in the retrieved context?
    Answer Relevance   Does the answer address the query?

    Tools: Ragas (Python library for automated RAG evaluation), LangSmith (tracing + eval).
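
    When you have labeled relevant chunks, the two retrieval metrics reduce to set arithmetic (tools like Ragas instead estimate them with LLM judges when no labels exist). A sketch:

```python
def context_precision(retrieved, relevant):
    """Of the retrieved chunks, what fraction were actually relevant?"""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def context_recall(retrieved, relevant):
    """Of all relevant chunks, what fraction did we retrieve?"""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(relevant)
```

    As usual, the two trade off: retrieving more chunks raises recall but tends to lower precision.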


    Common Failure Modes

  • Chunk too large → relevant info diluted by irrelevant context
  • Chunk too small → context split across chunks, retrieval misses
  • Wrong embedding model → semantic mismatch between query and doc vectors
  • Missing metadata filtering → retrieve from wrong data source
  • No re-ranking → top-1 by cosine similarity isn't always the most relevant
  • Not enough retrieved chunks → answer lacks necessary context
