LLM Fundamentals

Transformers, tokenisation, temperature, context windows, and how LLMs actually work.

The Transformer Architecture

All modern LLMs are based on the Transformer architecture (Vaswani et al., 2017 — "Attention Is All You Need").

Key Components

Attention Mechanism: allows each token to "attend" to every other token in the sequence. The attention weights are computed by:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

  • Q (Query): What this token is looking for
  • K (Key): What other tokens offer
  • V (Value): The actual content to extract

Multi-Head Attention: multiple attention heads run in parallel; each learns to attend to a different type of relationship (syntax, coreference, semantics, etc.).

Feedforward Layers: after attention, each position is processed by an identical feedforward network (usually 4× the model width). This is where most of the model's "knowledge" is stored.
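The attention formula above can be sketched directly in NumPy (the shapes and values here are illustrative, not from a real model):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) · V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq, seq): how much each token attends to each other token
    weights = softmax(scores)        # each row is a probability distribution over the sequence
    return weights @ V               # each output row is a weighted mix of value rows

# Toy example: 3 tokens, d_k = 4
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (3, 4): one mixed value vector per token
```

Multi-head attention runs several copies of this with separate learned projections of Q, K, and V, then concatenates the results.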


Tokenisation

Text → tokens → token IDs → embeddings → transformer → logits → next token

Tokenisers don't split by words; they split into sub-word units (BPE, WordPiece, or SentencePiece):

  • "unhappiness" → ["un", "happiness"] (2 tokens)
  • "GPT" → ["G", "PT"] or ["GPT"] depending on the vocab
  • Numbers and code are often split into individual digits/characters
  • Rule of thumb: ~1 token ≈ 0.75 words (English). Non-Latin scripts are often more tokens per word.
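The splitting step can be illustrated with greedy longest-match segmentation against a tiny hand-made vocabulary. Real tokenisers use learned BPE merge rules over vocabularies of tens of thousands of entries; this six-entry vocab is purely hypothetical:

```python
# Illustrative only: a hypothetical 6-entry vocabulary, not a real BPE tokeniser.
vocab = {"un": 0, "happiness": 1, "G": 2, "PT": 3, " ": 4, "<unk>": 5}

def tokenise(text, vocab):
    """Greedy longest-match segmentation into sub-word units."""
    tokens = []
    i = 0
    while i < len(text):
        # try the longest substring starting at i that is in the vocab
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append("<unk>")  # no match: fall back to an unknown token
            i += 1
    return tokens

tokens = tokenise("unhappiness", vocab)
ids = [vocab[t] for t in tokens]
print(tokens, ids)  # ['un', 'happiness'] [0, 1]
```

The token IDs are then looked up in an embedding matrix, which is where the pipeline above continues.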


Key Parameters

  • Temperature: randomness. 0 = greedy (always pick the max-probability token), 1 = sample from the distribution, >1 = more random
  • Top-p (nucleus): sample from the smallest set of tokens whose cumulative probability ≥ p; filters out low-probability tokens
  • Top-k: sample only from the top k tokens by probability
  • Max tokens: maximum output length in tokens
  • Frequency penalty: penalise tokens that have already appeared (reduces repetition)

Temperature 0 → deterministic; best for factual tasks and code. Temperature 0.7–1.0 → creative, diverse outputs.
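Temperature and top-p can be combined in one sampling step. A minimal sketch (real inference stacks apply these on GPU over vocabularies of ~100k tokens):

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_p=1.0, rng=None):
    """Temperature scaling followed by nucleus (top-p) filtering."""
    rng = rng or np.random.default_rng()
    if temperature == 0:                       # greedy: always the argmax token
        return int(np.argmax(logits))
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]            # highest probability first
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]  # smallest set with cumulative prob >= top_p
    p = probs[keep] / probs[keep].sum()        # renormalise over the nucleus
    return int(rng.choice(keep, p=p))

logits = np.array([2.0, 1.0, 0.1, -1.0])
print(sample_next(logits, temperature=0))  # 0 (greedy picks the max logit)
```

With a very small top_p the nucleus collapses to the single most likely token, which is why low top-p behaves much like low temperature.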


Context Window

The context window is the total number of tokens the model can "see" at once (input + output combined).

  • GPT-3.5: 16k tokens
  • GPT-4o: 128k tokens
  • Claude 3.5: 200k tokens
  • Gemini 1.5 Pro: 1M tokens

Why it matters for engineers:
  • RAG: retrieved chunks must fit in context alongside the query + system prompt
  • Long documents: may need chunking/summarisation if they exceed the context
  • Cost: most APIs charge per input token — large contexts are expensive
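A rough budget check before a RAG call can be sketched like this. The 0.75 words-per-token ratio comes from the rule of thumb above; the window and output-reserve numbers are placeholder assumptions, not values from any API:

```python
# Assumed budget: a 128k-token window, reserving 1,000 tokens for the output.
CONTEXT_WINDOW = 128_000
MAX_OUTPUT = 1_000

def estimate_tokens(text):
    """Crude English-only estimate: ~1 token per 0.75 words."""
    return int(len(text.split()) / 0.75)

def fits(system_prompt, query, chunks, max_output=MAX_OUTPUT):
    """Do system prompt + query + retrieved chunks + output all fit in the window?"""
    used = estimate_tokens(system_prompt) + estimate_tokens(query)
    used += sum(estimate_tokens(c) for c in chunks)
    return used + max_output <= CONTEXT_WINDOW
```

In production you would count tokens with the model's actual tokeniser rather than estimating from word counts.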

Inference: How Text Is Generated

LLMs generate text autoregressively — one token at a time:

  1. Tokenise the input
  2. Run a transformer forward pass → probability distribution over the vocab
  3. Sample the next token (using temperature/top-p/top-k)
  4. Append the token to the sequence and repeat from step 2

This is why:

  • Generation is slower than classification (sequential, not parallel)
  • You can't easily "undo" a token
  • Streaming responses work by emitting tokens one by one as they're generated
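The loop above can be sketched with a toy stand-in for the forward pass. `forward` here is hypothetical (it just favours "previous token + 1"); a real model would run the full transformer:

```python
import numpy as np

VOCAB_SIZE = 10
EOS = 9  # assumed end-of-sequence token id for this toy example

def forward(token_ids):
    """Hypothetical forward pass: returns logits over the vocabulary."""
    logits = np.zeros(VOCAB_SIZE)
    logits[(token_ids[-1] + 1) % VOCAB_SIZE] = 5.0
    return logits

def generate(prompt_ids, max_tokens=8):
    ids = list(prompt_ids)                 # step 1: tokenised input
    for _ in range(max_tokens):
        logits = forward(ids)              # step 2: forward pass -> logits
        next_id = int(np.argmax(logits))   # step 3: greedy sampling (temperature 0)
        ids.append(next_id)                # step 4: append and repeat
        yield next_id                      # streaming: emit each token as it's produced
        if next_id == EOS:
            break

print(list(generate([5])))  # [6, 7, 8, 9] — stops at EOS
```

Writing `generate` as a generator is exactly how streaming APIs feel from the client side: tokens arrive one by one, and the sequence so far can never be retracted.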

Emergent Capabilities

Larger models exhibit capabilities not present in smaller models — these are called emergent:

  • Chain-of-thought reasoning (appears around 100B+ parameters)
  • Few-shot learning (in-context examples)
  • Instruction following
  • Code generation

This is why "bigger is better" held for a long time — though efficiency techniques (MoE, distillation, PEFT) now challenge it.
