LLM Fundamentals
Transformers, tokenisation, temperature, context windows, and how LLMs actually work.
The Transformer Architecture
All modern LLMs are based on the Transformer architecture (Vaswani et al., 2017 — "Attention Is All You Need").
Key Components
Attention Mechanism Allows each token to "attend" to every other token in the sequence. The weight of attention is computed by:
Attention(Q, K, V) = softmax(QK^T / √d_k) · V

Multi-Head Attention Multiple attention heads run in parallel — each learns to attend to different types of relationships (syntax, coreference, semantics, etc.).
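The attention formula can be sketched in a few lines of NumPy. This is a minimal single-head version with no masking or batching; the shapes and random inputs are purely illustrative:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq, seq) attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the key axis
    return weights @ V                              # weighted sum of value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 tokens, d_k = 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = attention(Q, K, V)      # (4, 8): one context vector per token
```

Dividing by √d_k keeps the dot products from growing with the head dimension, which would otherwise saturate the softmax.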
Feedforward Layers After attention, the same feedforward network is applied independently at each position (its hidden layer is usually 4× the model width). This is where most of the model's "knowledge" is stored.
Tokenisation
Text → tokens → token IDs → embeddings → transformer → logits → next token
Tokenisers don't split by words — they split by sub-word units (BPE, WordPiece, or SentencePiece).
Rule of thumb: ~1 token ≈ 0.75 words (English). Non-Latin scripts often require more tokens per word.
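To make the sub-word idea concrete, here is a toy sketch of a single BPE merge step: count adjacent symbol pairs across a corpus and merge the most frequent one into a new sub-word unit. The corpus and symbols are illustrative, not any real tokeniser's vocabulary:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of symbol sequences."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # fuse the pair
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word is a tuple of characters, mapped to its frequency
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6}
pair = most_frequent_pair(words)   # ('w', 'e') is the most common pair here
words = merge_pair(words, pair)    # "we" is now a single sub-word unit
```

A real BPE tokeniser repeats this merge loop thousands of times on a large corpus, then applies the learned merges in order when encoding new text.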
Key Parameters
| Parameter | What it controls |
|---|---|
| Temperature | Randomness. 0 = greedy (always pick max probability), 1 = sample from distribution, >1 = more random |
| Top-p (nucleus) | Sample from the smallest set of tokens whose cumulative probability ≥ p. Filters out low-prob tokens |
| Top-k | Sample only from the top-k tokens by probability |
| Max tokens | Maximum output length in tokens |
| Frequency penalty | Penalise tokens that have already appeared (reduces repetition) |
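The sampling parameters above compose in a pipeline: temperature reshapes the distribution, then top-k and top-p filter it before sampling. A pure-Python sketch (the function name `sample_next` is hypothetical; real inference stacks apply the same logic to full vocabularies):

```python
import math, random

def sample_next(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Pick a next-token index from raw logits using temperature, top-k, top-p."""
    if temperature == 0:                       # greedy: always take the argmax
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = [math.exp(l / temperature) for l in logits]
    total = sum(probs)
    probs = [p / total for p in probs]         # softmax with temperature
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k:
        order = order[:top_k]                  # keep only the k most likely tokens
    if top_p < 1.0:                            # nucleus: smallest set with mass >= p
        kept, mass = [], 0.0
        for i in order:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:
                break
        order = kept
    kept_mass = sum(probs[i] for i in order)   # renormalise over survivors
    r = random.random() * kept_mass
    for i in order:
        r -= probs[i]
        if r <= 0:
            return i
    return order[-1]

logits = [2.0, 1.0, 0.1, -1.0]
token = sample_next(logits, temperature=0.8, top_k=3, top_p=0.9)
```

Note that temperature=0 short-circuits to greedy decoding, and top_k=1 is equivalent to greedy regardless of temperature.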
Context Window
The context window is the total number of tokens the model can "see" at once (input + output combined).
| Model family | Context window |
|---|---|
| GPT-3.5 | 16k tokens |
| GPT-4o | 128k tokens |
| Claude 3.5 | 200k tokens |
| Gemini 1.5 Pro | 1M tokens |
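Because input and output share one window, a practical check is whether the prompt plus the requested output budget fits. A minimal sketch (the helper name is hypothetical; the 128k default matches the GPT-4o row above):

```python
def fits_in_context(prompt_tokens: int, max_output_tokens: int,
                    context_window: int = 128_000) -> bool:
    """Input and output share one window, so their sum must fit inside it."""
    return prompt_tokens + max_output_tokens <= context_window

fits_in_context(120_000, 4_000)   # True: 124k tokens fit in 128k
fits_in_context(126_000, 4_000)   # False: 130k tokens exceed 128k
```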
Inference: How Text Is Generated
LLMs generate text autoregressively — one token at a time. At each step the model turns the sequence so far into a probability distribution over the next token, samples one, appends it to the sequence, and repeats until it emits an end-of-sequence token or hits the max-token limit. This is why output streams token by token, and why generation cost and latency grow with output length.
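The generation loop can be sketched with a toy stand-in for the model — here `next_token` is a hypothetical deterministic lookup, not a real LLM, but the loop structure (feed the whole sequence back in, append one token, repeat) is the same:

```python
def next_token(sequence):
    """Stand-in for the model: a toy deterministic rule, not a real LLM."""
    table = {"the": "cat", "cat": "sat", "sat": "down"}
    return table.get(sequence[-1], "<eos>")

tokens = ["the"]
while tokens[-1] != "<eos>" and len(tokens) < 10:
    tokens.append(next_token(tokens))   # append the prediction, then repeat
# tokens == ["the", "cat", "sat", "down", "<eos>"]
```

In a real model, each iteration is a full forward pass over the sequence so far, which is why caching key/value activations across steps matters for inference speed.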
Emergent Capabilities
Larger models exhibit capabilities not present in smaller models — these are called emergent capabilities. Examples include few-shot in-context learning, multi-step arithmetic, and chain-of-thought reasoning.
This is why "bigger is better" held for a long time — though now efficiency techniques (MoE, distillation, PEFT) challenge this.