LLM Fundamentals

Transformers, tokenisation, temperature, context windows, and how LLMs actually work.

The Transformer Architecture

All modern LLMs are based on the Transformer architecture (Vaswani et al., 2017 — "Attention Is All You Need").

Key Components

Attention Mechanism: allows each token to "attend" to every other token in the sequence. The attention weights are computed by:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

  • Q (Query): What this token is looking for
  • K (Key): What other tokens offer
  • V (Value): The actual content to extract

Multi-Head Attention: multiple attention heads run in parallel; each learns to attend to a different type of relationship (syntax, coreference, semantics, etc.).

Feedforward Layers: after attention, each position is processed by an identical feedforward network (usually 4× the model width). This is where most of the model's "knowledge" is stored.
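The attention formula above can be sketched directly in NumPy (the shapes and values here are illustrative, not from a real model):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) · V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq, seq): how much each token attends to each other token
    weights = softmax(scores)        # each row is a probability distribution over the sequence
    return weights @ V               # each output row is a weighted mix of value rows

# Toy example: 3 tokens, d_k = 4
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (3, 4): one mixed value vector per token
```

Multi-head attention runs several copies of this with separate learned projections of Q, K, and V, then concatenates the results.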


Tokenisation

Text → tokens → token IDs → embeddings → transformer → logits → next token

Tokenisers don't split by words; they split into sub-word units (BPE, WordPiece, or SentencePiece):

  • "unhappiness" → ["un", "happiness"] (2 tokens)
  • "GPT" → ["G", "PT"] or ["GPT"] depending on the vocab
  • Numbers and code are often split into individual digits/characters
  • Rule of thumb: ~1 token ≈ 0.75 words (English). Non-Latin scripts are often more tokens per word.
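The splitting step can be illustrated with greedy longest-match segmentation against a tiny hand-made vocabulary. Real tokenisers use learned BPE merge rules over vocabularies of tens of thousands of entries; this six-entry vocab is purely hypothetical:

```python
# Illustrative only: a hypothetical 6-entry vocabulary, not a real BPE tokeniser.
vocab = {"un": 0, "happiness": 1, "G": 2, "PT": 3, " ": 4, "<unk>": 5}

def tokenise(text, vocab):
    """Greedy longest-match segmentation into sub-word units."""
    tokens = []
    i = 0
    while i < len(text):
        # try the longest substring starting at i that is in the vocab
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append("<unk>")  # no match: fall back to an unknown token
            i += 1
    return tokens

tokens = tokenise("unhappiness", vocab)
ids = [vocab[t] for t in tokens]
print(tokens, ids)  # ['un', 'happiness'] [0, 1]
```

The token IDs are then looked up in an embedding matrix, which is where the pipeline above continues.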


Key Parameters

  • Temperature: randomness. 0 = greedy (always pick the max-probability token), 1 = sample from the distribution, >1 = more random
  • Top-p (nucleus): sample from the smallest set of tokens whose cumulative probability ≥ p; filters out low-probability tokens
  • Top-k: sample only from the top k tokens by probability
  • Max tokens: maximum output length in tokens
  • Frequency penalty: penalise tokens that have already appeared (reduces repetition)

Temperature 0 → deterministic; best for factual tasks and code. Temperature 0.7–1.0 → creative, diverse outputs.
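Temperature and top-p can be combined in one sampling step. A minimal sketch (real inference stacks apply these on GPU over vocabularies of ~100k tokens):

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_p=1.0, rng=None):
    """Temperature scaling followed by nucleus (top-p) filtering."""
    rng = rng or np.random.default_rng()
    if temperature == 0:                       # greedy: always the argmax token
        return int(np.argmax(logits))
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]            # highest probability first
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]  # smallest set with cumulative prob >= top_p
    p = probs[keep] / probs[keep].sum()        # renormalise over the nucleus
    return int(rng.choice(keep, p=p))

logits = np.array([2.0, 1.0, 0.1, -1.0])
print(sample_next(logits, temperature=0))  # 0 (greedy picks the max logit)
```

With a very small top_p the nucleus collapses to the single most likely token, which is why low top-p behaves much like low temperature.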


Context Window

The context window is the total number of tokens the model can "see" at once (input + output combined).

  • GPT-3.5: 16k tokens
  • GPT-4o: 128k tokens
  • Claude 3.5: 200k tokens
  • Gemini 1.5 Pro: 1M tokens

Why it matters for engineers:
  • RAG: retrieved chunks must fit in context alongside the query + system prompt
  • Long documents: may need chunking/summarisation if they exceed the context
  • Cost: most APIs charge per input token — large contexts are expensive
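A rough budget check before a RAG call can be sketched like this. The 0.75 words-per-token ratio comes from the rule of thumb above; the window and output-reserve numbers are placeholder assumptions, not values from any API:

```python
# Assumed budget: a 128k-token window, reserving 1,000 tokens for the output.
CONTEXT_WINDOW = 128_000
MAX_OUTPUT = 1_000

def estimate_tokens(text):
    """Crude English-only estimate: ~1 token per 0.75 words."""
    return int(len(text.split()) / 0.75)

def fits(system_prompt, query, chunks, max_output=MAX_OUTPUT):
    """Do system prompt + query + retrieved chunks + output all fit in the window?"""
    used = estimate_tokens(system_prompt) + estimate_tokens(query)
    used += sum(estimate_tokens(c) for c in chunks)
    return used + max_output <= CONTEXT_WINDOW
```

In production you would count tokens with the model's actual tokeniser rather than estimating from word counts.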

Inference: How Text Is Generated

LLMs generate text autoregressively — one token at a time:

  1. Tokenise the input
  2. Run a transformer forward pass → probability distribution over the vocab
  3. Sample the next token (using temperature/top-p/top-k)
  4. Append the token to the sequence and repeat from step 2

This is why:

  • Generation is slower than classification (sequential, not parallel)
  • You can't easily "undo" a token
  • Streaming responses work by emitting tokens one by one as they're generated
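The loop above can be sketched with a toy stand-in for the forward pass. `forward` here is hypothetical (it just favours "previous token + 1"); a real model would run the full transformer:

```python
import numpy as np

VOCAB_SIZE = 10
EOS = 9  # assumed end-of-sequence token id for this toy example

def forward(token_ids):
    """Hypothetical forward pass: returns logits over the vocabulary."""
    logits = np.zeros(VOCAB_SIZE)
    logits[(token_ids[-1] + 1) % VOCAB_SIZE] = 5.0
    return logits

def generate(prompt_ids, max_tokens=8):
    ids = list(prompt_ids)                 # step 1: tokenised input
    for _ in range(max_tokens):
        logits = forward(ids)              # step 2: forward pass -> logits
        next_id = int(np.argmax(logits))   # step 3: greedy sampling (temperature 0)
        ids.append(next_id)                # step 4: append and repeat
        yield next_id                      # streaming: emit each token as it's produced
        if next_id == EOS:
            break

print(list(generate([5])))  # [6, 7, 8, 9] — stops at EOS
```

Writing `generate` as a generator is exactly how streaming APIs feel from the client side: tokens arrive one by one, and the sequence so far can never be retracted.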

Emergent Capabilities

Larger models exhibit capabilities not present in smaller models — these are called emergent:

  • Chain-of-thought reasoning (appears around 100B+ parameters)
  • Few-shot learning (in-context examples)
  • Instruction following
  • Code generation

This is why "bigger is better" held for a long time — though efficiency techniques (MoE, distillation, PEFT) now challenge it.
