The Race to Scale
In Part 3, we saw the mechanism: predict the next token, roll the dice, repeat. Beautifully simple. But OpenAI didn't just build an elegant autocomplete — they built it at a scale nobody had attempted before. What followed was one of the most dramatic escalations in the history of technology.
In 2020, researchers discovered that model error follows a smooth power law. 4 Bigger model, better results. No cliff. No sudden plateau — just predictable gains. And in 2022, DeepMind showed GPT-3 was actually undertrained 5 — the optimal strategy wasn't just more parameters, it was more data to match them. Every lab in the world started running.
What does that kind of growth actually look like inside a model? Each building below is one GPT generation. The foundation bricks are attention heads — one per head, sized by the hidden dimension. The tower above is every transformer layer stacked on top. Watch GPT-1's modest 12-layer building get dwarfed as each generation adds more heads, wider layers, and deeper stacks.
Breaking the Wall
In Part 2 we saw how attention works: every word produces a Query, Key, and Value, dot-products find which words matter, and the weighted Values become the output. Every word attends to every other word — an n × n matrix of scores.
That matrix is the wall. 6 words = 36 scores. 1,000 words = 1 million. 1 million words = 1 trillion. Double the context, quadruple the cost. GPT-3 computes 9,216 of these matrices (96 heads × 96 layers), each one n × n. Modern models handle 1 million tokens of context. At that scale, full attention is physically impossible.
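The arithmetic behind the wall is easy to check. A toy back-of-envelope sketch, not model code:

```python
def score_count(n_tokens):
    """Full attention: one score for every (query, key) pair."""
    return n_tokens * n_tokens

gpt3_matrices = 96 * 96    # one n x n score matrix per head, per layer

print(score_count(6))             # 36
print(score_count(1_000_000))     # 1,000,000,000,000 -- a trillion
print(gpt3_matrices)              # 9,216
```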
Breaking the Matrix
The first problem is the attention computation itself — the n × n score matrix.
Full Attention — the baseline
Standard self-attention computes a score between every pair of tokens. For a sequence of n tokens, that's an n × n matrix — one entry for every possible connection. It's complete: nothing is missed. But the cost is quadratic.
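A minimal NumPy sketch of this baseline, with toy sizes and random data (illustrative only, not production code):

```python
import numpy as np

def full_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V, materializing the full n x n score matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                        # (n, n): quadratic in n
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)                # row-wise softmax
    return w @ V

rng = np.random.default_rng(0)
n, d = 6, 8                                              # 6 tokens -> 36 scores
Q, K, V = rng.normal(size=(3, n, d))
out = full_attention(Q, K, V)                            # one vector per token
```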
Sliding Window Attention
The simplest fix: limit each token to its w nearest neighbors. Mistral uses a window of 4,096 tokens. The cost drops from O(n²) to O(n × w) — linear in context length. 11
The trade-off is real: tokens outside the window are invisible. But information still propagates through layers. With 32 transformer blocks and a window of 4,096, the effective receptive field reaches ~131K tokens — each layer passes information one window-width further.
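Both the window mask and the receptive-field arithmetic fit in a few lines. A sketch using the Mistral-style numbers above:

```python
import numpy as np

def sliding_window_mask(n, w):
    """True where token i may attend to token j: causal, within the last w tokens."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - w)

def receptive_field(layers, window):
    # Each layer relays information one window-width further.
    return layers * window

mask = sliding_window_mask(10, 4)          # no row sees more than 4 tokens
print(receptive_field(32, 4096))           # 131072: the ~131K effective reach
```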
Flash Attention
Flash Attention doesn't change what the model computes — the math is identical to full attention. It changes how: by tiling the Q, K, V matrices into small blocks that fit in the GPU's fast on-chip SRAM (~20 MB), it avoids ever writing the full n × n attention matrix to slow HBM (GPU main memory). 7
The result: O(n) memory instead of O(n²), and 3× end-to-end speedup on GPT-2 (up to 7.6× on the attention computation alone). 7 The key insight is IO-awareness — the bottleneck isn't compute, it's memory bandwidth. Flash Attention computes exact results — not an approximation — with a fraction of the memory traffic. Today, virtually every large model uses Flash Attention. It's table stakes.
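The heart of the trick is the online softmax: keep a running max and running sum so each block of keys can be folded in without ever storing all the scores at once. A single-query toy version (real Flash Attention is a fused GPU kernel that tiles Q as well; this only shows why the blockwise result is exact):

```python
import numpy as np

def online_softmax_attention(q, K, V, block=2):
    """One query row, processed over K/V blocks, never holding all scores."""
    d = q.shape[-1]
    m = -np.inf                              # running max of scores seen so far
    s = 0.0                                  # running softmax denominator
    acc = np.zeros(V.shape[-1])              # running weighted sum of values
    for start in range(0, len(K), block):
        scores = q @ K[start:start + block].T / np.sqrt(d)
        m_new = max(m, scores.max())
        scale = np.exp(m - m_new)            # rescale earlier partial results
        p = np.exp(scores - m_new)
        s = s * scale + p.sum()
        acc = acc * scale + p @ V[start:start + block]
        m = m_new
    return acc / s

rng = np.random.default_rng(1)
K, V = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
q = rng.normal(size=4)
tiled = online_softmax_attention(q, K, V)
```

The running rescale is what makes the answer bit-for-bit equivalent to ordinary softmax attention, not an approximation.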
Shrinking the Cache
Flash Attention solved the training bottleneck. But there's a second wall that Flash doesn't touch.
When a model generates text — one token at a time — it caches every previous token's keys and values so it doesn't have to recompute them. This is the KV cache, and it grows linearly with every token produced. For a 96-head model generating a 128K-token response, that's 96 separate K and V tensors, each growing with every single token. Flash Attention can't shrink this. It's a completely different bottleneck — one that only matters during inference, not training.
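The growth is easy to quantify. A sketch using GPT-3-shaped dimensions (96 layers, 96 heads, head dimension 128), assuming the cache is stored in FP16:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_val=2):
    """Keys and values cached for every layer, head, and token (FP16 assumed)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_val

cache = kv_cache_bytes(96, 96, 128, 128_000)
print(cache / 1e9)   # ~604 GB for a single 128K-token sequence
```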
Multi-Query Attention (MQA)
The bluntest fix: share a single set of keys and values across all 96 query heads. The KV cache drops by 96×. The trade-off: some quality degradation and training instability, since all heads now read from the same information.
Grouped-Query Attention (GQA)
The compromise. Instead of one shared KV head (MQA) or 96 independent ones (full MHA), GQA divides the query heads into groups — typically 8. Each group shares one set of keys and values. 6
Llama 3 uses GQA with 8 KV heads. 6 Mistral uses GQA with 8 KV heads. 11 It preserves most of the quality of full multi-head attention while capturing most of the speed gains of MQA — the sweet spot that the industry converged on.
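Mechanically, GQA just repeats each KV head across its group of query heads. A NumPy sketch with an 8-KV-head, 32-query-head split (the tensor sizes are illustrative):

```python
import numpy as np

def gqa_expand(kv, n_query_heads):
    """Repeat each KV head so consecutive query heads share it.
    1 KV head -> MQA; as many KV heads as query heads -> full MHA."""
    group = n_query_heads // kv.shape[0]
    return np.repeat(kv, group, axis=0)

# 8 KV heads serving 32 query heads: each group of 4 query heads shares one.
K = np.zeros((8, 16, 64))            # (kv_heads, seq_len, head_dim)
K_for_queries = gqa_expand(K, 32)    # cache stores 8 heads, queries see 32
```

The cache shrinks by the group factor (here 4×) while each group still gets its own keys and values.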
Multi-Head Latent Attention (MLA)
DeepSeek took a different approach entirely. Instead of sharing KV heads, MLA compresses the keys and values into a learned low-dimensional latent space — 512 dimensions instead of 14,000. 12 The model learns what information to keep and what to discard.
The KV cache drops from 213 GB to 7.6 GB. Unlike MQA's blunt sharing, MLA preserves per-head expressiveness through the learned compression. DeepSeek V2, V3, and R1 all use MLA — it's arguably the most important attention innovation since Flash Attention.
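The numbers above are self-consistent, which a quick ratio check confirms (figures taken directly from the text, nothing measured here):

```python
# KV width per token: full multi-head keys + values vs MLA's compressed latent.
full_width = 14_000        # dimensions quoted above
latent_width = 512         # MLA's learned latent dimension

ratio = full_width / latent_width
cache_ratio = 213 / 7.6    # GB figures quoted above

print(round(ratio, 1))         # ~27.3x per-token compression
print(round(cache_ratio))      # ~28x total cache shrinkage -- they agree
```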
Replacing Attention Entirely
Some researchers asked a more radical question: what if we skip attention altogether?
Attention remembers everything — but the cost grows forever. Mamba summarizes into a fixed state — fast, but it can forget.
State-space models like Mamba process sequences in linear time — no n × n matrix, no KV cache. They maintain a fixed-size hidden state that gets updated with each token, like a rolling summary. The cost is constant per token regardless of how long the sequence is.
The catch: pure SSMs struggle with precise recall. If you need to find one specific fact buried in 100,000 tokens, the fixed-size state can't always hold it. Attention excels at exactly this — reaching back to any specific position.
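The core recurrence is tiny. A scalar toy version (real Mamba uses learned, input-dependent matrices and a parallel scan; this is only the skeleton, and it shows the forgetting directly):

```python
def ssm_scan(xs, A=0.9, B=0.5):
    """Minimal state-space recurrence: a fixed-size state updated per token."""
    h = 0.0
    states = []
    for x in xs:
        h = A * h + B * x      # constant cost per token, however long the history
        states.append(h)
    return states

out = ssm_scan([1.0, 0.0, 0.0, 0.0])
# The first input's contribution decays geometrically: the rolling summary
# slowly forgets, which is exactly why precise long-range recall is hard.
```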
The solution: hybrid architectures. NVIDIA's Nemotron 3 replaces most layers with Mamba-2 and keeps only a few GQA attention layers (with just 2 KV heads) for precise retrieval. 14 The result: 1M-token context with 3.3× higher throughput than a pure transformer of similar size.
In practice, modern models stack all three strategies. Llama 3 uses Flash Attention + GQA. 6 DeepSeek V3 combines Flash + MLA + MoE. 12 GLM-4 uses Flash Attention + GQA for 128K–1M token contexts. 13 Nemotron 3 uses Mamba-2 + GQA + MoE. 14 The quadratic wall didn't fall in one blow. It was chipped away from every angle until it stopped mattering.
Shrinking the Numbers
Even with faster attention, there's a blunter cost: memory. Every parameter is a floating-point number. At full precision (FP32), each one takes 4 bytes. GPT-3's 175 billion parameters at FP32 = 700 GB — more than any single GPU can hold.
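The memory bill is a one-line formula:

```python
def model_memory_gb(n_params, bytes_per_param):
    """Raw weight storage: one number per parameter."""
    return n_params * bytes_per_param / 1e9

gpt3_fp32 = model_memory_gb(175e9, 4)   # full precision
gpt3_fp16 = model_memory_gb(175e9, 2)   # half precision

print(gpt3_fp32)   # 700.0 GB
print(gpt3_fp16)   # 350.0 GB
```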
The first trick: use smaller numbers during training. FP16 (16-bit floating point) cuts memory in half. But FP16 has a narrow dynamic range — gradients can overflow or underflow mid-training. BF16 (bfloat16) solved this: it keeps FP32's 8-bit exponent (same range) but shrinks the mantissa from 23 bits to 7 (less precision). The trade-off: you lose some decimal accuracy, but the numbers never blow up. Google designed BF16 specifically for deep learning, and by 2022 it was the default for most large-model training.
In practice, training uses both: the forward and backward passes run in BF16 for speed, but a master copy of the weights stays in FP32. The model thinks in low precision but remembers in full precision. This is mixed-precision training.
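NumPy can demonstrate FP16's narrow range directly (NumPy has no native bfloat16, so FP32 stands in below for the 8-bit exponent that BF16 shares with it):

```python
import numpy as np

# FP16's range tops out at 65504: larger values overflow to infinity,
# and very small gradients underflow to zero.
print(np.float16(70000))     # inf
print(np.float16(1e-8))      # 0.0

# FP32 (and BF16, same exponent width) handles both without blowing up.
print(np.float32(70000))     # 70000.0
print(np.float32(1e-8) > 0)  # True
```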
After training, you can compress further. INT8 quantization maps floating-point weights to 8-bit integers — 4× smaller than FP32, 2× smaller than FP16. Dettmers et al. showed this works on models up to 175B parameters with virtually no performance loss, using a clever trick: the ~0.1% of weights with extreme values stay in FP16, while the other 99.9% compress to INT8. 9
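A minimal absmax quantizer shows the core idea (this is only the basic scheme, not the outlier-aware mixed decomposition the paper adds on top):

```python
import numpy as np

def quantize_int8(w):
    """Absmax quantization: scale weights into [-127, 127] and round."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)     # 1 byte per weight: 4x smaller than FP32
```

The rounding error is bounded by half a quantization step — small for well-behaved weights, which is why the outlier columns are the ones that need special handling.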
INT4 pushes further — 8× compression from FP32. GPTQ showed you can compress a 175B model to 3–4 bits per parameter and run it on a single GPU for the first time. 10
A 70-billion parameter model that once required a server cluster now fits on a laptop with a gaming GPU. Quantization didn't just make AI cheaper — it democratized it.
Drag the slider to see how model size and precision format change the memory bill — and which hardware can actually hold the result.
INT4 fits on a data center GPU. FP32 needs a whole rack.
What a token really is
By 2020, the AI world had split in two.
Encoders like BERT: narrow tasks, short contexts, safe, reliable. You fine-tuned one model per problem and slept well at night.
Decoders like GPT-3: could do almost anything. Not reliably, but the range was staggering. Poetry, Python, legal briefs, meatball recipes — all from one model, no fine-tuning required. The ultimate autocomplete — stunningly capable, completely unreliable.
But something else was quietly brewing in the architecture.
Every Transformer — encoder, decoder, text-to-text — speaks the same language: tokens. And a token is just a number. Nothing in the math requires it to represent a word.
We taught the rock to read. We taught it to write. We grew it until the world noticed. What happens when we teach it to listen? To see?