The context problem
Part 1 ended with a flaw: words like "play" got one meaning, no matter the context. The fix would need something that could read an entire sentence and decide which words matter most. That idea had a name: attention.
Attention Is All You Need
Before 2017, models read text one word at a time, in order — and slowly forgot what came before. By the time the model reached the end of a long sentence, the beginning was a ghost.
I was building NLP for Swedish customers at Almvy and hitting this wall constantly. The models could focus on nearby words, but long-range context just dissolved.
In 2017, a team at Google asked: what if the model could look at every word at once? Not one at a time — all of them, in parallel. Every word gets to decide which other words matter most to it.
That's attention. And it changes everything — "play" finally means something different next to "football" than next to "piano."
They packaged this into a complete architecture called the Transformer. It had two halves: an Encoder that reads and understands the input, and a Decoder that generates new text.
The encoder processes the full input through multiple layers of attention, building a deeper representation at each step. The decoder uses its own attention layers to generate output one token at a time, while also cross-attending to the encoder's representation — connecting what it's writing to what it read.
But the Transformer wasn't just a translator. Google's T5 showed that if you frame every NLP task as text-to-text — translate this, summarize that, classify this — one architecture handles them all.
This was the blueprint. But researchers quickly asked: do we actually need both halves?
The Great Split: Readers vs Writers
The research world ripped the Transformer apart.
Google took the Encoder and built BERT. Encoders are bidirectional — they read forward and backward simultaneously. You train them by masking words in a sentence and forcing the model to guess what's missing. This gave BERT a deep understanding of language structure, and it shattered every reading comprehension benchmark overnight.
When BERT dropped, my English experiments suddenly worked. Swedish? Nothing — no one had trained a Swedish model. Then in September 2019, I woke up high as a kite after a collarbone surgery, saw ALBERT beating the average human in English reading comprehension, and made a decision: I'm not waiting for the universities. I'm building this for Swedish myself. I did — and it beat the average human on the Swedish SAT.
Remember the geometry from Part 1? Direction encoded relationship — Man to Woman was the same arrow as King to Queen. Distance encoded similarity. But that geometry was frozen. Every word got one position, forever.
Attention blows that wide open. The same geometry — directions, distances, relationships — but now it's rebuilt from scratch for every sentence. Each layer of attention reshapes the space: moving words closer when they're related in this context, pushing them apart when they're not. Layer after layer, thousands of attention scores redrawing the map. The geometry from Part 1 was the foundation. This is the foundation on steroids.
Watch the vectors move. Each layer reshapes the space — words that relate pull together, words that don't drift apart. Is it finding grammar? Meaning? Some pattern no human would name? Twelve layers deep, thousands of attention scores firing, and the geometry keeps changing. Nobody fully knows what it's learning in there. But whatever it is — it works.
But how does it work? What's actually happening inside each of those 12 layers? Let's open the hood.
Inside the Mechanism
Every word "looking at" every other word sounds magical. The math is surprisingly simple — built on one operation: the dot product.
Deep dive: the actual math behind attention (understandable for anyone who knows multiplication and has patience).
Take two vectors, multiply each pair of components, add them up. High result = similar. Low or negative = unrelated. Attention uses this to measure how relevant one word is to another.
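In plain Python, with made-up toy vectors (the numbers here are invented for illustration, not real embeddings):

```python
# Dot product: multiply each pair of components, add them up.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy 4-dimensional "embeddings" (invented values for illustration).
play_sport = [0.9, 0.1, 0.8, 0.2]
football   = [0.8, 0.0, 0.9, 0.1]
piano      = [-0.7, 0.9, -0.6, 0.8]

print(dot(play_sport, football))  # high: related
print(dot(play_sport, piano))     # negative: unrelated
```

Similar directions give a high score, opposed directions a negative one. That single operation is the whole engine.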
Think of it like a library. You walk in with a question — "I need something about helmets." Every book has a label — "Safety gear", "Sports", "History." You match your question against the labels, find the best matches, then read the content.
Each word produces three vectors from its embedding, using learned weight matrices:
- Query (Q) — "what am I looking for?"
- Key (K) — "what do I contain?"
- Value (V) — "what information do I carry?"
The Key and Value are produced the same way: same embedding, different learned weight matrices.
Our example compresses 8 dimensions down to 4. GPT-1 uses 768 → 64. GPT-3 uses 12,288 → 128. Same math, bigger numbers.
In practice, all words are processed simultaneously — one matrix multiplication per projection, not a loop over words.
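A rough sketch in NumPy with the article's toy sizes (8-dimensional embeddings projected down to 4), using random stand-ins for the learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model, d_head = 5, 8, 4       # 5 words, 8-dim embeddings, 4-dim heads
X = rng.normal(size=(seq_len, d_model))  # one embedding per word

# Three learned projection matrices (random here; learned during training).
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))

# One matrix multiplication each: every word gets its Q, K, V at once.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)  # (5, 4) each
```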
Now every word has its Q, K, and V. Take "smacked": its Query gets dot-producted with every word's Key to find which words matter. (The scores are divided by √d_k to keep softmax stable.)
The attention weights tell us HOW MUCH each word matters. Multiply each word's Value by its weight and add them up; that weighted sum becomes the word's new representation.
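The two steps together, scores then weighted sum, as a NumPy sketch (random toy Q, K, V standing in for the real projections):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 5, 4
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

# Scores: every word's Query against every word's Key, scaled by sqrt(d_k).
scores = Q @ K.T / np.sqrt(d_k)

# Softmax turns each row of scores into attention weights that sum to 1.
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)

# Output: each word's new vector is a weighted sum of all the Values.
out = weights @ V
print(out.shape)  # (5, 4): one new vector per word
```

Each row of `weights` is one word's attention distribution over the whole sentence: how much of every other word it pulls into its new representation.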
That's one attention head. The 8 → 4 compression loses information, so we run multiple heads in parallel — each learning different patterns — and concatenate their outputs back to 8 dimensions.
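A minimal multi-head sketch, again with random stand-in weights: 2 heads of 4 dimensions each, concatenated back to 8.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 8, 2
d_head = d_model // n_heads              # 8 -> 4 per head

X = rng.normal(size=(seq_len, d_model))

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

# Each head has its own projections, so each can learn different patterns.
heads = []
for _ in range(n_heads):
    W_Q = rng.normal(size=(d_model, d_head))
    W_K = rng.normal(size=(d_model, d_head))
    W_V = rng.normal(size=(d_model, d_head))
    heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))

# Concatenate the head outputs back to the model dimension.
out = np.concatenate(heads, axis=-1)
print(out.shape)  # (5, 8)
```

(Real implementations also apply one more learned matrix to the concatenated output; that's omitted here for brevity.)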
But attention is only half of a transformer layer. The output needs to be stabilized and processed further.
Preserving the Signal
The attention output is valuable — but raw. Before anything else, we add the original input back (a residual connection) and normalize.
This "add and normalize" pattern happens twice per layer. It's the transformer's immune system — keeping the signal healthy even through the 96 layers of a model like GPT-3.
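A bare-bones sketch of the pattern (real layer norm also has learned scale and shift parameters, omitted here):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each word's vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))             # input to the sublayer
sublayer_out = rng.normal(size=(5, 8))  # e.g. the attention output

# Residual connection + normalization: the original signal survives
# no matter what the sublayer did to it.
y = layer_norm(x + sublayer_out)
print(y.shape)  # (5, 8)
```

The residual add is the key: even if a sublayer learns nothing useful, the input passes through unchanged, which is what lets very deep stacks train at all.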
The Thinking Step
Attention decided which words to mix. Now each word processes that mixture independently, through a small neural network — the feed-forward network, or FFN.
After the FFN, the same add-and-normalize pattern repeats (adding back the pre-FFN values). That's one complete layer.
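A sketch of the FFN, assuming the common expand-then-compress shape (a hidden layer 4× wider than the model dimension, with ReLU) and random stand-in weights:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 8, 32  # hidden layer is typically 4x wider

x = rng.normal(size=(seq_len, d_model))
W1 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))

# Each word is processed independently: expand, nonlinearity, compress.
hidden = np.maximum(0, x @ W1)  # ReLU
ffn_out = hidden @ W2
print(ffn_out.shape)  # (5, 8): same shape as the input, so the residual add works
```

Note that unlike attention, nothing here mixes words: the same little network is applied to every word's vector on its own.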
Putting It All Together
Does it actually learn? Train a 2-layer transformer live, right here in your browser, on the same fish sentences from Part 1. Watch the attention patterns emerge from noise, the loss drop, and the model learn which words fit which contexts.
That's what you saw in the 3D visualization above — 12 layers of this, each one reshaping the geometry. Attention decides which words to mix. The FFN decides what to do with the mixture. Residual connections keep the signal alive. Layer by layer, the representations get richer.
OpenAI took the other half. They ripped out the decoder, threw away the encoder, and built GPT. It couldn't read as well as BERT — it lost every benchmark. But the decoder could do something the encoder couldn't: write. And OpenAI had a plan for that.