How to teach a rock to understand

From short-term amnesia to Attention — how Transformers broke the AI industry wide open.

March 1, 2026

The context problem

Part 1 ended with a flaw: words like "play" got one meaning, no matter the context. The fix would need something that could read an entire sentence and decide which words matter most. That idea had a name: attention.


Attention Is All You Need

Before 2017, models read text one word at a time, in order — and slowly forgot what came before. By the time the model reached the end of a long sentence, the beginning was a ghost.

"The old professor who had taught at the university for over thirty years finally decided to retire after the long semester ended."

By word 20, word 1 is a ghost.

I was building NLP for Swedish customers at Almvy and hitting this wall constantly. The models could focus on nearby words, but long-range context just dissolved.

In 2017, a team at Google asked: what if the model could look at every word at once? Not one at a time — all of them, in parallel. Every word gets to decide which other words matter most to it.

"She wanted to play the piano but the football game was starting."

That's attention. And it changes everything — "play" finally means something different next to "football" than next to "piano."

[Visualization: Word2Vec pins "play" in one spot, stuck between the football cluster (score, match, team, sports) and the piano cluster (melody, concert, instrument, music). With context, "play" lands in the right cluster each time: right meaning, right place.]

They packaged this into a complete architecture called the Transformer. It had two halves: an Encoder that reads and understands the input, and a Decoder that generates new text.

[Diagram: a six-layer Encoder stack reads the input "Mon chat est roi"; a six-layer Decoder stack generates the output "My cat is king".]

The encoder processes the full input through multiple layers of attention, building a deeper representation at each step. The decoder uses its own attention layers to generate output one token at a time, while also cross-attending to the encoder's representation — connecting what it's writing to what it read.

But the Transformer wasn't just a translator. Google's T5 showed that if you frame every NLP task as text-to-text — translate this, summarize that, classify this — one architecture handles them all.

translate: Mon chat est roi → T5 → My cat is king

Same model. Different tasks. Just change the prefix.

This was the blueprint. But researchers quickly asked: do we actually need both halves?


The Great Split: Readers vs Writers

The research world ripped the Transformer apart.

[Diagram: the Transformer (2017) split in two. Google's BERT (2018) keeps the Encoder: bidirectional multi-head attention, add & norm, feed-forward, stacked ×N, topped with a task head. OpenAI's GPT (2018) keeps the Decoder: masked multi-head attention, forward-only, stacked ×N, topped with a linear + softmax layer that outputs next-token probabilities.]

Google took the Encoder and built BERT. Encoders are bidirectional — they read forward and backward simultaneously. You train them by masking words in a sentence and forcing the model to guess what's missing. This gave BERT a deep understanding of language structure, and it shattered every reading comprehension benchmark overnight.

When BERT dropped, my English experiments suddenly worked. Swedish? Nothing — no one had trained a Swedish model. Then in September 2019, I woke up high as a kite after a collarbone surgery, saw ALBERT beating the average human in English reading comprehension, and made a decision: I'm not waiting for the universities. I'm building this for Swedish myself. I did — and it beat the average human on the Swedish SAT.

Remember the geometry from Part 1? Direction encoded relationship — Man to Woman was the same arrow as King to Queen. Distance encoded similarity. But that geometry was frozen. Every word got one position, forever.

Attention blows that wide open. The same geometry — directions, distances, relationships — but now it's rebuilt from scratch for every sentence. Each layer of attention reshapes the space: moving words closer when they're related in this context, pushing them apart when they're not. Layer after layer, thousands of attention scores redrawing the map. The geometry from Part 1 was the foundation. This is the foundation on steroids.

Watch the vectors move. Each layer reshapes the space — words that relate pull together, words that don't drift apart. Is it finding grammar? Meaning? Some pattern no human would name? Twelve layers deep, thousands of attention scores firing, and the geometry keeps changing. Nobody fully knows what it's learning in there. But whatever it is — it works.

But how does it work? What's actually happening inside each of those 12 layers? Let's open the hood.


Inside the Mechanism

Every word "looking at" every other word sounds magical. The math is surprisingly simple — built on one operation: the dot product.

Deep dive: the actual math behind attention — understandable for anyone who knows multiplication and has patience.

Two vectors, 4 numbers each:

A = [0.8, 0.3, 0.6, 0.9]
B = [0.7, 0.4, 0.5, 0.8]

Multiply each pair, then sum them all:

0.8×0.7 + 0.3×0.4 + 0.6×0.5 + 0.9×0.8 = 0.56 + 0.12 + 0.30 + 0.72 = 1.70

High dot product = vectors point the same direction = similar. Low or negative = different. This is how attention decides which words are relevant to each other.

Multiply each pair, add them up. High result = similar. Low or negative = unrelated. Attention uses this to measure how relevant one word is to another.
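In code, that whole operation is one line. A minimal Python sketch using the same two vectors from the deep dive:

```python
# Dot product: multiply each pair, sum them all.
a = [0.8, 0.3, 0.6, 0.9]
b = [0.7, 0.4, 0.5, 0.8]

dot = sum(x * y for x, y in zip(a, b))
print(round(dot, 2))  # 1.7
```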

Think of it like a library. You walk in with a question — "I need something about helmets." Every book has a label — "Safety gear", "Sports", "History." You match your question against the labels, find the best matches, then read the content.

Each word produces three vectors from its embedding, using learned weight matrices:

  • Query (Q) — "what am I looking for?"
  • Key (K) — "what do I contain?"
  • Value (V) — "what information do I carry?"
Step 1: The word "smacked" starts as 8 numbers — its embedding:

smacked = [0.68, 0.23, 0.91, 0.45, 0.12, 0.77, 0.34, 0.56]

Step 2: Multiply by W_Q, the Query weight matrix. These 32 weights (8 rows × 4 columns) are learned during training. The matrix compresses 8 dimensions down to 4 — extracting just the information needed for the query:

W_Q =
[ 0.3  0.1 -0.2  0.5
  0.7 -0.4  0.6  0.2
 -0.1  0.8  0.3 -0.3
  0.4  0.2 -0.5  0.7
  0.6 -0.1  0.4  0.1
 -0.3  0.5  0.2  0.8
  0.2  0.3 -0.1 -0.4
  0.5 -0.2  0.7  0.3 ]

embedding (1×8) × W_Q (8×4) = Q (1×4) = [0.64, 1.16, 0.61, 1.09]

Step 3: Each output value is a dot product — one multiply-and-sum of the embedding against one column of W_Q.

That's the Query — what this word is looking for in other words.

Same process for Key and Value — same embedding, different learned matrices:

K = embedding × W_K = [0.87, 0.93, 0.29, 1.04] — "What do I contain?"
V = embedding × W_V = [0.64, 0.92, 0.62, 1.25] — "What information do I carry?"

(W_K and W_V are separate 8×4 matrices of learned weights, just like W_Q.)

Same embedding, three different weight matrices → three different vectors. The matrices are what the model learns during training.
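The three projections are just matrix multiplications. A minimal numpy sketch — the weight matrices here are random stand-ins, not the worked example's values; in a real model they are learned:

```python
import numpy as np

rng = np.random.default_rng(42)

d_model, d_head = 8, 4                                   # toy sizes from the example
x = np.array([0.7, 0.2, 0.9, 0.5, 0.1, 0.8, 0.3, 0.6])  # "smacked"'s embedding

# Three learned 8×4 weight matrices (random placeholders here)
W_Q = rng.standard_normal((d_model, d_head))
W_K = rng.standard_normal((d_model, d_head))
W_V = rng.standard_normal((d_model, d_head))

# Same embedding, three different matrices → three different 4-dim vectors
q, k, v = x @ W_Q, x @ W_K, x @ W_V
print(q.shape, k.shape, v.shape)  # (4,) (4,) (4,)
```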

Our example uses 8 dimensions compressed to 4. GPT-1 uses 768 → 64. GPT-3 uses 12,288 → 128. Same math, bigger numbers.

In practice, all words are processed simultaneously — one matrix multiplication:

All 6 embeddings packed into one matrix X (6 words × 8 dims):

X =
[ 0.4  0.2  0.7  0.3  0.6  0.9  0.2  0.6 ]  ← the
[ 0.9  0.3  0.7  0.8  0.1  0.5  0.8  0.4 ]  ← helmet
[ 0.7  0.2  0.9  0.5  0.1  0.8  0.3  0.6 ]  ← smacked
[ 0.3  0.8  0.4  0.1  0.6  0.4  0.7  0.2 ]  ← to
[ 0.5  0.1  0.4  0.7  0.5  0.3  0.6  0.8 ]  ← the
[ 0.8  0.6  0.3  0.9  0.4  0.7  0.2  0.7 ]  ← ground

Multiply all at once — not word by word, one matrix operation:

X × W_Q = Q (6×4), one Query row per word
X × W_K = K (6×4), same X, different weights
X × W_V = V (6×4), and again

Three matrix multiplications — that's it. All words processed in parallel. This is why transformers are fast: GPUs are built for matrix multiplication. RNNs process words one at a time. Transformers process them all at once.

Now every word has its Q, K, and V. "smacked"'s Query gets dot-producted with every word's Key to find which words matter. (The scores are divided by √d_k to keep softmax stable.)

Step 1: Dot product "smacked"'s Query with every Key → how relevant is each word?

Q(smacked) = [0.64, 1.16, 0.61, 1.09]

Q · K(the) = 3.05
Q · K(helmet) = 3.07
Q · K(smacked) = 2.94
Q · K(to) = 2.82
Q · K(the) = 2.87
Q · K(ground) = 3.17

Step 2: Divide by √d_k — the square root of the key dimension (√4 = 2.0) — to keep scores from exploding, then softmax → probabilities:

the 17% · helmet 17% · smacked 16% · to 15% · the 16% · ground 18%

"smacked" pays 18% attention to "ground".

The attention weights tell us HOW MUCH each word matters. Multiply each word's Value vector by its attention weight, then add all the weighted values together:

0.18 × V(ground) + 0.17 × V(helmet) + 0.17 × V(the) + 0.16 × V(smacked) + 0.16 × V(the) + 0.15 × V(to)

Attention output for "smacked" = [0.59, 1.02, 0.81, 1.23]
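The whole pipeline — Q·K scores, scaling by √d_k, softmax, weighted sum of Values — fits in a few lines. A sketch with random stand-in weights (not the worked example's learned values):

```python
import numpy as np

def attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product attention over all words at once."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # every Query · every Key, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V                              # weighted mix of Value vectors

rng = np.random.default_rng(0)
X = rng.random((6, 8))                              # 6 words × 8 dims, as in the example
W_Q, W_K, W_V = (rng.standard_normal((8, 4)) for _ in range(3))
out = attention(X, W_Q, W_K, W_V)
print(out.shape)  # (6, 4): one context-mixed vector per word
```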

That's one attention head. The 8→4 compression loses information, so we run multiple heads in parallel — each learning different patterns. Their outputs concatenate back to 8:

Every word produces an attention output. Here's Head 1 (what we just computed):

the     → [0.63, 0.98, 0.79, 1.25]
helmet  → [0.56, 1.06, 0.82, 1.20]
smacked → [0.59, 1.02, 0.81, 1.23]
to      → [0.66, 0.95, 0.77, 1.27]
the     → [0.61, 1.01, 0.79, 1.23]
ground  → [0.61, 1.02, 0.78, 1.23]

6 words × 4 values = 24 outputs, built from 36 attention scores (6×6).
Head 2 has different W_Q, W_K, W_V — it learns different patterns and produces its own 4-number vector per word (e.g. smacked → [0.43, 0.35, 0.33, 0.45]).

Concatenate both heads → back to 8 dimensions, the original embedding size. For "smacked":

[0.6, 1.0, 0.8, 1.2] + [0.4, 0.4, 0.3, 0.4] → [0.6, 1.0, 0.8, 1.2, 0.4, 0.4, 0.3, 0.4]

4 + 4 = 8 dimensions. Each word now carries context from the words it attended to.

GPT-3 uses 96 heads × 128 dimensions each = 12,288. Every head computes its own attention matrix (6×6 in our example). That's 96 matrices per layer, across 96 layers.
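Multiple heads are just the same computation repeated with different weights, their outputs glued back together. A sketch with 2 heads of 4 dims each and random stand-in weights:

```python
import numpy as np

def one_head(X, W_Q, W_K, W_V):
    """Scaled dot-product attention for a single head."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)              # softmax over each row
    return w @ V

def multi_head(X, heads):
    # Each head attends independently; concatenation restores the full width.
    return np.concatenate([one_head(X, *h) for h in heads], axis=-1)

rng = np.random.default_rng(1)
X = rng.random((6, 8))                              # 6 words × 8 dims
heads = [tuple(rng.standard_normal((8, 4)) for _ in range(3)) for _ in range(2)]
out = multi_head(X, heads)
print(out.shape)  # (6, 8): 2 heads × 4 dims, back to the embedding size
```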

But attention is only half of a transformer layer. The output needs to be stabilized and processed further.

Preserving the Signal

The attention output is valuable — but raw. Before anything else, we add the original input back and normalize:

Two things go in — the original input and the attention output:

original  = [0.68, 0.23, 0.91, 0.45, 0.12, 0.77, 0.34, 0.56]
attention = [0.59, 1.02, 0.81, 1.23, 0.74, 0.74, 0.72, 0.98]

Add — element by element:

[1.27, 1.25, 1.72, 1.68, 0.86, 1.51, 1.06, 1.54]

This "skip connection" preserves the original signal. Without it, information dissolves through 96 layers.

Normalize — rescale to mean ≈ 0:

before: [1.3, 1.2, 1.7, 1.7, 0.9, 1.5, 1.1, 1.5]
after:  [-0.31, -0.40, 1.27, 1.11, -1.76, 0.52, -1.07, 0.64]

Layer norm keeps values stable. Without it, numbers would grow out of control after dozens of additions.

This "add and normalize" pattern happens twice per layer. It's the transformer's immune system — keeping the signal healthy through 96 layers of processing.
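The same add-and-normalize step, sketched in numpy with the vectors from the example above:

```python
import numpy as np

def add_and_norm(x, sublayer_out, eps=1e-5):
    """Residual connection, then layer norm across the feature dimension."""
    y = x + sublayer_out                            # skip connection: keep the signal
    mean = y.mean(axis=-1, keepdims=True)
    std = y.std(axis=-1, keepdims=True)
    return (y - mean) / (std + eps)                 # rescale to mean ≈ 0, std ≈ 1

original  = np.array([0.68, 0.23, 0.91, 0.45, 0.12, 0.77, 0.34, 0.56])
attention = np.array([0.59, 1.02, 0.81, 1.23, 0.74, 0.74, 0.72, 0.98])

out = add_and_norm(original, attention)
print(np.round(out, 2))
```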

The Thinking Step

Attention decided which words to mix. Now each word processes that mixture independently — through a small neural network:

[Diagram: the normalized vector (-0.3, -0.4, 1.3, 1.1, -1.8, 0.5, -1.1, 0.6) flows through a small network: Input (8) → Hidden (32) → Output (8).]

No cross-word interaction — each word is processed independently. Our tiny FFN has 512 weights (8×32 + 32×8). The attention Q/K/V matrices have only 96 (3 × 8×4). The FFN is 5× bigger — and that ratio holds at any scale.

After the FFN, the same add-and-normalize pattern repeats (adding back the pre-FFN values). That's one complete layer.
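The rest of the layer — the position-wise FFN plus the second add-and-normalize — can be sketched like this, with random stand-in weights and the 8→32→8 shapes from the toy example:

```python
import numpy as np

def layer_norm(y, eps=1e-5):
    """Normalize each word's vector to mean ≈ 0, std ≈ 1."""
    return (y - y.mean(axis=-1, keepdims=True)) / (y.std(axis=-1, keepdims=True) + eps)

def ffn(x, W1, b1, W2, b2):
    # Position-wise: the same little network applied to each word independently.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2     # expand 8→32, ReLU, project 32→8

rng = np.random.default_rng(2)
x = rng.random((6, 8))                              # 6 words, post-attention + norm
W1, b1 = rng.standard_normal((8, 32)), np.zeros(32)
W2, b2 = rng.standard_normal((32, 8)), np.zeros(8)

out = layer_norm(x + ffn(x, W1, b1, W2, b2))        # second add-and-norm of the layer
print(out.shape)  # (6, 8)
```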

Putting It All Together

Does it actually learn? Below, a 2-layer transformer trains live — right here in your browser, on the same fish sentences from Part 1. Watch the attention patterns emerge from noise, the loss drop, and the model learn which words fit which contexts.


That's what you saw in the 3D visualization above — 12 layers of this, each one reshaping the geometry. Attention decides which words to mix. The FFN decides what to do with the mixture. Residual connections keep the signal alive. Layer by layer, the representations get richer.


OpenAI took the other half. They ripped out the decoder, threw away the encoder, and built GPT. It couldn't read as well as BERT — it lost every benchmark. But the decoder could do something the encoder couldn't: write. And OpenAI had a plan for that.

Read Part 3: How to teach a rock to write