The other half
Part 2 ended with the encoder — BERT, the reader. Google took one half of the Transformer and built the best reading comprehension machine the world had ever seen.
OpenAI took the other half: the decoder.
Where the encoder sees every word at once — bidirectional, the full picture — the decoder wears a blindfold. Each token can only attend to what came before it. A causal mask blocks everything ahead. Its entire existence is one task: predict the next token.
Watch the difference. The same geometry from Part 2 — directions, distances, patterns emerging layer by layer — but with fewer connections. Each token only reaches backward. No cheating, no looking ahead. Just: given everything so far, what comes next?
In 2018, OpenAI released GPT-1 [1] — a 117-million-parameter decoder trained on books. It was fine. Not impressive. BERT wiped the floor with it on every reading comprehension benchmark.
A year later, GPT-2 [2] scaled to 1.5 billion parameters — 13× bigger — and trained on web text instead of just books. It could write coherent paragraphs. Creative stories. Fake news articles so convincing that OpenAI initially refused to release the full model. But it hallucinated constantly — confident nonsense dressed up as fact. And on structured tasks, BERT still won.
The encoder was the better reader. The decoder was the better writer. And writers, it turns out, scale.
The infinite autocomplete
In 2019, most researchers still bet on the encoder. BERT owned the benchmarks. The decoder was the architecture that rambled.
OpenAI bet on scale. GPT-3: 175 billion parameters. [3]
The World's Most Expensive Autocomplete
Here's the thing about GPT-3 that blew people's minds: it was never programmed to do anything specific. No one told it what French grammar looks like. No one trained it to write code. Its entire mathematical existence was dedicated to one beautifully simple task: guess the next token.
Given the text so far, assign a probability to every possible next word — then roll the dice. "The capital of France is..." → Paris (97.9%).
It doesn't always pick the most likely token. It samples — rolls weighted dice, where probable tokens win more often but surprises happen.
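The whole loop fits in a few lines. The tokens and probabilities below are made up for illustration; a real model scores tens of thousands of candidate tokens at every step.

```python
import random

# Toy next-token distribution for "The capital of France is ..."
# (tokens and probabilities are illustrative, not from a real model)
probs = {"Paris": 0.979, "Lyon": 0.010, "located": 0.008, "beautiful": 0.003}

def greedy_pick(probs):
    """Always take the single most likely token."""
    return max(probs, key=probs.get)

def sample_token(probs):
    """Roll weighted dice: likely tokens win more often, but surprises happen."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return random.choices(tokens, weights=weights, k=1)[0]

print(greedy_pick(probs))   # always "Paris"
print(sample_token(probs))  # usually "Paris", occasionally something else
```

String that second function into a loop, feeding each sampled token back in as input, and you have text generation.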
Trained on the internet — blogs, forums, code repositories, Reddit arguments — GPT-3 learned that when text looks like a recipe, you finish a recipe. When it looks like Python, you write Python. String enough of these dice rolls together and the output starts to look like intelligence. Not understanding — extraordinarily sophisticated pattern matching.
The Illusion of Knowledge and the Hallucination Problem
I wasn't just reading about these models — I was training them. In late 2019, I'd built a Swedish language model on borrowed TPUs. I asked it for a meatball recipe. It started perfectly — then told me to fold in the lingonberry jam before frying. Any Swede just winced: lingonberry goes on the plate, never in the pan. Harmless. But the same failure mode — hallucination, where the model doesn't know what's true, only what sounds true — gets dangerous fast. Ask for medical advice and you'll get a confident dosage that could kill someone. The model produces fluent text with identical confidence whether it's right or wrong.
But Which Token?
The model assigns probabilities — but how do we actually pick? The simplest approach, greedy decoding, always takes the most likely token. Safe, but robotic. Researchers built a toolkit of strategies to control the randomness.
Temperature scales the entire distribution. Low temperature sharpens it — the model locks onto the safest answer. High temperature flattens it — every token gets a fighting chance. Turn it up far enough and an absurd answer like "2000mg" goes from impossible to plausible. Same model, same question — just a different number in a config file.
This isn't just theory. Google's Gemini 3 defaults to temperature 1.0 and explicitly warns that lowering it "may lead to unexpected behavior, such as looping or degraded performance." [4] By 2025, temperature 1.0 had become the industry standard — not a creative choice, but an engineering requirement.
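Under the hood, temperature is a one-line change: divide every logit by T before the softmax. The logits here are invented for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/T before softmax: low T sharpens, high T flattens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits for three candidate tokens, best first
logits = [4.0, 2.0, 0.5]
print(softmax_with_temperature(logits, 0.5))  # sharply peaked on the first token
print(softmax_with_temperature(logits, 2.0))  # much flatter: rare tokens gain ground
```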
Top-k is a hard filter: only consider the k most likely tokens, ignore everything else. Simple but effective.
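As a sketch, with a toy distribution: sort, cut, renormalize.

```python
def top_k_filter(probs, k):
    """Keep only the k most likely tokens and renormalize them to sum to 1."""
    kept = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

# Illustrative distribution; "banana" never survives the cut
probs = {"Paris": 0.6, "Lyon": 0.2, "located": 0.15, "banana": 0.05}
print(top_k_filter(probs, 2))  # {'Paris': 0.75, 'Lyon': 0.25}
```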
Top-p (nucleus sampling) is smarter — keep adding tokens from the top until their combined probability reaches a threshold. This adapts to the shape of the distribution: when the model is confident, fewer tokens survive. When it's uncertain, more get through.
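The adaptive behavior is the whole point, so the sketch below runs the same threshold against a confident and an uncertain toy distribution.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {t: q / total for t, q in kept.items()}

confident = {"Paris": 0.9, "Lyon": 0.05, "Nice": 0.05}
uncertain = {"red": 0.3, "blue": 0.3, "green": 0.25, "gray": 0.15}
print(len(top_p_filter(confident, 0.9)))  # 1 token survives
print(len(top_p_filter(uncertain, 0.9)))  # 4 tokens survive
```

Same p = 0.9, very different cutoffs — that's what a fixed top-k can't do.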
Repetition penalty solves a different problem: without it, the model loves to repeat itself. Penalize tokens that already appeared, and the output stays fresh.
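One common formulation (there are several) divides the logit of every already-generated token by a penalty factor, or multiplies it when the logit is negative, so repeats always get less likely:

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Push down logits of tokens that already appeared in the output."""
    out = list(logits)
    for i in set(generated_ids):
        # Dividing a negative logit would *raise* it, hence the sign check
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out

logits = [3.0, 2.0, -1.0]
print(apply_repetition_penalty(logits, generated_ids=[0, 2]))  # [2.5, 2.0, -1.2]
```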
And then there's beam search — instead of committing to one token at a time, it explores multiple paths simultaneously and picks the best complete sequence.
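A minimal sketch. The `next_dist` function below is a hypothetical stand-in for a real model; it's rigged so that the locally best first token leads to a worse full sequence, which is exactly the situation beam search exists for.

```python
import math

def beam_search(next_dist, steps, beam_width=2):
    """next_dist(prefix) -> {token: prob}. Track beam_width prefixes in parallel,
    scored by total log-probability, and return the best complete sequence."""
    beams = [((), 0.0)]  # (token tuple, log-probability)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, prob in next_dist(prefix).items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        # keep only the best beam_width partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return list(beams[0][0])

def next_dist(prefix):
    """Toy 'model': the greedy first pick ('the') leads to a mediocre ending."""
    if prefix == ():
        return {"the": 0.6, "a": 0.4}
    if prefix == ("the",):
        return {"thing": 0.5, "stuff": 0.5}
    return {"answer": 0.99, "question": 0.01}

print(beam_search(next_dist, steps=2))  # ['a', 'answer']
```

Greedy decoding would commit to "the" (probability 0.6) and end with joint probability at most 0.3; the beam keeps "a" alive long enough to find the 0.4 × 0.99 ≈ 0.396 sequence.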
Prompt Engineering: Whispering to the Machine
There was another problem. GPT-3 didn't know you were talking to it. It thought it was finishing a web page.
If you typed "Translate 'Where is the library?' to Swedish," the model might generate more questions — because it decided you were writing a list of exam questions. Or it might start writing a Wikipedia article about the Swedish language. It wasn't broken. It was completing the page.
To get useful output, you had to format your text so that the most likely continuation was the answer you needed. The community called this Prompt Engineering — half science, half black magic. Show the model a pattern, and it would continue it.
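A hypothetical few-shot prompt for the translation case might look like this: the examples establish the pattern, so the most likely continuation of the final line is the Swedish translation you wanted.

```python
# Illustrative few-shot prompt (the example pairs are mine, not from the article)
prompt = """English: Good morning.
Swedish: God morgon.

English: Thank you very much.
Swedish: Tack så mycket.

English: Where is the library?
Swedish:"""

print(prompt)
```

The prompt deliberately ends mid-pattern, right where the answer belongs.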
One model, no fine-tuning — just a different prompt for marketing copy, code, translation, or reading comprehension. It was yxigt — a Swedish word for "angular and clunky in a way that makes it awkward to handle." Half the time you'd get brilliant code; the other half, a confidently written conspiracy theory about 5G towers. But when it worked, you saved hours. When it didn't, you lost thirty seconds reading garbage. That asymmetry was enough to build an entire industry on.
Why Not 200 Examples?
If two examples steer the model that well, why not paste two hundred? Because of a wall hidden inside the Transformer itself.
Remember attention from Part 2? Every token looks at every other token to decide what matters. That's an n × n matrix — and n is the number of tokens in your prompt plus everything the model has generated so far. Double the context, quadruple the memory.
GPT-3's context window was 2,048 tokens — roughly 1,500 words. [3] Your two hundred examples, the instructions, and the answer all have to fit inside that box. Cram in too much and the model simply runs out of room mid-sentence. This is why prompt engineering was an art of compression: say more with fewer tokens.
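A back-of-the-envelope sketch of the quadratic cost. The head count (96) and fp16 precision are my assumptions at roughly GPT-3 scale, and this counts only one layer's raw attention scores, ignoring everything else the model stores:

```python
def attention_matrix_bytes(n_tokens, n_heads=96, bytes_per_value=2):
    """Memory for one layer's n x n attention scores across all heads.
    96 heads and 2-byte (fp16) values are illustrative assumptions."""
    return n_tokens ** 2 * n_heads * bytes_per_value

for n in (1024, 2048, 4096):
    print(f"{n} tokens -> {attention_matrix_bytes(n) / 2**20:.0f} MiB per layer")
```

Each doubling of the context quadruples the figure — the wall the paragraph above describes.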
The mechanism was powerful but crude. The natural question: what happens when you make it bigger?
Read Part 4: How to grow a rock →