
How to teach a rock to talk

ChatGPT, RLHF, and the art of giving a calculator time to think — how the autocomplete learned to talk back.

March 1, 2026

The autocomplete that didn't know you were there

Part 5 ended with a rock that could read, write, see, hear, and dream. It understood images well enough to describe them. It understood speech well enough to transcribe it. It could generate art that made people stop scrolling.

But it still didn't know you were talking to it.

GPT-3 was the most capable AI system anyone had ever built — and it thought it was finishing a web page. Every lab in the world noticed. Meta trained massive models and released them open-source. NVIDIA built Megatron. Mistral appeared in France. Hugging Face talked about training a giant community model. Money poured in. But nobody quite matched GPT-3's generality. The decoder architecture had won — but it was still Total Roulette. Brilliant one moment, confidently wrong the next.

Something had to change. Not in the architecture — in the training.


Teaching it to talk back

The idea was deceptively simple: fine-tune the model to be a chatbot. Instead of completing web pages, teach it to complete conversations. Many researchers had tried this — formatting chat transcripts so the model would learn the back-and-forth pattern. It worked, sort of. The model would stay in character longer. But it still drifted, still hallucinated, still gave answers no human would find helpful.
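The transcript-formatting trick can be sketched in a few lines. Everything here is illustrative: the "User:"/"Assistant:" template is a generic stand-in, not any lab's actual chat format.

```python
# Chat-style fine-tuning data: a conversation is flattened into one text
# string that the model learns to continue. The template below is a
# generic illustration, not any particular lab's format.

def format_transcript(turns):
    """Render a list of (speaker, text) turns as one training string."""
    lines = [f"{speaker}: {text}" for speaker, text in turns]
    # The trailing "Assistant:" cue is what the model learns to complete.
    return "\n".join(lines) + "\nAssistant:"

prompt = format_transcript([
    ("User", "Translate 'Good morning' to Swedish"),
    ("Assistant", "God morgon"),
    ("User", "Translate 'Thank you' to Swedish"),
])
print(prompt)
```

The model still only predicts the next token; the format just makes "the next token" mean "the assistant's reply."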

The breakthrough was RLHF — Reinforcement Learning from Human Feedback. 1 Humans compare pairs of model responses and pick the better one; those comparisons train a second model, a reward model, whose entire job is to answer one question: "Would a human like this response?"

The setup mirrors something from Part 5: when we made a rock dream, we used CLIP to judge BigGAN's drawings — a critic steering a creator. RLHF is the same idea, but for text. The language model generates a response. The reward model scores it. The language model updates. Generate, judge, improve. Generate, judge, improve.
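The generate, judge, improve loop can be sketched as a toy. Everything here is a stand-in: the "policy" is a softmax over three canned responses, reward_model() is hand-written where a real one would be trained on human rankings, and the update is bare REINFORCE where real RLHF uses an algorithm like PPO on a full transformer.

```python
import math
import random

random.seed(0)  # deterministic run for reproducibility

responses = ["I don't know.", "Paris.", "The capital of France is Paris."]
logits = [0.0, 0.0, 0.0]  # one policy parameter per canned response

def reward_model(text):
    """Stand-in critic: prefers direct, informative answers."""
    if "capital" in text and "Paris" in text:
        return 2.0
    if "Paris" in text:
        return 1.0
    return 0.0

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

baseline = 0.0  # running average reward, reduces update variance
lr = 0.5

for _ in range(500):
    probs = softmax(logits)
    i = random.choices(range(3), weights=probs)[0]  # generate
    r = reward_model(responses[i])                  # judge
    baseline += 0.1 * (r - baseline)
    for j in range(3):                              # improve (REINFORCE)
        indicator = 1.0 if j == i else 0.0
        logits[j] += lr * (r - baseline) * (indicator - probs[j])

best = max(range(3), key=lambda j: logits[j])
print(responses[best])  # the highest-reward response should win
```

The loop never tells the policy what to say, only how much the critic liked what it said — which is exactly why the reward model's taste ends up shaping the language model's behavior.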

The result was ChatGPT. And overnight, everything changed.

Not because the model was fundamentally smarter — it was still GPT-3.5 under the hood. But because it listened. You could ask a question in plain language and get a direct answer. No prompt engineering. No formatting tricks. No pretending to be a web page.

Same question, different era.

GPT-3 (2020):

    The following is a translation task.

    English: Good morning
    Swedish: God morgon
    English: Thank you
    Swedish: Tack

    English: Where is the library?
    Swedish:

ChatGPT (2022):

    User: Translate "Where is the library?" to Swedish
    Assistant: Var ligger biblioteket?

Same answer. The model learned to listen.

Suddenly steerable. Suddenly accessible to anyone with a browser. Still small, still dangerous, still confidently wrong about things it shouldn't be — but usable. The asymmetry that built the prompt engineering industry was now available to everyone: when it worked, you saved hours. When it didn't, you lost thirty seconds.

ChatGPT reached 100 million users faster than any product in history. 2 The world noticed. Not just engineers — everyone.


Giving it time to think

There was still a problem. The model answered reflexively — like asking someone to respond instantly, no time to think. Good at things it "just knew" from training. Terrible at anything requiring reasoning.

Ask GPT-3.5 a logic puzzle and it would confidently blurt out the wrong answer. Not because it couldn't reason — because it was never given time to reason. Every token it generated was an immediate gut reaction, no scratch paper allowed.

Researchers at Google found the fix: chain-of-thought prompting. 3 Instead of jumping to the answer, let the model reason step by step. "Think about this first, then answer."

A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?

The reflexive answer: "$0.10". Fast, but wrong. Step by step: call the ball x. The bat is x + $1.00, so together they cost 2x + $1.00 = $1.10, which gives x = $0.05. The ball costs five cents.
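Chain-of-thought is a change to the prompt, not the model. A minimal sketch of the two prompt styles, plus the arithmetic the second one is meant to elicit — the exact wording is illustrative, not any paper's template:

```python
question = (
    "A bat and a ball cost $1.10 in total. "
    "The bat costs $1.00 more than the ball. "
    "How much does the ball cost?"
)

# Reflexive: the model must commit to an answer in its very first tokens.
direct_prompt = f"Q: {question}\nA:"

# Chain of thought: invite intermediate reasoning before the answer.
cot_prompt = f"Q: {question}\nA: Let's think step by step."

# The reasoning those extra tokens buy, done here as plain arithmetic:
# ball = x, bat = x + 1.00, so x + (x + 1.00) = 1.10  =>  2x = 0.10.
ball = (1.10 - 1.00) / 2
print(f"The ball costs ${ball:.2f}")  # not the reflexive $0.10
```

The only difference between the two prompts is a few words of invitation — but those words give the model tokens to "think in" before it commits to an answer.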

The improvement was dramatic. Problems that models got wrong reflexively, they solved when given room to think. And later, OpenAI's o1 5 baked thinking directly into the model — it would generate reasoning tokens before answering, like scribbling on a notepad before committing to an answer.

This was a paradigm shift. For years, progress meant bigger models trained on more data. But the internet is finite — you can't just keep feeding it more text. The new frontier wasn't bigger models. It was smarter inference — more compute at answer time, not training time.


Teaching it to reach

A chatbot that thinks is still a chatbot. It can only work with what's already in its weights — the compressed memory of everything it read during training. Ask it about today's weather and it'll tell you about climate patterns. Ask about a specific person and it'll hallucinate a plausible biography.

Google's LaMDA 4 and Meta's Toolformer 6 pointed the way out: let the model use tools. Instead of answering from memory alone, let it decide "I need to search for this" — and actually search.

User: What's the weather in Stockholm right now?
Model (thinking): I don't have real-time weather data. I need to search.
Tool call: search("Stockholm weather today")
Result: Stockholm: 4°C, partly cloudy, wind 12 km/h
Model (thinking): Now I can answer with current data.
Answer: It's currently 4°C in Stockholm with partly cloudy skies and light wind at 12 km/h.

The model reads the question. Realizes it doesn't know the answer. Generates a tool call — a structured request to search the web, query a database, check a calendar, read an email. Gets the result. Integrates it. Maybe makes another tool call. Finally answers — with real, grounded information.
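That read-act-answer cycle can be sketched as a dispatcher loop. Both pieces are stand-ins for illustration: model() is a hand-written stub where a real LLM would decide between answering and calling a tool, and search() returns canned data.

```python
# Sketch of a tool-use loop: the model either answers or emits a tool
# call; the loop runs the tool and feeds the result back as a message.

def search(query):
    """Fake search tool with canned results, standing in for a real API."""
    return "Stockholm: 4°C, partly cloudy, wind 12 km/h"

TOOLS = {"search": search}

def model(messages):
    """Stub model: requests a search first, then answers from the result."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "search", "args": ["Stockholm weather today"]}
    result = next(m["content"] for m in messages if m["role"] == "tool")
    return {"answer": f"Current conditions: {result}."}

def run(question, max_steps=5):
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        step = model(messages)
        if "answer" in step:              # model decided to answer
            return step["answer"]
        tool = TOOLS[step["tool"]]        # model decided to act
        result = tool(*step["args"])
        messages.append({"role": "tool", "content": result})
    return "Gave up."

print(run("What's the weather in Stockholm right now?"))
```

The key design point is that the loop is dumb and the model is in charge: each turn, the model's output decides whether the next step is an action or the final answer.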

This was the moment the model stopped being a text generator and became a problem solver. Not just predicting the next token — deciding what action to take next. The next token might be an answer, or it might be a function call. The model learned to reach out into the world.


The rock talks

We taught the rock to talk. Not just autocomplete — actual conversation. It listens to what you want. It thinks before it speaks. It reaches out for information it doesn't have.

But one agent with tools is still one agent. It can search, reason, and act — but it works alone. Complex problems need more than one perspective. Real work needs planning, execution, review, iteration. It needs a team.

We taught the rock to talk. To think before it speaks. To reach out for what it doesn't know. But one agent is still one agent. What happens when it gets a team — and shows up for work?