The birth of the token
What is language, really? If you strip away the poetry and the emotion, language is just a series of patterns we use to communicate. It's structured noise that somehow turns into abstract concepts in our heads.
For the longest time, computers were absolutely terrible at this. They could count characters, match exact strings, and follow hard-coded grammar rules built by PhDs, but they didn't understand anything. If you built an AI a few years ago to classify movie reviews, it was probably incredibly narrow. If you didn't explicitly tell the computer that the word "terrific" was good, it had no idea.
To solve this, we had to stop teaching computers the dictionary. We had to turn words into math.
The "Show Me Your Friends" Algorithm
In 2013, a team at Google tried something almost stupidly simple. They built a model called Word2Vec and gave it one job: guess the missing word based on its neighbors.
That's it. No grammar rules. No dictionaries. No understanding of what words actually mean. Just: "here are the five words to the left and five words to the right — what goes in the middle?"
They fed it billions of words from the internet and let it guess. Millions of times. And then something remarkable happened.
The model didn't know what a mackerel or a cod looked like. It had never seen the ocean. But it noticed that these words kept showing up in the exact same company — both swim, both like saltwater, both hang out on the West Coast.
By just predicting neighbors, the model was forced to figure out what words mean. And it stored that meaning as a number — a vector — placing words that share context right next to each other in a giant mathematical space.
Word2Vec actually had two ways to do this:
CBOW (Continuous Bag of Words) looks at the neighbors and guesses the missing word — exactly like the exercise above. Skip-gram flips it: start with a word and guess what's around it.
Both approaches learn the same thing: words that share context share meaning.
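The two objectives are easiest to see as the training pairs they generate from a sentence. Here's a minimal sketch — the function names, window size, and example sentence are mine, not from the original Word2Vec code:

```python
# Turn a sentence into Word2Vec-style training pairs, both flavors.
# (Illustrative sketch; real Word2Vec also subsamples and negative-samples.)

def cbow_pairs(tokens, window=2):
    """CBOW: (context words) -> target word, one pair per position."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((tuple(context), target))
    return pairs

def skipgram_pairs(tokens, window=2):
    """Skip-gram: target word -> each context word, one pair per neighbor."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + 1 + window)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the salmon swims upstream to spawn".split()
print(cbow_pairs(sentence)[2])       # (('the', 'salmon', 'upstream', 'to'), 'swims')
print(skipgram_pairs(sentence)[:2])  # [('the', 'salmon'), ('the', 'swims')]
```

Same sentence, same window — just the question flipped: CBOW asks "what's in the middle?", skip-gram asks "what's around me?".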
But similar words aren't identical. Some contexts only accept certain words — a salmon swims upstream to spawn, but you'd never say that about a cod. These differences are what give each word its unique position in the vector space.
But how does it actually learn to predict the right word?
It starts with random weights — completely clueless. Then it sees thousands of sentences, makes predictions, checks if it was right, and adjusts its weights a tiny bit each time. Watch the training process unfold:
Deep dive: how neural networks actually learn — drag weights, break things, watch the math, no equations needed. (Interactive; click to expand.)
What IS a weight?
Every connection in a neural network is a number — a weight. When data flows through, each input gets multiplied by its weight, then everything gets added up. Different weights = different output. That's it.
Click any line in the network below to adjust its weight. Watch how the nodes downstream change:
Every number you just dragged? That's a weight. The network you played with has 20. Word2Vec has thousands. GPT-4 has trillions. Training = finding the right value for every single one.
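In code, a single node really is just multiply-and-add. A toy sketch, with made-up numbers:

```python
# One node: multiply each input by its weight, add everything up.
# Different weights -> different output. That's the whole mechanism.
# All values here are invented for illustration.

def node_output(inputs, weights):
    return sum(x * w for x, w in zip(inputs, weights))

inputs = [1.0, 2.0]
print(node_output(inputs, [0.5, 0.25]))   # 1*0.5 + 2*0.25  = 1.0
print(node_output(inputs, [0.5, -0.25]))  # 1*0.5 - 2*0.25  = 0.0
```

Stack thousands of these nodes and you have Word2Vec; stack trillions of weights and you're in GPT-4 territory. The operation never changes — only the count.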
Tuning the Word2Vec network
Here's the same Word2Vec network from above — but now you can switch between training stages and see how the predictions change. Start with random weights (garbage predictions), then watch what "Trained" looks like. Then try breaking a weight yourself.
That's the whole game. The network's ability to predict fish words comes entirely from these weight values. Training is just the process of finding values that work — across ALL tasks at once.
But how do we know if we're right?
We need a single number that says "how wrong are these predictions?" — the loss. Cross-entropy loss computes it for each task: take the confidence in the correct answer, and compute -log of it. High confidence = low loss. Low confidence = catastrophic loss.
Notice how the herring task dominates the total loss — the model barely thinks it's herring. Training will focus on fixing THAT first, because the punishment is exponentially harsh for low confidence.
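The loss formula fits in one line. The confidence values below are invented, but the shape of the curve is the point: as confidence drops, the punishment explodes.

```python
import math

# Cross-entropy loss for a single prediction:
# -log(model's confidence in the correct answer).

def cross_entropy(confidence_in_correct):
    return -math.log(confidence_in_correct)

print(round(cross_entropy(0.90), 3))  # high confidence -> tiny loss (~0.105)
print(round(cross_entropy(0.02), 3))  # low confidence  -> huge loss (~3.912)
```

That asymmetry is why the low-confidence task dominates the total: being 2% sure costs roughly 37× more than being 90% sure.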
How does each weight learn its share of the blame?
The loss tells us how wrong we are. But the model has thousands of weights — how does each one know whether to go up or down, and by how much? Through backpropagation: the error at the output flows backwards through the network, and each weight receives a gradient — a number saying "this is how much YOU contributed to the mistake."
Each weight gets its own gradient. The embedding for "swimming" gets a different gradient than the embedding for "wild" — because they contributed differently to the prediction. Weights that caused more error get bigger gradients and bigger updates. One backward pass gives us ALL gradients simultaneously — no need to test each weight one at a time.
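"Each weight gets its own share of the blame" can be seen on the smallest possible example: one linear node with a squared-error loss. The numbers are invented; the chain rule is the point — each weight's gradient depends on its own input, so weights that pushed the output harder get bigger gradients:

```python
# Backward pass on one node: loss = (w1*x1 + w2*x2 - target)^2.
# Chain rule: dloss/dw_i = 2 * (y - target) * x_i
# (toy values, chosen so the arithmetic is easy to follow)

x1, x2 = 1.0, 3.0        # inputs
w1, w2 = 0.5, 0.5        # current weights
target = 1.0

y = w1 * x1 + w2 * x2    # forward pass: 2.0
error = y - target       # 1.0 — we overshot

grad_w1 = 2 * error * x1  # 2.0 — small input, small share of blame
grad_w2 = 2 * error * x2  # 6.0 — bigger input, bigger share of blame
print(grad_w1, grad_w2)
```

One forward pass, one application of the chain rule, and both gradients fall out together — exactly the "all gradients simultaneously" property, just at miniature scale.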
Taking a step downhill
Now we have a gradient for every weight. The update rule is simple: weight -= learning_rate × gradient. Multiply each gradient by a small number (the learning rate), and nudge each weight in the opposite direction of its gradient. That's one step. Repeat thousands of times.
That's the entire learning algorithm. Forward pass (predict) → compute loss (how wrong?) → compute gradients (which direction?) → update weights (take a step). Every AI model — from Word2Vec to GPT-4 — learns this way.
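The four-step loop above fits in a dozen lines. A sketch on the tiniest possible model — fitting y = w·x to a single data point, with made-up numbers:

```python
# The whole algorithm: predict -> loss -> gradient -> update, repeated.
# (Toy model with one weight; real networks just have more of them.)

def train(x, target, w=0.0, lr=0.1, steps=100):
    for _ in range(steps):
        y = w * x                      # forward pass (predict)
        loss = (y - target) ** 2       # how wrong?
        grad = 2 * (y - target) * x    # which direction?
        w -= lr * grad                 # take a step downhill
    return w

w = train(x=2.0, target=6.0)
print(round(w, 4))  # converges toward 3.0, since 3.0 * 2.0 = 6.0
```

Swap the toy model for a network with thousands (or trillions) of weights and the loop is unchanged — that's the sense in which every model from Word2Vec to GPT-4 learns the same way.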
Welcome to the Vector Space
Word2Vec wasn't designed to create a space of meaning. It was just trying to guess context words. But the space emerged anyway — and it turned out to be spectacular.
The result is a high-dimensional vector space, usually about 300 dimensions. We can't picture 300 dimensions, but if we squash them down to 3D, we get a peek at what the model built.
And what it built is beautiful. Words that mean similar things cluster together. "Lawyer" (Swedish: jurist) and "Attorney" (advokat) are practically on top of each other. "Lawyer" and "Company" (företag) are somewhat close, but noticeably further apart. The geometry is the meaning.
When I first saw this, I was completely blown away. I started training my own models — feeding them different text corpora, watching the spaces form. It was like having a universal word calculator. You could query it for synonyms, find words that appear in the same contexts, discover relationships nobody had programmed. I spent weeks just playing with it, throwing words in and seeing what came back.
But here is where it gets truly mind-blowing. Once words are just coordinates in space, you can do math with them.
The Magic of Word Algebra
Here's the part that made researchers' jaws drop. Because relationships between concepts are stored as physical distances and directions in this space, those directions are consistent. The direction from "King" to "Queen" is the same as the direction from "Man" to "Woman." It's a vector that essentially means "make it female."
That means you can pick up that direction — that arrow — and apply it somewhere else. You can do math with meaning:
Take King, subtract Man, add Woman — you land almost exactly on Queen. The model was never told about gender. It figured out the pattern by reading millions of sentences.
And it doesn't stop there:
The concept of "is the capital of" exists as a direction in vector space. Swap out the country, apply the same arrow, and you get the new capital.
Fastest minus Fast, plus Good — you get Best, not "Goodest." The model didn't just learn a suffix; it learned the concept of superlative.
Nobody programmed these relationships. The model discovered them on its own, just from reading how words tend to appear near each other. The geometry of the space is the meaning.
In the animation below, notice how the arrow from "Tiger" to "Tigers" — the "plural" direction — keeps roughly the same length and direction when you move it to any other animal. It's not exact, but it's close. That approximate consistency is the whole trick.
The Flaw: Why Word2Vec Wasn't Enough
But there was a massive problem. Every word gets exactly one vector — a single, static coordinate, looked up from a fixed table after tokenization. The same vector every time, in every sentence, in every context. The word "play" in "play football" and "play the piano" — identical dot in space. And it gets worse:
This movie was shit. vs. This movie was the shit.
One word difference. Opposite meanings. Word2Vec can't tell them apart — "shit" gets one vector, stuck between insult and compliment.
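The flaw lives in a single lookup: after tokenization, the embedding is read from a fixed table that never sees the rest of the sentence. A toy sketch with invented vectors:

```python
# A static embedding table: one fixed vector per token, context ignored.
# (Vectors are invented; real tables map ~millions of tokens to ~300 dims.)

embeddings = {
    "play": [0.30, 0.70],
    "shit": [-0.20, 0.50],
}

def embed(sentence):
    return [embeddings.get(tok, [0.0, 0.0]) for tok in sentence.lower().split()]

v1 = embed("play football")[0]
v2 = embed("play the piano")[0]
print(v1 == v2)  # True — identical dot in space, whatever the context
```

Whatever surrounds "play" — or "shit" — the lookup returns the same coordinates. Fixing that requires a model whose output for a word depends on the whole sentence.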
To actually understand language, the AI needed to stop looking at words in isolation. It needed to see the entire text at once. It needed "Attention."
And that's when everything changed. Read Part 2: How to teach a rock to understand →