The universal language
Part 4 ended with a quiet revelation. Every Transformer — encoder, decoder, text-to-text — speaks the same language: tokens. And a token is just a number.
Nothing in the math requires that number to represent a word. It could represent a pixel. A sound. A frame of video. Researchers looked at the Transformer and asked: what if we just... feed it something else?
But what does "token" actually mean for text? Computers don't see words the way we do. They break text into tokens — chunks that might be whole words, word pieces, or even individual characters. The word "unhappiest" might become three pieces: "un", "happi", "est". Common words like "the" stay whole. Rare words get split. Different models split differently — two models looking at the same sentence can produce completely different tokens.
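To make that concrete, here's a toy greedy longest-match tokenizer over a made-up vocabulary — a sketch of the idea, not any real model's algorithm (real tokenizers like BPE learn their vocabularies from data):

```python
# Hypothetical vocabulary of known pieces; real models learn tens of thousands
vocab = {"the", "un", "happi", "est", "happy", "ness"}

def tokenize(word):
    pieces, i = [], 0
    while i < len(word):
        # Take the longest vocabulary piece that matches at position i
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown: fall back to a single character
            i += 1
    return pieces

print(tokenize("unhappiest"))  # ['un', 'happi', 'est']
print(tokenize("the"))         # ['the'] — common words stay whole
```

Swap in a different vocabulary and the same sentence splits differently — which is exactly why two models rarely agree on tokens.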
Going forward, we'll just say "words" to keep it simple.
That's the text side. But the Transformer doesn't care where the numbers come from.
Teaching a rock to see
The answer turned out to be almost embarrassingly simple. Take a photo. Chop it into a grid of 16×16 patches. Flatten each patch into a vector — exactly like a word embedding from Part 1. Feed the sequence into a Transformer. 1
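The patchify step really is just array reshaping. A minimal numpy sketch, using a random array as a stand-in for a photo (a real ViT would follow this with a learned linear projection and position embeddings):

```python
import numpy as np

# Toy "photo": 64x64 pixels, 3 color channels
image = np.random.rand(64, 64, 3)

patch = 16
h, w, c = image.shape
# Chop into a 4x4 grid of 16x16 patches, then flatten each patch into one vector
patches = image.reshape(h // patch, patch, w // patch, patch, c)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

print(patches.shape)  # (16, 768): sixteen "tokens", each a 768-number vector
```

Sixteen vectors in a row — to the Transformer, indistinguishable from a sixteen-word sentence.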
Same attention mechanism from Part 2. Same architecture. But now, instead of words attending to words, image patches attend to image patches. The cat's ear learns that the whiskers matter. The sky learns to ignore the ground.
The paper's title said it all: "An Image is Worth 16×16 Words."
And it worked. Not sort-of worked — it matched or beat the best image classifiers in the world, models that had been purpose-built for vision over a decade. The Transformer didn't care that these weren't words. Tokens are tokens.
Connecting eyes and ears
If images, audio, and text are all just sequences of numbers, could you put them in the same space? OpenAI's CLIP 2 trained two encoders — one for images, one for text — pushing matching pairs close together and mismatched pairs apart, across 400 million image-caption pairs. The result was the vector space from Part 1 — but now words and images lived in it.
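The training objective can be sketched in a few lines of numpy, with random vectors standing in for the two encoders' outputs: build a similarity matrix between every image and every caption in a batch, then reward the diagonal (the matching pairs). This is a simplified sketch — real CLIP also adds a learned temperature and averages the loss over both directions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch: 4 image embeddings and 4 caption embeddings (random stand-ins
# for what the image encoder and text encoder would produce)
img = rng.normal(size=(4, 8))
txt = rng.normal(size=(4, 8))

# Normalize so dot products become cosine similarities
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

# Entry [i, j] compares image i with caption j.
# Training pushes the diagonal up and everything else down.
sims = img @ txt.T

# Cross-entropy where the correct "class" for image i is caption i
probs = np.exp(sims) / np.exp(sims).sum(axis=1, keepdims=True)
loss = -np.log(np.diag(probs)).mean()
print(sims.shape, float(loss) > 0)
```

Minimizing that loss is what drags a photo of a cat and the sentence "a photo of a cat" toward the same point in space.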
Whisper 4 took it further: point the encoder-decoder Transformer at spectrograms and let it "translate" speech into text. The same architecture that translated English to French, now translating sound to words.
Looking into the space
With images and text in the same space, we could do something new: look between concepts. What lives in the middle of "lemon," "dwarf," and an image of a robot? In 2021, my team at Labelf tried to find out. We hooked BigGAN — an image generator from 2018 — up to CLIP. CLIP picks a position in the multimodal space based on the prompt "lemon dwarf robot," and BigGAN tries to paint what that position looks like. Frame by frame, CLIP steers, BigGAN renders. (BigGAN is old and a mediocre painter — the visuals are an approximation of what the space contains, not a perfect rendering. Don't fixate on the artifacts.)
But look past BigGAN's limitations and watch the scales: the lemon-scale, the dwarf-scale, the robot-scale. The sphere never goes fully to one concept — it always retains traces of the others. You're watching the geometry of the space move. All the patterns, all the connections between concepts — that's what lives in this geometry. And for the first time, we could actually see it.
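The shape of that steering loop can be caricatured in a few lines. Everything here is a hypothetical stand-in — a random linear map playing BigGAN, a random unit vector playing CLIP's embedding of the prompt — and it hill-climbs instead of following gradients, which a real setup would typically use:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for the real models (both hypothetical):
W = rng.normal(size=(16, 32))
def generate(z):
    # Plays BigGAN: latent vector in, "image" features out
    return np.tanh(z @ W)

target = rng.normal(size=32)
target /= np.linalg.norm(target)  # plays CLIP's embedding of the prompt

def clip_score(feats):
    # Cosine similarity between the rendered "image" and the prompt embedding
    return float(feats @ target / np.linalg.norm(feats))

# The steering loop: nudge the latent, keep the nudge whenever the score improves
z = rng.normal(size=16)
start = score = clip_score(generate(z))
for _ in range(300):
    candidate = z + 0.1 * rng.normal(size=16)
    s = clip_score(generate(candidate))
    if s > score:
        z, score = candidate, s

print(round(start, 2), "->", round(score, 2))
```

Each accepted nudge is one frame of the animation: CLIP scores, the generator re-renders, and the latent drifts through the space toward "lemon dwarf robot."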
DALL-E 3 went further: text in, image out. Stable Diffusion 5 made it open-source and fast enough to run on a laptop. The Transformer wasn't just reading the world. It was drawing it.
Teaching a rock to hear
Sound is a pressure wave, but convert it to a mel spectrogram — a heatmap of time vs. frequency — and it becomes a grid, just like an image. The Audio Spectrogram Transformer 6 did exactly what ViT did: chop it into 16×16 patches and feed them into a Transformer. Same architecture, no audio-specific tricks. Tokens are tokens.
Meta's MusicGen 7 flipped it: instead of an encoder reading audio tokens, a decoder writes them — predicting the next one autoregressively, exactly like GPT predicts the next word. Same architecture as Part 3. Different tokens.
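The autoregressive loop has the same shape whatever the tokens are. A toy sketch with a made-up 8-token audio "codebook" and a random transition table standing in for the trained decoder (real MusicGen predicts over a learned audio codec's much larger codebook, conditioned on the whole history):

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = 8  # toy codebook of 8 audio tokens

# Hypothetical stand-in "model": probability of the next token given the last one
probs = rng.dirichlet(np.ones(vocab), size=vocab)

seq = [0]  # start token
for _ in range(10):
    # The autoregressive step: sample the next audio token, append, repeat
    nxt = rng.choice(vocab, p=probs[seq[-1]])
    seq.append(int(nxt))

print(seq)  # a generated sequence of 11 audio-token ids
```

Replace the transition table with GPT and the token ids with words, and this is Part 3's generation loop, unchanged.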
One architecture, every sense
By 2022 the same Transformer architecture — unchanged since 2017 — was reading text, classifying images, transcribing speech, and generating art. No one redesigned it. They just changed what the tokens represented. With multimodality solved and scaling laws in full bloom, there were no fundamental breakthroughs left to wait for.
It was a game of time, steering, data, and funding now.
But the rock still didn't know you were talking to it. What happens when someone teaches it to listen?