[{"data":1,"prerenderedAt":1513},["ShallowReactive",2],{"series-hur-oslo-blir-stockholm-och-stockholm-blir-silicon-valley":3,"recommended-hur-oslo-blir-stockholm-och-stockholm-blir-silicon-valley":4},null,[5,497,1139],{"_path":6,"_dir":7,"_draft":8,"_partial":8,"_locale":9,"title":10,"description":11,"order":12,"translationSlug":13,"img":14,"date":15,"tag":16,"featured":17,"author":18,"body":22,"_type":491,"_id":492,"_source":493,"_file":494,"_stem":495,"_extension":496},"/article/en/how-oslo-becomes-stockholm-and-stockholm-becomes-silicon-valley","en",false,"","How Oslo Becomes Stockholm — and Stockholm Becomes Silicon Valley","Buy versus build, early capital, and why we have to stop building everything in-house if we want an ecosystem that actually flies.",0,"hur-oslo-blir-stockholm-och-stockholm-blir-silicon-valley","/img/ecosystem-hero.svg","2026-05-19","Entrepreneurship",true,{"name":19,"bio":20,"authorImage":21},"Viktor Alm","I like to build stuff","/img/viktor.jpg",{"type":23,"children":24,"toc":479},"root",[25,33,42,47,52,56,63,68,73,78,83,88,93,96,102,107,118,128,138,148,153,158,161,167,172,177,189,202,207,212,217,220,226,238,243,248,251,257,262,274,279,287,292,307,312,324,327,333,338,343,348,353,356,362,367,421,426,429,435,440,445,448,454,459,464,469,474],{"type":26,"tag":27,"props":28,"children":29},"element","p",{},[30],{"type":31,"value":32},"text","I was at a conference in Oslo and someone asked the question straight out:",{"type":26,"tag":34,"props":35,"children":36},"blockquote",{},[37],{"type":26,"tag":27,"props":38,"children":39},{},[40],{"type":31,"value":41},"How does Norway become as good as Sweden at building AI companies?",{"type":26,"tag":27,"props":43,"children":44},{},[45],{"type":31,"value":46},"Good question. And as someone from Gothenburg who just moved back from Stockholm, I sit on a parallel version of it: how does Gothenburg measure up to Stockholm? 
And if we're already there — how does Stockholm become better than Silicon Valley?",{"type":26,"tag":27,"props":48,"children":49},{},[50],{"type":31,"value":51},"I'm going to answer all three at once, because it's the same question.",{"type":26,"tag":53,"props":54,"children":55},"hr",{},[],{"type":26,"tag":57,"props":58,"children":60},"h2",{"id":59},"its-an-ecosystem-not-a-company",[61],{"type":31,"value":62},"It's an ecosystem, not a company",{"type":26,"tag":27,"props":64,"children":65},{},[66],{"type":31,"value":67},"Stockholm works because it's had time to work. Skype, Spotify, the Stardoll mafia, most recently Sana. Internet and dotcom companies have existed there since the late 90s. Old finance guys with deep pockets are deep into tech. The private equity world is right next to it. It's a full ecosystem that has existed for decades.",{"type":26,"tag":27,"props":69,"children":70},{},[71],{"type":31,"value":72},"Take one of our angels as an example. He's invested in over 30 companies, helped build several earlier ones, has an exit behind him, has money — and most importantly, he reinvests that money in the ecosystem. That's where the magic lives.",{"type":26,"tag":27,"props":74,"children":75},{},[76],{"type":31,"value":77},"Because it's one thing when a company sells and the money lands straight in the founders' pockets. It's something else entirely when that money comes back into the next company. Or take the Norwegian Oil Fund: kroner from Norwegian oil parked in Tesla and other foreign companies. Fine returns — but it doesn't build a domestic ecosystem. You have to invest in your own.",{"type":26,"tag":27,"props":79,"children":80},{},[81],{"type":31,"value":82},"And when ten, twenty, a hundred of these people reinvest at the same time, something happens: the early capital required to break loose becomes available. You can put together a board where someone has already taken a company from zero to fifty employees. Someone who has done an exit. 
Someone who already has the contacts and knows how to sell a B2B product to a large customer.",{"type":26,"tag":27,"props":84,"children":85},{},[86],{"type":31,"value":87},"This feeds itself. The first five companies do well, sell, and the talent gets pumped back into circulation. The next wave is ten. Then thirty. Skills compound, money compounds, contacts compound. The momentum builds on itself.",{"type":26,"tag":27,"props":89,"children":90},{},[91],{"type":31,"value":92},"Silicon Valley has had this mechanism running since the 70s and 80s. That's why it's hard to beat.",{"type":26,"tag":53,"props":94,"children":95},{},[],{"type":26,"tag":57,"props":97,"children":99},{"id":98},"what-needs-to-be-in-place",[100],{"type":31,"value":101},"What needs to be in place",{"type":26,"tag":27,"props":103,"children":104},{},[105],{"type":31,"value":106},"Starting an ecosystem from nothing requires four things at once:",{"type":26,"tag":27,"props":108,"children":109},{},[110,116],{"type":26,"tag":111,"props":112,"children":113},"strong",{},[114],{"type":31,"value":115},"1. Early capital.",{"type":31,"value":117}," Not grants from the eighteenth agency that requires the project to satisfy fourteen politically-set bureaucratic rules. Real money from people who have done exits and know what they're doing.",{"type":26,"tag":27,"props":119,"children":120},{},[121,126],{"type":26,"tag":111,"props":122,"children":123},{},[124],{"type":31,"value":125},"2. Founders with the right backbone.",{"type":31,"value":127}," High technical level, organizational skill, someone who has already built something. Either they exist in the environment, or the environment attracts them.",{"type":26,"tag":27,"props":129,"children":130},{},[131,136],{"type":26,"tag":111,"props":132,"children":133},{},[134],{"type":31,"value":135},"3. A talent pool that grows with the companies.",{"type":31,"value":137}," Juniors who are easy to shape. Experienced salespeople. 
Product managers who have implemented something ten times before. People who have been on the ride at earlier companies.",{"type":26,"tag":27,"props":139,"children":140},{},[141,146],{"type":26,"tag":111,"props":142,"children":143},{},[144],{"type":31,"value":145},"4. The ability to even get started.",{"type":31,"value":147}," Before you have an investor, you're on your own savings. You work in parallel, take a leave of absence, live off consulting savings. My own starting capital was money I had saved up from consulting; it funded the time I needed to build the prototype that became the foundation of Labelf.",{"type":26,"tag":27,"props":149,"children":150},{},[151],{"type":31,"value":152},"This first step is harder than people think. And it's one of the places where society can help the most. Real employee stock options that aren't taxed to death. Founder-friendly tax treatment for early employees. Tax breaks for investors who take real risk in early-stage companies.",{"type":26,"tag":27,"props":154,"children":155},{},[156],{"type":31,"value":157},"The US has QSBS — Qualified Small Business Stock. Invest in a qualified early-stage company and hold for at least five years, and you can exclude the greater of ten million dollars or ten times your basis in gains from federal tax at exit. A five-million investment can generate fifty million in tax-free capital gains. That's why the willingness to write early checks is so high there. The upside is enormous. And the upside drives behavior.",{"type":26,"tag":53,"props":159,"children":160},{},[],{"type":26,"tag":57,"props":162,"children":164},{"id":163},"schools-are-network-not-skill",[165],{"type":31,"value":166},"Schools are network, not skill",{"type":26,"tag":27,"props":168,"children":169},{},[170],{"type":31,"value":171},"The first thing people instinctively point to is the universities. Stockholm has KTH. Boston has MIT. 
The Bay Area has Stanford.",{"type":26,"tag":27,"props":173,"children":174},{},[175],{"type":31,"value":176},"That's the wrong cause.",{"type":26,"tag":27,"props":178,"children":179},{},[180,182,187],{"type":31,"value":181},"All technical information is online. Everything can be self-taught. What schools actually deliver is ",{"type":26,"tag":111,"props":183,"children":184},{},[185],{"type":31,"value":186},"network",{"type":31,"value":188},": you meet someone whose dad has done an exit. You already know who to call when you need a marketer, because it was Lisa who was great at it back when you sat in the same lecture hall.",{"type":26,"tag":27,"props":190,"children":191},{},[192,194,200],{"type":31,"value":193},"That part is valuable. But in raw skill, school doesn't matter — it's negligible. The role of schools is an ",{"type":26,"tag":195,"props":196,"children":197},"em",{},[198],{"type":31,"value":199},"effect",{"type":31,"value":201}," of the rest of the ecosystem already being in the same place. Not an independent advantage.",{"type":26,"tag":27,"props":203,"children":204},{},[205],{"type":31,"value":206},"And the tech team doesn't even need to sit in the same place. My technical team is remote-first. There's no reason product builders should sit next to each other and disturb each other while they're coding. They can just as well sit in Discord like gamers — build their things, share screens when needed.",{"type":26,"tag":27,"props":208,"children":209},{},[210],{"type":31,"value":211},"Real talent finds what it needs in papers and on GitHub. It doesn't depend on some lecturer teaching it the basics.",{"type":26,"tag":27,"props":213,"children":214},{},[215],{"type":31,"value":216},"(That's a different question if you're cranking out code monkeys for pointless internal projects with no real depth. Then you need junior engineers on site. 
But that's not the part that builds an ecosystem.)",{"type":26,"tag":53,"props":218,"children":219},{},[],{"type":26,"tag":57,"props":221,"children":223},{"id":222},"you-need-a-domestic-early-adopter-market",[224],{"type":31,"value":225},"You need a domestic early-adopter market",{"type":26,"tag":27,"props":227,"children":228},{},[229,231,236],{"type":31,"value":230},"If you're building direct-to-consumer, you need access to a home market that consists of ",{"type":26,"tag":111,"props":232,"children":233},{},[234],{"type":31,"value":235},"early adopters",{"type":31,"value":237},", not laggards. Operate in a country where everyone is waiting for the technology to mature and it's incredibly hard to innovate.",{"type":26,"tag":27,"props":239,"children":240},{},[241],{"type":31,"value":242},"Sweden had unusually high internet penetration during the dotcom era. When things started moving, the audience was already there. Spotify was born in a country where smartphone adoption was high early — that gave them a base to test against before going global.",{"type":26,"tag":27,"props":244,"children":245},{},[246],{"type":31,"value":247},"That's not a coincidence. It's a precondition.",{"type":26,"tag":53,"props":249,"children":250},{},[],{"type":26,"tag":57,"props":252,"children":254},{"id":253},"and-now-the-big-one-buy-versus-build",[255],{"type":31,"value":256},"And now the big one: buy versus build",{"type":26,"tag":27,"props":258,"children":259},{},[260],{"type":31,"value":261},"This is what holds Sweden and Norway back the most right now. 
And it's what politicians, enterprises and the public sector can actually change directly.",{"type":26,"tag":27,"props":263,"children":264},{},[265,267,272],{"type":31,"value":266},"If you're building a B2B product, there has to be a culture of ",{"type":26,"tag":111,"props":268,"children":269},{},[270],{"type":31,"value":271},"buying",{"type":31,"value":273}," instead of building in-house.",{"type":26,"tag":27,"props":275,"children":276},{},[277],{"type":31,"value":278},"The question an enterprise should ask itself:",{"type":26,"tag":34,"props":280,"children":281},{},[282],{"type":26,"tag":27,"props":283,"children":284},{},[285],{"type":31,"value":286},"Why are we building this tool ourselves, when 300 other companies have exactly the same problem?",{"type":26,"tag":27,"props":288,"children":289},{},[290],{"type":31,"value":291},"Because when you answer \"because we want it customized\" and kick off an internal project with one of the big consulting firms, two things happen:",{"type":26,"tag":293,"props":294,"children":295},"ol",{},[296,302],{"type":26,"tag":297,"props":298,"children":299},"li",{},[300],{"type":31,"value":301},"You spend several times more money than you would have paid for a SaaS, and it takes several times longer.",{"type":26,"tag":297,"props":303,"children":304},{},[305],{"type":31,"value":306},"You kill the chance of the domestic startup that could have solved the problem — because it gets no customer, no reference, no revenue.",{"type":26,"tag":27,"props":308,"children":309},{},[310],{"type":31,"value":311},"And when your internal project stalls or has to be scrapped two years later, you buy a solution from an American company that managed to grow in the meantime. The Swedish or Norwegian company that had a real chance — that could have been your American company's competitor — no longer exists. 
You opted out of the whole team.",{"type":26,"tag":27,"props":313,"children":314},{},[315,317,322],{"type":31,"value":316},"This also applies to municipalities, regions, agencies and the armed forces. Letting massive consulting firms build in-house instead of procuring from new companies is a national economic waste. Two losses: more expensive solution ",{"type":26,"tag":195,"props":318,"children":319},{},[320],{"type":31,"value":321},"and",{"type":31,"value":323}," killed ecosystem.",{"type":26,"tag":53,"props":325,"children":326},{},[],{"type":26,"tag":57,"props":328,"children":330},{"id":329},"where-does-talent-spend-its-hours",[331],{"type":31,"value":332},"Where does talent spend its hours?",{"type":26,"tag":27,"props":334,"children":335},{},[336],{"type":31,"value":337},"The same logic applies to you.",{"type":26,"tag":27,"props":339,"children":340},{},[341],{"type":31,"value":342},"You have a limited number of productive years. Don't spend them on Accenture's big contracts. Don't spend them building internal tools for Amazon. Don't spend them on a custom system that dies with the contract.",{"type":26,"tag":27,"props":344,"children":345},{},[346],{"type":31,"value":347},"Spend them on something that can scale. Find a problem a hundred companies have and extract it. Join the ride at an early-stage company. Or start your own.",{"type":26,"tag":27,"props":349,"children":350},{},[351],{"type":31,"value":352},"You don't get a Spotify out of a staffing firm. 
And you don't get a product ecosystem out of a country where the product thinkers sell their time piecemeal to big offices.",{"type":26,"tag":53,"props":354,"children":355},{},[],{"type":26,"tag":57,"props":357,"children":359},{"id":358},"what-politicians-can-do-concretely",[360],{"type":31,"value":361},"What politicians can do concretely",{"type":26,"tag":27,"props":363,"children":364},{},[365],{"type":31,"value":366},"None of this has to be mysterious:",{"type":26,"tag":368,"props":369,"children":370},"ul",{},[371,381,391,401,411],{"type":26,"tag":297,"props":372,"children":373},{},[374,379],{"type":26,"tag":111,"props":375,"children":376},{},[377],{"type":31,"value":378},"Less bureaucracy around grants.",{"type":31,"value":380}," The whole industry of \"apply-for-money\" consultants is a symptom of a broken system. Fewer rules, faster decisions, more capital to fewer projects.",{"type":26,"tag":297,"props":382,"children":383},{},[384,389],{"type":26,"tag":111,"props":385,"children":386},{},[387],{"type":31,"value":388},"Tax breaks for early investors.",{"type":31,"value":390}," Reward the risk. The US QSBS model is a good starting point.",{"type":26,"tag":297,"props":392,"children":393},{},[394,399],{"type":26,"tag":111,"props":395,"children":396},{},[397],{"type":31,"value":398},"Real employee stock options.",{"type":31,"value":400}," Employees one through ten should be able to own a meaningful piece of the company without taxes eating the entire upside. And the cap has to match how large companies actually become — three million kronor is nothing when the target is unicorn valuation. 
Owning bigger pieces should pay off personally, not be punished.",{"type":26,"tag":297,"props":402,"children":403},{},[404,409],{"type":26,"tag":111,"props":405,"children":406},{},[407],{"type":31,"value":408},"Founder-friendly capital gains treatment.",{"type":31,"value":410}," That's where exits turn into the next generation's investments.",{"type":26,"tag":297,"props":412,"children":413},{},[414,419],{"type":26,"tag":111,"props":415,"children":416},{},[417],{"type":31,"value":418},"Buy-first policy in the public sector.",{"type":31,"value":420}," Smaller procurements sized so that young companies can actually win them.",{"type":26,"tag":27,"props":422,"children":423},{},[424],{"type":31,"value":425},"Companies that fly bring in tax revenue from customers all over the world. That's a massive net positive. Rewarding early-stage risk isn't a cost — it's an investment with extreme leverage.",{"type":26,"tag":53,"props":427,"children":428},{},[],{"type":26,"tag":57,"props":430,"children":432},{"id":431},"thanks-to-the-ones-who-dare",[433],{"type":31,"value":434},"Thanks to the ones who dare",{"type":26,"tag":27,"props":436,"children":437},{},[438],{"type":31,"value":439},"I want to thank the enterprise customers who choose to bet on new companies instead of building everything themselves with the consulting firms. You're the ones who actually create ecosystems. And thanks to everyone working inside enterprises who pushes \"let's try this little company\" — it's often individual people who open doors.",{"type":26,"tag":27,"props":441,"children":442},{},[443],{"type":31,"value":444},"And to politicians, both Swedish and Norwegian: think about how you can make it easier, more open, more inviting. Simplify. Reward risk. No more eighteen agencies seeking money from each other. 
More capital straight to the people who build.",{"type":26,"tag":53,"props":446,"children":447},{},[],{"type":26,"tag":57,"props":449,"children":451},{"id":450},"back-to-the-question",[452],{"type":31,"value":453},"Back to the question",{"type":26,"tag":27,"props":455,"children":456},{},[457],{"type":31,"value":458},"How does Oslo become Stockholm? By having Norwegian capital reinvested in Norway instead of disappearing into Tesla via the Oil Fund. By Norwegian enterprises starting to buy from Norwegian startups. By the first generation of Norwegian exits creating founders and employees who start over — with money, contacts, and scars.",{"type":26,"tag":27,"props":460,"children":461},{},[462],{"type":31,"value":463},"How does Stockholm become better than Silicon Valley? By not making the mistake Silicon Valley is stuck in now — too expensive, too bureaucratic in its own way, too centralized. Sweden has lower costs, a distributed talent pool, proximity to Europe, and a new generation of hungry founders. The only missing piece is for us to stop being so damn afraid of buying instead of building.",{"type":26,"tag":27,"props":465,"children":466},{},[467],{"type":31,"value":468},"There is an enormous amount left to do. And nothing is invented yet — people always think things are figured out, but they're not.",{"type":26,"tag":27,"props":470,"children":471},{},[472],{"type":31,"value":473},"Buy from startups. Reinvest your exits. Leave the consulting project and build something yourself. 
And see the small companies for what they're becoming — not for what they happen to be right now.",{"type":26,"tag":27,"props":475,"children":476},{},[477],{"type":31,"value":478},"That's how we win.",{"title":9,"searchDepth":480,"depth":480,"links":481},2,[482,483,484,485,486,487,488,489,490],{"id":59,"depth":480,"text":62},{"id":98,"depth":480,"text":101},{"id":163,"depth":480,"text":166},{"id":222,"depth":480,"text":225},{"id":253,"depth":480,"text":256},{"id":329,"depth":480,"text":332},{"id":358,"depth":480,"text":361},{"id":431,"depth":480,"text":434},{"id":450,"depth":480,"text":453},"markdown","content:article:en:how-oslo-becomes-stockholm-and-stockholm-becomes-silicon-valley.md","content","article/en/how-oslo-becomes-stockholm-and-stockholm-becomes-silicon-valley.md","article/en/how-oslo-becomes-stockholm-and-stockholm-becomes-silicon-valley","md",{"_path":498,"_dir":7,"_draft":8,"_partial":8,"_locale":9,"title":499,"description":500,"order":501,"titleClass":502,"cardClass":502,"translationSlug":503,"img":504,"date":505,"tag":506,"featured":8,"series":507,"seriesPart":508,"seriesTitle":509,"author":510,"body":511,"_type":491,"_id":1136,"_source":493,"_file":1137,"_stem":1138,"_extension":496},"/article/en/how-to-grow-a-rock","How to grow a rock","Scaling laws, the compute arms race, and the quadratic wall — how the AI industry learned that bigger models need bigger appetites.",7,"text-gradient-success","hur-man-odlar-en-sten","/img/grow-rock-hero.svg","2026-03-01","NLP","how-to-teach-a-rock",4,"How to teach a 
rock",{"name":19,"bio":20,"authorImage":21},{"type":23,"children":512,"toc":1125},[513,519,533,537,575,580,584,587,593,605,610,617,622,627,634,639,644,650,659,664,668,674,697,721,725,728,734,739,758,762,766,772,777,781,787,796,811,815,821,839,844,848,851,857,862,866,876,881,899,902,928,931,937,949,954,980,996,1010,1015,1020,1024,1027,1033,1038,1048,1065,1070,1089,1104,1120],{"type":26,"tag":57,"props":514,"children":516},{"id":515},"the-race-to-scale",[517],{"type":31,"value":518},"The Race to Scale",{"type":26,"tag":27,"props":520,"children":521},{},[522,524,531],{"type":31,"value":523},"In ",{"type":26,"tag":525,"props":526,"children":528},"a",{"href":527},"/articles/how-to-teach-a-rock-to-write",[529],{"type":31,"value":530},"Part 3",{"type":31,"value":532},", we saw the mechanism: predict the next token, roll the dice, repeat. Beautifully simple. But OpenAI didn't just build an elegant autocomplete — they built it at a scale nobody had attempted before. What followed was one of the most dramatic escalations in the history of technology.",{"type":26,"tag":534,"props":535,"children":536},"model-timeline",{},[],{"type":26,"tag":27,"props":538,"children":539},{},[540,542,547,549,554,556,561,562,566,568,573],{"type":31,"value":541},"In 2020, researchers discovered that model error follows a ",{"type":26,"tag":111,"props":543,"children":544},{},[545],{"type":31,"value":546},"smooth power law.",{"type":31,"value":548}," ",{"type":26,"tag":550,"props":551,"children":553},"source-ref",{"n":552},"4",[],{"type":31,"value":555}," Bigger model, better results. No cliff. No diminishing returns. 
And in 2022, DeepMind showed GPT-3 was actually ",{"type":26,"tag":111,"props":557,"children":558},{},[559],{"type":31,"value":560},"undertrained",{"type":31,"value":548},{"type":26,"tag":550,"props":563,"children":565},{"n":564},"5",[],{"type":31,"value":567}," — the optimal strategy wasn't just more parameters, it was more ",{"type":26,"tag":195,"props":569,"children":570},{},[571],{"type":31,"value":572},"data",{"type":31,"value":574},". Every lab in the world started running.",{"type":26,"tag":27,"props":576,"children":577},{},[578],{"type":31,"value":579},"What does that kind of growth actually look like inside a model? Each building below is one GPT generation. The foundation bricks are attention heads — one per head, sized by the hidden dimension. The tower above is every transformer layer stacked on top. Watch GPT-1's modest 12-layer building get dwarfed as each generation adds more heads, wider layers, and deeper stacks.",{"type":26,"tag":581,"props":582,"children":583},"model-scale-building",{},[],{"type":26,"tag":53,"props":585,"children":586},{},[],{"type":26,"tag":57,"props":588,"children":590},{"id":589},"breaking-the-wall",[591],{"type":31,"value":592},"Breaking the Wall",{"type":26,"tag":27,"props":594,"children":595},{},[596,597,603],{"type":31,"value":523},{"type":26,"tag":525,"props":598,"children":600},{"href":599},"/articles/how-to-teach-a-rock-to-understand",[601],{"type":31,"value":602},"Part 2",{"type":31,"value":604}," we saw how attention works: every word produces a Query, Key, and Value, dot-products find which words matter, and the weighted Values become the output. Every word attends to every other word — an n × n matrix of scores.",{"type":26,"tag":27,"props":606,"children":607},{},[608],{"type":31,"value":609},"That matrix is the wall. 6 words = 36 scores. 1,000 words = 1 million. 1 million words = 1 trillion. Double the context, quadruple the cost. GPT-3 computes 9,216 of these matrices (96 heads × 96 layers), each one n × n. 
Modern models handle 1 million tokens of context. At that scale, full attention is physically impossible.",{"type":26,"tag":611,"props":612,"children":614},"h3",{"id":613},"breaking-the-matrix",[615],{"type":31,"value":616},"Breaking the Matrix",{"type":26,"tag":27,"props":618,"children":619},{},[620],{"type":31,"value":621},"The first problem is the attention computation itself — the n × n score matrix.",{"type":26,"tag":623,"props":624,"children":626},"attention-complexity-curves",{":variants":625},"[\"full\", \"sliding\", \"flash\"]",[],{"type":26,"tag":628,"props":629,"children":631},"h4",{"id":630},"full-attention-the-baseline",[632],{"type":31,"value":633},"Full Attention — the baseline",{"type":26,"tag":27,"props":635,"children":636},{},[637],{"type":31,"value":638},"Standard self-attention computes a score between every pair of tokens. For a sequence of n tokens, that's an n × n matrix — one entry for every possible connection. It's complete: nothing is missed. But the cost is quadratic.",{"type":26,"tag":640,"props":641,"children":643},"attention-variant-detail",{"variant":642},"full",[],{"type":26,"tag":628,"props":645,"children":647},{"id":646},"sliding-window-attention",[648],{"type":31,"value":649},"Sliding Window Attention",{"type":26,"tag":27,"props":651,"children":652},{},[653,655],{"type":31,"value":654},"The simplest fix: limit each token to its w nearest neighbors. Mistral uses a window of 4,096 tokens. The cost drops from O(n²) to O(n × w) — linear in context length. ",{"type":26,"tag":550,"props":656,"children":658},{"n":657},"11",[],{"type":26,"tag":27,"props":660,"children":661},{},[662],{"type":31,"value":663},"The trade-off is real: tokens outside the window are invisible. But information still propagates through layers. 
With 32 transformer blocks and a window of 4,096, the effective receptive field reaches ~131K tokens — each layer passes information one window-width further.",{"type":26,"tag":640,"props":665,"children":667},{"variant":666},"sliding",[],{"type":26,"tag":628,"props":669,"children":671},{"id":670},"flash-attention",[672],{"type":31,"value":673},"Flash Attention",{"type":26,"tag":27,"props":675,"children":676},{},[677,679,684,686,691,693],{"type":31,"value":678},"Flash Attention doesn't change ",{"type":26,"tag":195,"props":680,"children":681},{},[682],{"type":31,"value":683},"what",{"type":31,"value":685}," the model computes — the math is identical to full attention. It changes ",{"type":26,"tag":195,"props":687,"children":688},{},[689],{"type":31,"value":690},"how",{"type":31,"value":692},": by tiling the Q, K, V matrices into small blocks that fit in the GPU's fast on-chip SRAM (~20 MB), it avoids ever writing the full n × n attention matrix to slow HBM (GPU main memory). ",{"type":26,"tag":550,"props":694,"children":696},{"n":695},"7",[],{"type":26,"tag":27,"props":698,"children":699},{},[700,702,705,707,712,714,719],{"type":31,"value":701},"The result: O(n) memory instead of O(n²), and 3× end-to-end speedup on GPT-2 (up to 7.6× on the attention computation alone). ",{"type":26,"tag":550,"props":703,"children":704},{"n":695},[],{"type":31,"value":706}," The key insight is ",{"type":26,"tag":195,"props":708,"children":709},{},[710],{"type":31,"value":711},"IO-awareness",{"type":31,"value":713}," — the bottleneck isn't compute, it's memory bandwidth. Flash Attention computes ",{"type":26,"tag":195,"props":715,"children":716},{},[717],{"type":31,"value":718},"exact",{"type":31,"value":720}," results — not an approximation — with a fraction of the memory traffic. Today, virtually every large model uses Flash Attention. 
It's table stakes.",{"type":26,"tag":640,"props":722,"children":724},{"variant":723},"flash",[],{"type":26,"tag":53,"props":726,"children":727},{},[],{"type":26,"tag":611,"props":729,"children":731},{"id":730},"shrinking-the-cache",[732],{"type":31,"value":733},"Shrinking the Cache",{"type":26,"tag":27,"props":735,"children":736},{},[737],{"type":31,"value":738},"Flash Attention solved the training bottleneck. But there's a second wall that Flash doesn't touch.",{"type":26,"tag":27,"props":740,"children":741},{},[742,744,749,751,756],{"type":31,"value":743},"When a model ",{"type":26,"tag":195,"props":745,"children":746},{},[747],{"type":31,"value":748},"generates",{"type":31,"value":750}," text — one token at a time — it caches every previous token's keys and values so it doesn't have to recompute them. This is the ",{"type":26,"tag":111,"props":752,"children":753},{},[754],{"type":31,"value":755},"KV cache",{"type":31,"value":757},", and it grows linearly with every token produced. For a 96-head model generating a 128K-token response, that's 96 separate K and V tensors, each growing with every single token. Flash Attention can't shrink this. It's a completely different bottleneck — one that only matters during inference, not training.",{"type":26,"tag":759,"props":760,"children":761},"kv-cache-growth",{},[],{"type":26,"tag":623,"props":763,"children":765},{":variants":764},"[\"full\", \"mqa\", \"gqa\", \"mla\"]",[],{"type":26,"tag":628,"props":767,"children":769},{"id":768},"multi-query-attention-mqa",[770],{"type":31,"value":771},"Multi-Query Attention (MQA)",{"type":26,"tag":27,"props":773,"children":774},{},[775],{"type":31,"value":776},"The bluntest fix: share a single set of keys and values across all 96 query heads. The KV cache drops by 96×. 
The trade-off: some quality degradation and training instability, since all heads now read from the same information.",{"type":26,"tag":640,"props":778,"children":780},{"variant":779},"mqa",[],{"type":26,"tag":628,"props":782,"children":784},{"id":783},"grouped-query-attention-gqa",[785],{"type":31,"value":786},"Grouped-Query Attention (GQA)",{"type":26,"tag":27,"props":788,"children":789},{},[790,792],{"type":31,"value":791},"The compromise. Instead of one shared KV head (MQA) or 96 independent ones (full MHA), GQA divides the query heads into groups — typically 8. Each group shares one set of keys and values. ",{"type":26,"tag":550,"props":793,"children":795},{"n":794},"6",[],{"type":26,"tag":27,"props":797,"children":798},{},[799,801,804,806,809],{"type":31,"value":800},"Llama 3 uses GQA with 8 KV heads. ",{"type":26,"tag":550,"props":802,"children":803},{"n":794},[],{"type":31,"value":805}," Mistral uses GQA with 8 KV heads. ",{"type":26,"tag":550,"props":807,"children":808},{"n":657},[],{"type":31,"value":810}," It preserves most of the quality of full multi-head attention while capturing most of the speed gains of MQA — the sweet spot that the industry converged on.",{"type":26,"tag":640,"props":812,"children":814},{"variant":813},"gqa",[],{"type":26,"tag":628,"props":816,"children":818},{"id":817},"multi-head-latent-attention-mla",[819],{"type":31,"value":820},"Multi-Head Latent Attention (MLA)",{"type":26,"tag":27,"props":822,"children":823},{},[824,826,831,833,837],{"type":31,"value":825},"DeepSeek took a different approach entirely. Instead of sharing KV heads, MLA ",{"type":26,"tag":195,"props":827,"children":828},{},[829],{"type":31,"value":830},"compresses",{"type":31,"value":832}," the keys and values into a learned low-dimensional latent space — 512 dimensions instead of 14,000. 
",{"type":26,"tag":550,"props":834,"children":836},{"n":835},"12",[],{"type":31,"value":838}," The model learns what information to keep and what to discard.",{"type":26,"tag":27,"props":840,"children":841},{},[842],{"type":31,"value":843},"The KV cache drops from 213 GB to 7.6 GB. Unlike MQA's blunt sharing, MLA preserves per-head expressiveness through the learned compression. DeepSeek V2, V3, and R1 all use MLA — it's arguably the most important attention innovation since Flash Attention.",{"type":26,"tag":640,"props":845,"children":847},{"variant":846},"mla",[],{"type":26,"tag":53,"props":849,"children":850},{},[],{"type":26,"tag":611,"props":852,"children":854},{"id":853},"replacing-attention-entirely",[855],{"type":31,"value":856},"Replacing Attention Entirely",{"type":26,"tag":27,"props":858,"children":859},{},[860],{"type":31,"value":861},"Some researchers asked a more radical question: what if we skip attention altogether?",{"type":26,"tag":863,"props":864,"children":865},"mamba-vs-attention",{},[],{"type":26,"tag":27,"props":867,"children":868},{},[869,874],{"type":26,"tag":111,"props":870,"children":871},{},[872],{"type":31,"value":873},"State-space models",{"type":31,"value":875}," like Mamba process sequences in linear time — no n × n matrix, no KV cache. They maintain a fixed-size hidden state that gets updated with each token, like a rolling summary. The cost is constant per token regardless of how long the sequence is.",{"type":26,"tag":27,"props":877,"children":878},{},[879],{"type":31,"value":880},"The catch: pure SSMs struggle with precise recall. If you need to find one specific fact buried in 100,000 tokens, the fixed-size state can't always hold it. 
Attention excels at exactly this — reaching back to any specific position.",{"type":26,"tag":27,"props":882,"children":883},{},[884,886,891,893,897],{"type":31,"value":885},"The solution: ",{"type":26,"tag":111,"props":887,"children":888},{},[889],{"type":31,"value":890},"hybrid architectures",{"type":31,"value":892},". NVIDIA's Nemotron 3 replaces most layers with Mamba-2 and keeps only a few GQA attention layers (with just 2 KV heads) for precise retrieval. ",{"type":26,"tag":550,"props":894,"children":896},{"n":895},"14",[],{"type":31,"value":898}," The result: 1M-token context with 3.3× higher throughput than a pure transformer of similar size.",{"type":26,"tag":53,"props":900,"children":901},{},[],{"type":26,"tag":27,"props":903,"children":904},{},[905,907,910,912,915,917,921,923,926],{"type":31,"value":906},"In practice, modern models stack all three strategies. Llama 3 uses Flash Attention + GQA. ",{"type":26,"tag":550,"props":908,"children":909},{"n":794},[],{"type":31,"value":911}," DeepSeek V3 combines Flash + MLA + MoE. ",{"type":26,"tag":550,"props":913,"children":914},{"n":835},[],{"type":31,"value":916}," GLM-4 uses Flash Attention + GQA for 128K–1M token contexts. ",{"type":26,"tag":550,"props":918,"children":920},{"n":919},"13",[],{"type":31,"value":922}," Nemotron 3 uses Mamba-2 + GQA + MoE. ",{"type":26,"tag":550,"props":924,"children":925},{"n":895},[],{"type":31,"value":927}," The quadratic wall didn't fall in one blow. It was chipped away from every angle until it stopped mattering.",{"type":26,"tag":53,"props":929,"children":930},{},[],{"type":26,"tag":57,"props":932,"children":934},{"id":933},"shrinking-the-numbers",[935],{"type":31,"value":936},"Shrinking the Numbers",{"type":26,"tag":27,"props":938,"children":939},{},[940,942,947],{"type":31,"value":941},"Even with faster attention, there's a blunter cost: memory. Every parameter is a floating-point number. At full precision (FP32), each one takes 4 bytes. 
GPT-3's 175 billion parameters at FP32 = ",{"type":26,"tag":111,"props":943,"children":944},{},[945],{"type":31,"value":946},"700 GB",{"type":31,"value":948}," — more than any single GPU can hold.",{"type":26,"tag":27,"props":950,"children":951},{},[952],{"type":31,"value":953},"The first trick: use smaller numbers during training. FP16 (16-bit floating point) cuts memory in half. But FP16 has a narrow dynamic range — gradients can overflow or underflow mid-training. BF16 (bfloat16) solved this: it keeps FP32's 8-bit exponent (same range) but shrinks the mantissa (less precision). The trade-off: you lose some decimal accuracy but the numbers never blow up. Google designed BF16 specifically for deep learning, and by 2022 it was the default for most large model training.",{"type":26,"tag":27,"props":955,"children":956},{},[957,959,964,966,971,973,978],{"type":31,"value":958},"In practice, training uses both: the forward and backward passes run in BF16 for speed, but a master copy of the weights stays in FP32. The model ",{"type":26,"tag":195,"props":960,"children":961},{},[962],{"type":31,"value":963},"thinks",{"type":31,"value":965}," in low precision but ",{"type":26,"tag":195,"props":967,"children":968},{},[969],{"type":31,"value":970},"remembers",{"type":31,"value":972}," in full precision. This is ",{"type":26,"tag":111,"props":974,"children":975},{},[976],{"type":31,"value":977},"mixed-precision training",{"type":31,"value":979},".",{"type":26,"tag":27,"props":981,"children":982},{},[983,985,990,992],{"type":31,"value":984},"After training, you can compress further. ",{"type":26,"tag":111,"props":986,"children":987},{},[988],{"type":31,"value":989},"INT8 quantization",{"type":31,"value":991}," maps floating-point weights to 8-bit integers — 4× smaller than FP32, 2× smaller than FP16. Dettmers et al. 
showed this works on models up to 175B parameters with virtually no performance loss, using a clever trick: the ~0.1% of weights with extreme values stay in FP16, while the other 99.9% compress to INT8. ",{"type":26,"tag":550,"props":993,"children":995},{"n":994},"9",[],{"type":26,"tag":27,"props":997,"children":998},{},[999,1004,1006],{"type":26,"tag":111,"props":1000,"children":1001},{},[1002],{"type":31,"value":1003},"INT4",{"type":31,"value":1005}," pushes further — 8× compression from FP32. GPTQ showed you can compress a 175B model to 3–4 bits per parameter and run it on a single GPU for the first time. ",{"type":26,"tag":550,"props":1007,"children":1009},{"n":1008},"10",[],{"type":26,"tag":27,"props":1011,"children":1012},{},[1013],{"type":31,"value":1014},"A 70-billion parameter model that once required a server cluster now fits on a laptop with a gaming GPU. Quantization didn't just make AI cheaper — it democratized it.",{"type":26,"tag":27,"props":1016,"children":1017},{},[1018],{"type":31,"value":1019},"Drag the slider to see how model size and precision format change the memory bill — and which hardware can actually hold the result.",{"type":26,"tag":1021,"props":1022,"children":1023},"quantization-memory",{},[],{"type":26,"tag":53,"props":1025,"children":1026},{},[],{"type":26,"tag":57,"props":1028,"children":1030},{"id":1029},"what-a-token-really-is",[1031],{"type":31,"value":1032},"What a token really is",{"type":26,"tag":27,"props":1034,"children":1035},{},[1036],{"type":31,"value":1037},"By 2020, the AI world had split in two.",{"type":26,"tag":27,"props":1039,"children":1040},{},[1041,1046],{"type":26,"tag":111,"props":1042,"children":1043},{},[1044],{"type":31,"value":1045},"Encoders",{"type":31,"value":1047}," like BERT: narrow tasks, short contexts, safe, reliable. 
You fine-tuned one model per problem and slept well at night.",{"type":26,"tag":27,"props":1049,"children":1050},{},[1051,1056,1058,1063],{"type":26,"tag":111,"props":1052,"children":1053},{},[1054],{"type":31,"value":1055},"Decoders",{"type":31,"value":1057}," like GPT-3: could do almost anything. Not ",{"type":26,"tag":195,"props":1059,"children":1060},{},[1061],{"type":31,"value":1062},"reliably",{"type":31,"value":1064},", but the range was staggering. Poetry, Python, legal briefs, meatball recipes — all from one model, no fine-tuning required. The ultimate autocomplete — stunningly capable, completely unreliable.",{"type":26,"tag":27,"props":1066,"children":1067},{},[1068],{"type":31,"value":1069},"But something else was quietly brewing in the architecture.",{"type":26,"tag":27,"props":1071,"children":1072},{},[1073,1075,1080,1082,1087],{"type":31,"value":1074},"Every Transformer — encoder, decoder, text-to-text — speaks the same language: ",{"type":26,"tag":111,"props":1076,"children":1077},{},[1078],{"type":31,"value":1079},"tokens",{"type":31,"value":1081},". And a token is just a number. Nothing in the math ",{"type":26,"tag":195,"props":1083,"children":1084},{},[1085],{"type":31,"value":1086},"requires",{"type":31,"value":1088}," it to represent a word.",{"type":26,"tag":27,"props":1090,"children":1091},{},[1092],{"type":26,"tag":111,"props":1093,"children":1094},{},[1095,1097,1102],{"type":31,"value":1096},"We taught the rock to read. We taught it to write. We grew it until the world noticed. What happens when we teach it to listen? 
To ",{"type":26,"tag":195,"props":1098,"children":1099},{},[1100],{"type":31,"value":1101},"see",{"type":31,"value":1103},"?",{"type":26,"tag":27,"props":1105,"children":1106},{},[1107],{"type":26,"tag":111,"props":1108,"children":1109},{},[1110,1112,1118],{"type":31,"value":1111},"Read ",{"type":26,"tag":525,"props":1113,"children":1115},{"href":1114},"/articles/how-to-teach-a-rock-to-see",[1116],{"type":31,"value":1117},"Part 5: How to teach a rock to see",{"type":31,"value":1119}," →",{"type":26,"tag":1121,"props":1122,"children":1124},"sources",{":sources":1123},"[{\"n\":1,\"author\":\"Radford, A. et al. — OpenAI\",\"title\":\"Improving Language Understanding by Generative Pre-Training (GPT-1)\",\"url\":\"https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf\"},{\"n\":2,\"author\":\"Radford, A. et al. — OpenAI\",\"title\":\"Language Models are Unsupervised Multitask Learners (GPT-2)\",\"url\":\"https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf\"},{\"n\":3,\"author\":\"Brown, T. et al. — OpenAI\",\"title\":\"Language Models are Few-Shot Learners (GPT-3)\",\"url\":\"https://arxiv.org/abs/2005.14165\"},{\"n\":4,\"author\":\"Kaplan, J. et al. — OpenAI\",\"title\":\"Scaling Laws for Neural Language Models\",\"url\":\"https://arxiv.org/abs/2001.08361\"},{\"n\":5,\"author\":\"Hoffmann, J. et al. — DeepMind\",\"title\":\"Training Compute-Optimal Large Language Models (Chinchilla)\",\"url\":\"https://arxiv.org/abs/2203.15556\"},{\"n\":6,\"author\":\"Dubey, A. et al. — Meta AI\",\"title\":\"The Llama 3 Herd of Models\",\"url\":\"https://arxiv.org/abs/2407.21783\"},{\"n\":7,\"author\":\"Dao, T. et al. 
— Stanford\",\"title\":\"FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness\",\"url\":\"https://arxiv.org/abs/2205.14135\"},{\"n\":8,\"author\":\"Epoch AI\",\"title\":\"Notable AI Models — Parameter counts and training compute\",\"url\":\"https://epoch.ai/data/notable-ai-models\"},{\"n\":9,\"author\":\"Dettmers, T. et al. — University of Washington\",\"title\":\"LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale\",\"url\":\"https://arxiv.org/abs/2208.07339\"},{\"n\":10,\"author\":\"Frantar, E. et al. — IST Austria\",\"title\":\"GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers\",\"url\":\"https://arxiv.org/abs/2210.17323\"},{\"n\":11,\"author\":\"Jiang, A. et al. — Mistral AI\",\"title\":\"Mistral 7B\",\"url\":\"https://arxiv.org/abs/2310.06825\"},{\"n\":12,\"author\":\"DeepSeek-AI\",\"title\":\"DeepSeek-V3 Technical Report\",\"url\":\"https://arxiv.org/abs/2412.19437\"},{\"n\":13,\"author\":\"GLM Team — Zhipu AI\",\"title\":\"ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools\",\"url\":\"https://arxiv.org/abs/2406.12793\"},{\"n\":14,\"author\":\"NVIDIA\",\"title\":\"NVIDIA Nemotron 3: Efficient and Open 
Intelligence\",\"url\":\"https://arxiv.org/abs/2512.20856\"}]",[],{"title":9,"searchDepth":480,"depth":480,"links":1126},[1127,1128,1134,1135],{"id":515,"depth":480,"text":518},{"id":589,"depth":480,"text":592,"children":1129},[1130,1132,1133],{"id":613,"depth":1131,"text":616},3,{"id":730,"depth":1131,"text":733},{"id":853,"depth":1131,"text":856},{"id":933,"depth":480,"text":936},{"id":1029,"depth":480,"text":1032},"content:article:en:how-to-grow-a-rock.md","article/en/how-to-grow-a-rock.md","article/en/how-to-grow-a-rock",{"_path":1140,"_dir":7,"_draft":8,"_partial":8,"_locale":9,"title":1141,"description":1142,"order":1143,"titleClass":502,"cardClass":502,"translationSlug":1144,"img":1145,"date":505,"tag":506,"featured":8,"series":507,"seriesPart":1146,"seriesTitle":509,"author":1147,"body":1148,"_type":491,"_id":1510,"_source":493,"_file":1511,"_stem":1512,"_extension":496},"/article/en/how-to-teach-a-rock-to-see","How to teach a rock to see","From pixels to poetry — how one architecture learned to see, hear, and dream.",8,"hur-man-lar-en-sten-att-se","/img/teach-rock-see-hero.svg",5,{"name":19,"bio":20,"authorImage":21},{"type":23,"children":1149,"toc":1502},[1150,1156,1167,1177,1208,1212,1224,1227,1233,1250,1255,1265,1269,1274,1278,1281,1287,1317,1321,1331,1334,1340,1359,1363,1374,1390,1393,1399,1409,1413,1443,1446,1452,1457,1473,1484,1498],{"type":26,"tag":57,"props":1151,"children":1153},{"id":1152},"the-universal-language",[1154],{"type":31,"value":1155},"The universal language",{"type":26,"tag":27,"props":1157,"children":1158},{},[1159,1165],{"type":26,"tag":525,"props":1160,"children":1162},{"href":1161},"/articles/how-to-grow-a-rock",[1163],{"type":31,"value":1164},"Part 4",{"type":31,"value":1166}," ended with a quiet revelation. Every Transformer — encoder, decoder, text-to-text — speaks the same language: tokens. 
And a token is just a number.",{"type":26,"tag":27,"props":1168,"children":1169},{},[1170,1172],{"type":31,"value":1171},"Nothing in the math requires that number to represent a word. It could represent a pixel. A sound. A frame of video. Researchers looked at the Transformer and asked: ",{"type":26,"tag":195,"props":1173,"children":1174},{},[1175],{"type":31,"value":1176},"what if we just... feed it something else?",{"type":26,"tag":27,"props":1178,"children":1179},{},[1180,1182,1186,1188,1193,1195,1200,1201,1206],{"type":31,"value":1181},"But what does \"token\" actually mean for text? Computers don't see words the way we do. They break text into ",{"type":26,"tag":111,"props":1183,"children":1184},{},[1185],{"type":31,"value":1079},{"type":31,"value":1187}," — chunks that might be whole words, word pieces, or even individual characters. The word \"unhappiest\" might become three pieces: ",{"type":26,"tag":195,"props":1189,"children":1190},{},[1191],{"type":31,"value":1192},"\"un\"",{"type":31,"value":1194},", ",{"type":26,"tag":195,"props":1196,"children":1197},{},[1198],{"type":31,"value":1199},"\"happi\"",{"type":31,"value":1194},{"type":26,"tag":195,"props":1202,"children":1203},{},[1204],{"type":31,"value":1205},"\"est\"",{"type":31,"value":1207},". Common words like \"the\" stay whole. Rare words get split. Different models split differently — two models looking at the same sentence can produce completely different tokens.",{"type":26,"tag":1209,"props":1210,"children":1211},"tokenizer-demo",{},[],{"type":26,"tag":27,"props":1213,"children":1214},{},[1215,1217,1222],{"type":31,"value":1216},"That's the text side. 
But the Transformer doesn't care ",{"type":26,"tag":195,"props":1218,"children":1219},{},[1220],{"type":31,"value":1221},"where",{"type":31,"value":1223}," the numbers come from.",{"type":26,"tag":53,"props":1225,"children":1226},{},[],{"type":26,"tag":57,"props":1228,"children":1230},{"id":1229},"teaching-a-rock-to-see",[1231],{"type":31,"value":1232},"Teaching a rock to see",{"type":26,"tag":27,"props":1234,"children":1235},{},[1236,1238,1244,1246],{"type":31,"value":1237},"The answer turned out to be almost embarrassingly simple. Take a photo. Chop it into a grid of 16×16 patches. Flatten each patch into a vector — exactly like a word embedding from ",{"type":26,"tag":525,"props":1239,"children":1241},{"href":1240},"/articles/how-to-teach-a-rock-words",[1242],{"type":31,"value":1243},"Part 1",{"type":31,"value":1245},". Feed the sequence into a Transformer. ",{"type":26,"tag":550,"props":1247,"children":1249},{"n":1248},"1",[],{"type":26,"tag":27,"props":1251,"children":1252},{},[1253],{"type":31,"value":1254},"Same attention mechanism from Part 2. Same architecture. But now, instead of words attending to words, image patches attend to image patches. The cat's ear learns that the whiskers matter. The sky learns to ignore the ground.",{"type":26,"tag":27,"props":1256,"children":1257},{},[1258,1260],{"type":31,"value":1259},"The paper's title said it all: ",{"type":26,"tag":195,"props":1261,"children":1262},{},[1263],{"type":31,"value":1264},"\"An Image is Worth 16×16 Words.\"",{"type":26,"tag":1266,"props":1267,"children":1268},"image-to-tokens",{},[],{"type":26,"tag":27,"props":1270,"children":1271},{},[1272],{"type":31,"value":1273},"And it worked. Not sort-of worked — it matched or beat the best image classifiers in the world, models that had been purpose-built for vision over a decade. The Transformer didn't care that these weren't words. 
Tokens are tokens.",{"type":26,"tag":1275,"props":1276,"children":1277},"vi-t-vector-space",{},[],{"type":26,"tag":53,"props":1279,"children":1280},{},[],{"type":26,"tag":57,"props":1282,"children":1284},{"id":1283},"connecting-eyes-and-ears",[1285],{"type":31,"value":1286},"Connecting eyes and ears",{"type":26,"tag":27,"props":1288,"children":1289},{},[1290,1292,1297,1299,1303,1305,1309,1311,1315],{"type":31,"value":1291},"If images, audio, and text are all just sequences of numbers, could you put them in the ",{"type":26,"tag":195,"props":1293,"children":1294},{},[1295],{"type":31,"value":1296},"same",{"type":31,"value":1298}," space? OpenAI's CLIP ",{"type":26,"tag":550,"props":1300,"children":1302},{"n":1301},"2",[],{"type":31,"value":1304}," trained two encoders — one for images, one for text — pushing matching pairs close together across 400 million image-caption pairs. The result was the vector space from ",{"type":26,"tag":525,"props":1306,"children":1307},{"href":1240},[1308],{"type":31,"value":1243},{"type":31,"value":1310}," — but now words ",{"type":26,"tag":195,"props":1312,"children":1313},{},[1314],{"type":31,"value":321},{"type":31,"value":1316}," images lived in it.",{"type":26,"tag":1318,"props":1319,"children":1320},"clip-vector-space",{},[],{"type":26,"tag":27,"props":1322,"children":1323},{},[1324,1326,1329],{"type":31,"value":1325},"Whisper ",{"type":26,"tag":550,"props":1327,"children":1328},{"n":552},[],{"type":31,"value":1330}," took it further: point the encoder-decoder Transformer at spectrograms and let it \"translate\" speech into text. 
The same architecture that translated English to French, now translating sound to words.",{"type":26,"tag":53,"props":1332,"children":1333},{},[],{"type":26,"tag":57,"props":1335,"children":1337},{"id":1336},"looking-into-the-space",[1338],{"type":31,"value":1339},"Looking into the space",{"type":26,"tag":27,"props":1341,"children":1342},{},[1343,1345,1350,1352,1357],{"type":31,"value":1344},"With images and text in the same space, we could do something new: look ",{"type":26,"tag":195,"props":1346,"children":1347},{},[1348],{"type":31,"value":1349},"between",{"type":31,"value":1351}," concepts. What lives in the middle of \"lemon,\" \"dwarf,\" and an image of a robot? In 2021, my team at Labelf tried to find out. We hooked BigGAN — an image generator from 2018 — up to CLIP. CLIP picks a position in the multimodal space based on the prompt \"lemon dwarf robot,\" and BigGAN tries to paint what that position looks like. Frame by frame, CLIP steers, BigGAN renders. (BigGAN is old and a mediocre painter — the visuals are an ",{"type":26,"tag":195,"props":1353,"children":1354},{},[1355],{"type":31,"value":1356},"approximation",{"type":31,"value":1358}," of what the space contains, not a perfect rendering. Don't fixate on the artifacts.)",{"type":26,"tag":1360,"props":1361,"children":1362},"clip-dream-steering",{},[],{"type":26,"tag":27,"props":1364,"children":1365},{},[1366,1368,1372],{"type":31,"value":1367},"But look past BigGAN's limitations and watch the scales: the lemon-scale, the dwarf-scale, the robot-scale. The sphere never goes fully to one concept — it always retains traces of the others. You're watching the geometry of the space move. All the patterns, all the connections between concepts — that's what lives in this geometry. 
And for the first time, we could actually ",{"type":26,"tag":195,"props":1369,"children":1370},{},[1371],{"type":31,"value":1101},{"type":31,"value":1373}," it.",{"type":26,"tag":27,"props":1375,"children":1376},{},[1377,1379,1383,1385,1388],{"type":31,"value":1378},"DALL-E ",{"type":26,"tag":550,"props":1380,"children":1382},{"n":1381},"3",[],{"type":31,"value":1384}," went further: text in, image out. Stable Diffusion ",{"type":26,"tag":550,"props":1386,"children":1387},{"n":564},[],{"type":31,"value":1389}," made it open-source and fast enough to run on a laptop. The Transformer wasn't just reading the world. It was drawing it.",{"type":26,"tag":53,"props":1391,"children":1392},{},[],{"type":26,"tag":57,"props":1394,"children":1396},{"id":1395},"teaching-a-rock-to-hear",[1397],{"type":31,"value":1398},"Teaching a rock to hear",{"type":26,"tag":27,"props":1400,"children":1401},{},[1402,1404,1407],{"type":31,"value":1403},"Sound is just a grid of frequencies. Convert audio to a mel spectrogram — a heatmap of time vs. frequency — and it looks like an image. The Audio Spectrogram Transformer ",{"type":26,"tag":550,"props":1405,"children":1406},{"n":794},[],{"type":31,"value":1408}," did exactly what ViT did: chop it into 16×16 patches and feed them into a Transformer. Same architecture, no audio-specific tricks. 
Tokens are tokens.",{"type":26,"tag":1410,"props":1411,"children":1412},"audio-to-tokens",{},[],{"type":26,"tag":27,"props":1414,"children":1415},{},[1416,1418,1421,1423,1428,1430,1435,1437,1441],{"type":31,"value":1417},"Meta's MusicGen ",{"type":26,"tag":550,"props":1419,"children":1420},{"n":695},[],{"type":31,"value":1422}," flipped it: instead of an encoder ",{"type":26,"tag":195,"props":1424,"children":1425},{},[1426],{"type":31,"value":1427},"reading",{"type":31,"value":1429}," audio tokens, a decoder ",{"type":26,"tag":195,"props":1431,"children":1432},{},[1433],{"type":31,"value":1434},"writes",{"type":31,"value":1436}," them — predicting the next one autoregressively, exactly like GPT predicts the next word. Same architecture as ",{"type":26,"tag":525,"props":1438,"children":1439},{"href":527},[1440],{"type":31,"value":530},{"type":31,"value":1442},". Different tokens.",{"type":26,"tag":53,"props":1444,"children":1445},{},[],{"type":26,"tag":57,"props":1447,"children":1449},{"id":1448},"one-architecture-every-sense",[1450],{"type":31,"value":1451},"One architecture, every sense",{"type":26,"tag":27,"props":1453,"children":1454},{},[1455],{"type":31,"value":1456},"By 2022 the same Transformer architecture — unchanged since 2017 — was reading text, classifying images, transcribing speech, and generating art. No one redesigned it. They just changed what the tokens represented. 
With multimodality solved and scaling laws in full bloom, there were no fundamental breakthroughs left to wait for.",{"type":26,"tag":1458,"props":1459,"children":1460},"impact-statement",{},[1461],{"type":26,"tag":27,"props":1462,"children":1463},{},[1464,1466,1471],{"type":31,"value":1465},"It was a ",{"type":26,"tag":111,"props":1467,"children":1468},{},[1469],{"type":31,"value":1470},"time, steering, data, and funding",{"type":31,"value":1472}," game now.",{"type":26,"tag":27,"props":1474,"children":1475},{},[1476,1478,1483],{"type":31,"value":1477},"But the rock still didn't know you were talking to it. What happens when someone teaches it to ",{"type":26,"tag":195,"props":1479,"children":1480},{},[1481],{"type":31,"value":1482},"listen",{"type":31,"value":1103},{"type":26,"tag":27,"props":1485,"children":1486},{},[1487],{"type":26,"tag":111,"props":1488,"children":1489},{},[1490,1491,1497],{"type":31,"value":1111},{"type":26,"tag":525,"props":1492,"children":1494},{"href":1493},"/articles/how-to-teach-a-rock-to-talk",[1495],{"type":31,"value":1496},"Part 6: How to teach a rock to talk",{"type":31,"value":1119},{"type":26,"tag":1121,"props":1499,"children":1501},{":sources":1500},"[{\"n\":1,\"author\":\"Dosovitskiy, A. et al. — Google Research\",\"title\":\"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT)\",\"url\":\"https://arxiv.org/abs/2010.11929\"},{\"n\":2,\"author\":\"Radford, A. et al. — OpenAI\",\"title\":\"Learning Transferable Visual Models From Natural Language Supervision (CLIP)\",\"url\":\"https://arxiv.org/abs/2103.00020\"},{\"n\":3,\"author\":\"Ramesh, A. et al. — OpenAI\",\"title\":\"Zero-Shot Text-to-Image Generation (DALL-E)\",\"url\":\"https://arxiv.org/abs/2102.12092\"},{\"n\":4,\"author\":\"Radford, A. et al. — OpenAI\",\"title\":\"Robust Speech Recognition via Large-Scale Weak Supervision (Whisper)\",\"url\":\"https://arxiv.org/abs/2212.04356\"},{\"n\":5,\"author\":\"Rombach, R. et al. 
— CompVis / Stability AI\",\"title\":\"High-Resolution Image Synthesis with Latent Diffusion Models (Stable Diffusion)\",\"url\":\"https://arxiv.org/abs/2112.10752\"},{\"n\":6,\"author\":\"Gong, Y., Chung, Y.-A. & Glass, J. — MIT CSAIL\",\"title\":\"AST: Audio Spectrogram Transformer (Interspeech 2021)\",\"url\":\"https://arxiv.org/abs/2104.01778\"},{\"n\":7,\"author\":\"Copet, J. et al. — Meta AI\",\"title\":\"Simple and Controllable Music Generation (MusicGen)\",\"url\":\"https://arxiv.org/abs/2306.05284\"}]",[],{"title":9,"searchDepth":480,"depth":480,"links":1503},[1504,1505,1506,1507,1508,1509],{"id":1152,"depth":480,"text":1155},{"id":1229,"depth":480,"text":1232},{"id":1283,"depth":480,"text":1286},{"id":1336,"depth":480,"text":1339},{"id":1395,"depth":480,"text":1398},{"id":1448,"depth":480,"text":1451},"content:article:en:how-to-teach-a-rock-to-see.md","article/en/how-to-teach-a-rock-to-see.md","article/en/how-to-teach-a-rock-to-see",1779221120785]