p1_01_dark_library

In Issue 6, we watched neural networks learn to see. They could look at a photograph and say "that's a cat." Vision was conquered. But language? Language was a different beast.

TAPE: RNNs read one word at a time. Left to right.

Each word updated a small bundle of numbers called a hidden state -- the model's attempt to remember everything it had read so far. Like rewriting a summary after every chapter. By chapter 20, the details from chapter 1 have faded.

p1_05_radiance_library
CLOCKWORK: Then eight researchers changed everything.
p1_07_attention_thread

An improvement called LSTM (Long Short-Term Memory, 1997) added special "gates" to choose what to remember and forget. But the fundamental problem remained: one word at a time, everything compressed into a fixed-size memory, and no way to go back.

[Figure: The RNN Bottleneck. "The cat sat on the mat because it..." passes through hidden states h1-h8, and the memory of "cat" has faded by the time we reach "it". What if every word could see every other word at once? Then "it" could attend directly -- with strong attention -- to "mat".]

RNNs process one word at a time -- memory degrades. Self-attention connects every word directly.

RNNs read language like someone peering through a keyhole -- one word at a time, trying to remember everything on a tiny scrap of memory. The model could not go back. It could not look ahead. And the longer the text, the more old information degraded.

In June 2017, eight researchers at Google would propose a radical answer: throw away the entire sequential machinery. Let every word see every other word, all at once. ▶
p2_01_google_hq

In November 2016, Google deployed neural machine translation for Google Translate, replacing a decades-old phrase-based approach. Millions of people used it daily. Even small improvements mattered at that scale.

PIXEL: Attention helped. But it was chained to the bottleneck.

The fix was called attention -- published in 2014 by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Instead of compressing the entire input into one vector, attention let the decoder "look back" at different parts of the input for each output word. It worked. But the underlying architecture was still an RNN.

[Figure: Before and after. Basic RNN: the source "Le chat est sur le tapis" is squeezed into a single vector -- everything through a straw -- and the decoder tries to reconstruct it: lost details, degraded memory. RNN + attention: the decoder focuses on the relevant source words with direct access -- much better. Bahdanau, Cho & Bengio (2014): attention as an add-on to RNNs. "But the RNN backbone was still the bottleneck."]

Attention (2014) let the decoder look back at source words -- but was still bolted onto a slow RNN.

Several researchers at Google were growing frustrated. The attention mechanism was the best part. The RNN backbone was the bottleneck. What if you kept the attention and threw away everything else?

p2_07_vaswani
SPROUT: What if attention was ALL you needed?

Think about how you read a complicated sentence. Do you read it one word at a time, left to right, never going back? Or do your eyes jump around -- rereading the beginning, skipping ahead, connecting words that are far apart? Which approach sounds more like how understanding actually works?

On June 12, 2017, the team uploaded their paper to arXiv. They gave it a title that was either brilliantly confident or recklessly bold. It was both. ▶
p3_01_arxiv_paper

Eight authors. Twelve pages. One audacious claim: you could throw away the entire recurrent framework and replace it with a single mechanism -- attention alone. They called the architecture the Transformer.

CHALKBOT: They titled the paper like a mic drop.
The Paper's Key Results
Model | BLEU (EN-DE) | Training
Previous best (RNN) | ~26 | Days to weeks
Transformer (base) | 28.4 | 12 hours (8 GPUs)
Better results. A fraction of the time. As of 2024: over 130,000 citations.

The Transformer beat previous translation models in quality AND training speed.

p3_05_neurips

The speed improvement was even more dramatic. Because every word could attend to every other word simultaneously, the entire computation could be spread across GPU cores in parallel. No more waiting for word five to finish before starting word six.

[Figure: The Diaspora -- one paper, eight authors, at least six companies. "Attention Is All You Need": Vaswani -> Essential AI; Shazeer -> Character.AI; Gomez -> Cohere; Polosukhin -> NEAR Protocol; Parmar -> Adept AI; Jones -> Sakana AI; Kaiser and Uszkoreit -> research. "The paper's greatest product was not the Transformer -- it was the startup founders."]

Most of the eight authors eventually left Google to found AI companies.

The Transformer paper was written to solve machine translation. Its authors did not set out to create the foundation of ChatGPT, Claude, or the entire modern AI industry. But the architecture was so powerful and so general that it became exactly that. Great inventions often outgrow their inventors' intentions.

The Transformer worked. But what was it, exactly? How does "attention" actually function? Time to open the hood. ▶
SCHOLAR: I'll use an analogy, then the real thing. Ready?

Imagine you are at a party. Fifty people, all talking. An RNN listens to one person at a time, in order. By person 30, person 1 is a blur. Self-attention works differently: you can hear everyone simultaneously and instantly decide who is most relevant to what you need to understand right now.

[Figure: Self-attention -- Query, Key, Value. Query (Q): "What am I looking for?" (raising your hand to ask). Key (K): "What do I have to offer?" (wearing a name tag). Value (V): "Here is my information" (the actual content itself). Example: in "The cat sat on the mat because it was soft," the query from "it" asks "What noun am I?" -- attention to "cat" is medium, to "mat" HIGH, to the other words low. Result: "it" now carries information mostly from "mat". Attention(Q, K, V) = softmax(Q * K^T / sqrt(d)) * V]

Each word computes Query, Key, Value. Queries match against Keys to determine attention weights.
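That recipe -- match Queries to Keys, scale, turn the scores into percentages, then weight the Values -- can be sketched in a few lines of NumPy. This is a toy illustration with random vectors, not a trained model; the function names and shapes here are ours, not the paper's.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # how well each query matches each key
    weights = softmax(scores)       # each row becomes percentages summing to 1
    return weights @ V, weights     # each output is a weighted blend of values

# Toy example: 3 tokens, each represented by a 4-dimensional vector.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out, weights = attention(Q, K, V)
print(out.shape)  # (3, 4): one blended vector per token
```

In a real Transformer, Q, K, and V are not random -- they are produced from the token embeddings by learned projection matrices.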

Now here is the clever part: multi-head attention. Instead of doing this once, the Transformer runs multiple attention operations in parallel (the original paper used 8 "heads"). Each head learns to focus on different kinds of relationships.

[Figure: Multi-head attention -- different heads, different relationships. For "The cat sat on the mat because it was soft": Head 1 learns coreference (linking "it" to "mat"), Head 2 learns adjective-noun links ("soft" describing "mat"), Head 3 learns subject-verb links ("cat" and "sat"). Different heads learn different relationships; together, they capture the full picture. The formula (simplified): Attention(Q, K, V) = softmax(Q * K^T / sqrt(d)) * V -- match Queries to Keys, scale, turn into percentages, then weight the Values.]

Each attention head captures a different type of relationship between words.
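Under the same toy assumptions, multi-head attention can be sketched by giving each head its own slice of the embedding. In the real architecture each head instead gets its own learned projections (W_Q, W_K, W_V) plus a final output projection W_O, all of which we omit here for brevity.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, num_heads):
    """Run attention independently per head, then concatenate the results.
    Toy version: no learned projection matrices."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        # Each head sees only its own slice of the embedding dimensions,
        # so each head can learn a different kind of relationship.
        Xh = X[:, h * d_head:(h + 1) * d_head]
        scores = Xh @ Xh.T / np.sqrt(d_head)
        heads.append(softmax(scores) @ Xh)
    return np.concatenate(heads, axis=-1)  # back to (seq_len, d_model)

X = np.random.default_rng(1).normal(size=(6, 8))  # 6 tokens, d_model = 8
print(multi_head_attention(X, num_heads=2).shape)  # (6, 8)
```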

Self-attention lets every word in a sentence look at every other word simultaneously. Instead of reading left to right and hoping to remember, the model builds a complete web of relationships in a single step. This is why Transformers are so powerful -- and why they can be trained in parallel on modern GPUs.

But wait. If every word looks at every other word simultaneously, how does the model know what ORDER the words are in? "The cat ate the fish" and "The fish ate the cat" have the same words. ▶
p5_01_text_machine
TINBOT: We cut words into smart pieces.

How many words are in English? Between 170,000 and over a million. Add every language, typos, slang, URLs, emoji -- the number is effectively infinite. We can't give every word its own number. But individual letters carry almost no meaning. The solution: tokenization -- splitting text into subword pieces.

[Figure: Byte-Pair Encoding (BPE) in action. Start: l o w e r (5 tokens). Step 1: e+r merged. Step 2: l+o merged. Step 3: lo+w merged. Final merge: low+er -- "lower" is 1 token. How different words get tokenized: "the" -> [the] (1 token); "unhappiness" -> [un][happi][ness] (3 tokens).]

BPE merges frequent character pairs until common words become single tokens.

The result is elegant. Common words like "the" are single tokens. Rare words get split: "unhappiness" becomes [un][happi][ness]. The model handles any word, even one it has never seen before, by breaking it into familiar fragments.
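The merge procedure can be sketched as a toy, single-word version of BPE. Real BPE learns its merge table from pair frequencies across an entire corpus, so the exact merge order in this sketch (driven by tie-breaking) may differ from the figure's.

```python
from collections import Counter

def bpe_merge_steps(word, num_merges):
    """Toy BPE on one word: repeatedly merge the most frequent adjacent pair.
    (Real BPE counts pairs over a whole corpus, not a single word.)"""
    tokens = list(word)
    for _ in range(num_merges):
        if len(tokens) < 2:
            break
        pairs = Counter(zip(tokens, tokens[1:]))
        a, b = pairs.most_common(1)[0][0]  # most frequent adjacent pair
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)  # fuse the pair into one token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

print(bpe_merge_steps("lower", 0))  # ['l', 'o', 'w', 'e', 'r']
print(bpe_merge_steps("lower", 4))  # ['lower']
```

Each merge shrinks the token list by one, so four merges take "lower" from five characters to a single token, matching the figure.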

p5_06_position_encoding
STARLIGHT: Positional encoding stamps each token with its address.

Since the Transformer processes all tokens in parallel, it has no idea what order they are in. "Dog bites man" and "Man bites dog" would look identical. The solution: positional encoding -- stamping each token with a unique mathematical pattern. Think of it like putting an address on each puzzle piece: "I am token number 7 in this sequence."
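A minimal sketch of the paper's sinusoidal positional encoding: each position gets a unique pattern of sine and cosine values, which is added to the token's embedding so the model can read its "address".

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from the original paper:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(seq_len)[:, None]          # column of positions
    i = np.arange(d_model // 2)[None, :]       # row of frequency indices
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)               # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
print(pe.shape)    # (10, 16)
print(pe[0, :4])   # position 0: sin(0)=0, cos(0)=1 -> [0. 1. 0. 1.]
```

Learned positional embeddings (used by GPT) are a common alternative; the sinusoidal version has the advantage of extending to sequence lengths never seen in training.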

Try tokenizing a word yourself! The word "unbreakable" would probably become [un][break][able]. Now try "antidisestablishmentarianism" -- how many pieces do you think that becomes?

POSBOT: Where a word sits changes its meaning entirely.
Now we have all the ingredients: self-attention to find relationships, tokenization to turn text into numbers, and positional encoding to preserve word order. Time to put it all together. ▶
CLOCKWORK: Let me walk you through it, floor by floor.

The original Transformer has two halves: an encoder (which reads the input) and a decoder (which generates the output). For translation, the encoder reads the French sentence, and the decoder writes the English version.

[Figure: The Transformer blueprint. ENCODER (left tower): input "Le chat est sur le tapis" -> token embedding + position -> self-attention (every word attends to every word) -> add & normalize -> feed-forward network (each token "thinks" independently) -> add & normalize, repeated x6 layers -> encoder output. DECODER (right tower): output so far, "The cat is on" -> token embedding + position -> masked self-attention (can only see earlier tokens) -> cross-attention (reads from the encoder output) -> add & normalize -> feed-forward network -> add & normalize, repeated x6 layers -> linear + softmax -> next word: "the". Every component can be computed in parallel -- no sequential waiting. This is why Transformers train so much faster than RNNs.]

The Transformer: Encoder reads input, Decoder generates output, Cross-Attention bridges them.

The architecture stacks a small set of ideas -- attention, feed-forward networks, residual connections, normalization -- into a repeating pattern. Later models used it differently: BERT (2018) used only the encoder. GPT (2018) used only the decoder. But all are descendants of this paper.
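Putting the floors together, one encoder layer can be sketched as attention followed by a feed-forward network, each wrapped in "Add & Normalize". To stay short, this toy version omits the learned attention projections, layer normalization's scale/shift parameters, and dropout.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token's vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def encoder_block(x, W_ff1, W_ff2):
    # 1) Self-attention (projections omitted), plus residual + norm.
    d = x.shape[-1]
    attn = softmax(x @ x.T / np.sqrt(d)) @ x
    x = layer_norm(x + attn)                  # "Add & Normalize"
    # 2) Position-wise feed-forward network, plus residual + norm.
    ff = np.maximum(0, x @ W_ff1) @ W_ff2     # ReLU hidden layer
    return layer_norm(x + ff)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))                   # 6 tokens, d_model = 8
out = encoder_block(x, rng.normal(size=(8, 32)), rng.normal(size=(32, 8)))
print(out.shape)  # (6, 8)
```

Input and output shapes match, which is exactly what lets the paper stack six identical layers on top of each other.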

p6_05_parallel_scene

The breakthrough is not that any single component is revolutionary. It's that these ingredients, combined in this specific way, produce something far greater than the sum of their parts. And everything can run in parallel on GPUs.

The Transformer was designed for translation. But a small team at OpenAI had a different idea: what if you threw away the encoder, kept only the decoder, and just kept making it bigger? ▶
p7_01_scaling_buildings
SPROUT: The answer shook the world.

In June 2018, OpenAI researcher Alec Radford published GPT-1: the Transformer's decoder trained to predict the next word. 117 million parameters, 12 layers, trained on ~7,000 books. It worked well. Not spectacularly. But enough to ask: what if the only limit was scale?

p7_04_gpt2_controversy

In February 2019, GPT-2 arrived: 1.5 billion parameters, 48 layers. Ten times larger. Then OpenAI announced it was withholding the full model due to misuse concerns. Critics called it publicity-seeking. Supporters called it responsible foresight. Either way, the public was paying attention to language AI for the first time.

PIXEL: GPT-3 did tasks it was never trained for!
[Figure: Few-shot learning. Prompt given to GPT-3: "Translate English to French: / sea otter => loutre de mer / cheese => fromage / hello =>" GPT-3: "bonjour". Never trained as a translator -- it learned the pattern from the examples in the prompt.]

GPT-3 could learn new tasks from just a few examples in the prompt.

The Scaling Ladder
Model | Date | Parameters | Training Data | Key Capability
GPT-1 | Jun 2018 | 117M | ~800M words | Basic understanding
GPT-2 | Feb 2019 | 1.5B | 40GB web text | Coherent paragraphs
GPT-3 | Jun 2020 | 175B | 570GB | Few-shot learning

Each generation was roughly 10x larger -- and each unlocked qualitatively new abilities.

Scaling revealed a strange law: make a model 10x bigger, and it does not just get 10% better. It sometimes learns entirely new skills the smaller version could not do at all. Nobody programmed these abilities in. They emerged from sheer scale.

GPT-3 was trained only to predict the next word. Yet it learned grammar, facts, translation, coding, and reasoning. Is "predict the next word" really a simple task? Or is it secretly the hardest task there is -- because to predict perfectly, you would need to understand everything?

SCALEBOT: More data plus more compute equals emergent abilities.
Scaling was working. But raw next-word prediction had a problem: the model would generate toxic content, fabricate facts, and follow harmful instructions. Someone needed to teach it values. ▶
p8_01_split_image

A raw language model is a mirror of its training data. The internet contains medical advice and misinformation, poetry and hate speech. A model trained to predict "what comes next" will produce all of it. Making models bigger made them more capable AND more dangerous.

CHALKBOT: RLHF is like a tutor showing a better way.
[Figure: The RLHF pipeline, in 3 steps. Step 1 -- supervised fine-tuning: a human writes the ideal response and the model learns the helpful style ("when someone asks X, a good response looks like Y"). Step 2 -- reward model training: the model writes 4 responses, humans rank them (A > B > C > D), and a "judge" model is trained on those preferences. Step 3 -- reinforcement learning: the model generates, the reward model scores, and the model is adjusted to produce higher-scoring responses -- repeated thousands of times.]

RLHF: Show examples, train a judge on human preferences, then optimize the model.
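The judge in step 2 is typically trained with a pairwise preference loss, as in the InstructGPT paper: given the reward model's scalar scores for the human-preferred and the rejected response, the loss is -log sigmoid(r_chosen - r_rejected). A minimal sketch, assuming those two scores are already computed:

```python
import math

def preference_loss(r_chosen, r_rejected):
    """Pairwise preference loss for training an RLHF reward model:
    -log sigmoid(r_chosen - r_rejected).
    The loss shrinks as the judge scores the human-preferred response higher."""
    return -math.log(1 / (1 + math.exp(-(r_chosen - r_rejected))))

print(round(preference_loss(2.0, 0.0), 3))  # 0.127 -- judge agrees with humans
print(round(preference_loss(0.0, 2.0), 3))  # 2.127 -- judge disagrees, big penalty
```

Minimizing this loss over thousands of ranked pairs is what turns scattered human judgments into a single scoring function the reinforcement-learning step can optimize against.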

In January 2022, OpenAI published InstructGPT: a 1.3-billion-parameter model fine-tuned with RLHF was preferred by humans over the raw 175-billion-parameter GPT-3. A smaller model with values beat a larger model without them.

p8_06_chatgpt_launch

On November 30, 2022, ChatGPT launched as a free research preview. It reached 1 million users in 5 days. By January 2023, roughly 100 million monthly active users -- the fastest-growing consumer application in history. For comparison, TikTok took ~9 months to reach the same milestone.

SCHOLAR: Human labelers made alignment possible.

These human evaluators -- often contract workers in Kenya, the Philippines, and other countries, paid far less than the engineers -- are an invisible but essential part of modern AI. Their judgments teach the model what "good" looks like.

RLHF is not a complete solution. Models can still hallucinate, be tricked, and produce harmful outputs. But it represents a crucial insight: training a model to be capable and training it to be good are two different problems, and both require deliberate effort.

FACTORYBOT: Millions of workers, processing language in parallel.
RLHF was a breakthrough. But it depended on thousands of costly, inconsistent human judgments. What if the AI could learn values from a written set of principles instead -- a constitution? ▶
p9_01_founding

In 2021, Dario Amodei and Daniela Amodei left OpenAI and founded Anthropic. Dario -- PhD in computational neuroscience, former VP of Research at OpenAI -- believed AI safety could not be an afterthought. It needed to be the mission.

STARLIGHT: What if the AI could judge itself using written rules?

In December 2022, Anthropic published Constitutional AI (CAI). The core idea: instead of relying on case-by-case human judgments, write a constitution -- a set of explicit principles -- and use it to guide behavior. Like how human societies work.

[Figure: The Constitutional AI loop. 1. The model generates a response (including to tricky prompts). 2. The model reads a constitutional principle: "Is this response honest about uncertainty? Is it helpful? Could it cause harm?" 3. The model critiques its own response: "No, I stated a guess as a fact." 4. The model revises its response, adding hedging language and acknowledging limitations. The revised response becomes training data.]

Constitutional AI: the model critiques and revises its own responses based on written principles.

RLHF vs. Constitutional AI
Aspect | RLHF | Constitutional AI
Who judges? | Human evaluators | AI + written principles
Scalability | Limited by human labor | Reduced human labor
Transparency | Implicit in preferences | Explicit, written down
Consistency | Varies between judges | Same principles every time

Constitutional AI front-loads human judgment into written principles rather than case-by-case rankings.

p9_07_landscape

Claude 1 launched in March 2023. Claude 2 followed in July 2023 with a 100,000-token context window -- enough to process entire books. Meanwhile, Google released PaLM, Meta released Llama 2 as open source, and GPT-4 added multimodal capability.

Human societies use constitutions to encode values that outlast any individual leader. Could AI constitutions serve the same purpose? If you could write the rules that govern an AI's behavior, what principles would YOU include?

ETHICSBOT: Helpful, harmless, honest. That is the goal.
From a single translation paper to a global industry in six years. But every layer connects to something deeper. Let's zoom out and see how it all fits together. ▶
p10_01_thread_timeline

Turing imagined a machine that could follow any instruction. Von Neumann built one. Hopper taught it to understand English. The internet connected them all. Neural networks learned to see. And now Transformers have learned to read, write, and reason.

[Figure: Key connections across all issues. Issue 6: backpropagation still trains every Transformer. Issue 6: GPUs are the essential hardware. Issue 5: the Web provides the training data. Issues 4-5: open source accelerates development. Issue 4: composable tools return as AI agents piping data via tools (Issues 8-9). Every revolution builds a new layer of abstraction on top of everything before.]

The Transformer did not replace neural networks, backpropagation, or GPUs. It stood on all of them.

TAPE: But they're still trapped in a text box.

In 2021, Timnit Gebru, Emily Bender, Margaret Mitchell, and Angelina McMillan-Major asked: are LLMs actually understanding language, or just mimicking patterns like a sophisticated parrot? They raised critical questions about environmental cost, biased training data, and whether the race to scale is outpacing our ability to control what we are building.

The "Stochastic Parrots" paper argued that: (1) LLMs do not truly "understand" language; (2) training massive models has enormous environmental cost; (3) biases in training data get amplified at scale; (4) the race to scale is driven by corporate incentives, not just science. These are not fringe concerns -- they are central tensions in the story of modern AI.

p10_07_trapped_textbox

Here is where we are in early 2023: language models can write essays, code, poetry, and analysis. But they are still trapped. An LLM lives in a text box. It can write a Python function, but it cannot run it. It can describe a plan, but it cannot execute it.

Every revolution in computing follows the same pattern: someone builds a new layer of abstraction on top of everything that came before. The Transformer did not replace neural networks, backpropagation, or GPUs. It stood on all of them.

WINDOWBOT: So smart, but trapped in a chat window. Familiar?
In our next issue, the text box shatters. LLMs learn to use tools -- to run code, read files, browse the web, and take real-world actions. The age of AI agents begins. Issue 8: "The Agent Awakens." ▶