From Turing to LLMs and Beyond · Issue 7 of 10
Issue 7 · 2017–2023

Attention Is All You Need

โ† Previous Issue: Machines That Learn

Imagine trying to understand a novel, but you can only see ONE word at a time, and you have to memorize everything on a single sticky note. That was language AI for twenty years. Then eight researchers changed everything. Tera

In Issue 6, we watched neural networks learn to see. They could look at a photograph and tell you "that's a cat." They could scan an X-ray and spot a tumor. Vision was conquered.

But language? Language was a different beast.

By 2016, the best language AI -- systems that translated text, answered questions, or tried to summarize documents -- all worked the same way. They read text one word at a time, left to right. Each word updated a small bundle of numbers called a hidden state -- the model's attempt to remember everything it had read so far.

This approach had a name: Recurrent Neural Networks, or RNNs. And it had a fatal flaw.

Figure: The RNN Bottleneck. Reading "The cat sat on the mat because it was tired," the hidden state is a fading memory: by the time the model reaches "it," the memory of "cat" has faded. What if every word could look at every other word, all at once?
The Problem: RNNs read language like someone peering through a keyhole -- one word at a time, trying to remember everything on a tiny scrap of memory. The model could not go back. It could not look ahead. And the longer the text, the more it forgot.

An improvement called LSTM (Long Short-Term Memory, invented by Sepp Hochreiter and Jürgen Schmidhuber in 1997) gave the model a slightly bigger, better-organized sticky note. It helped, but the fundamental problem remained: one word at a time, everything compressed into a fixed-size memory.

The field needed a breakthrough. Not an incremental improvement. A completely different way of reading.

The Problem Was Real -- and Getting Worse

People had been patching the RNN problem for years. One important fix -- called "attention" -- let the model peek back at earlier words. It was a band-aid on a broken leg. But it hinted at something much bigger. Tera

In November 2016, Google deployed a neural machine translation system for Google Translate, replacing a decades-old phrase-based approach. Millions of people used it daily.

But the new system was built on RNNs with a bolt-on fix called attention. This idea, published in 2014 by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, was a genuine breakthrough. Instead of forcing the decoder to rely on a single compressed vector, attention let it "look back" at different parts of the input while generating each output word.

Figure: The Attention Breakthrough (2014). Before: a basic RNN squeezes the entire sentence into a single tiny vector, like liquid through a straw. After: an RNN with attention lets the decoder "look back" at each word in the source sentence (Bahdanau, Cho & Bengio, 2014). The question that followed: "Attention was the best part. The RNN was the bottleneck. What if attention was ALL you needed?"

Several researchers at Google Brain and Google Research were growing frustrated. The attention mechanism was the best part of these models. The RNN backbone was the bottleneck. What if you kept the attention and threw away everything else?

It was a radical thought. RNNs had been the foundation of sequence processing for years. But the idea would not go away.

Think About It: Think about how you read a complicated sentence. Do you read it one word at a time, left to right, never going back? Or do your eyes jump around -- rereading the beginning, skipping ahead, connecting words that are far apart? Which approach sounds more like how understanding actually works?

"Attention Is All You Need" -- The Paper That Rewired AI

Eight authors. Twelve pages. One audacious claim: you could throw away the ENTIRE recurrent framework and replace it with a single mechanism. They titled the paper like a mic drop -- and they were right. Tera

On June 12, 2017, the paper appeared on arXiv. Its title -- "Attention Is All You Need" -- was deliberately provocative. Some of the authors reportedly worried it was too bold.

The core claim was stunning in its simplicity: a model built entirely from attention mechanisms -- no recurrence, no convolutions -- could achieve state-of-the-art results on machine translation. And it could be trained in a fraction of the time.

They called the architecture the Transformer.

The team was led by Ashish Vaswani from India (PhD from USC), joined by Noam Shazeer (a legendary Google engineer), and six others: Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez (a University of Toronto intern), Łukasz Kaiser, and Illia Polosukhin.

The Paper at a Glance

Model                       BLEU (EN-DE)   Training Time
Previous best (RNN-based)   ~26            Days to weeks
Transformer (base)          27.3           12 hours (8 GPUs)
Transformer (big)           28.4           3.5 days (8 GPUs)

Better results. A fraction of the time.

The paper was presented at NeurIPS 2017 in Long Beach, California. The reception was electric. As of 2024, it has been cited over 130,000 times -- one of the most cited papers in computer science history.

Figure: The Diaspora -- one paper, eight authors, six-plus companies. Vaswani → Essential AI. Shazeer → Character.AI. Gomez → Cohere. Polosukhin → NEAR Protocol. Parmar → Adept AI. Jones → Sakana AI. Kaiser → research. Uszkoreit → research. The paper's greatest product was not the Transformer -- it was the startup founders.
Great inventions outgrow their inventors' intentions. The Transformer paper was written to solve machine translation. Its authors did not set out to create the foundation of ChatGPT, Claude, or the entire modern AI industry. But the architecture they built was so powerful and so general that it became exactly that.

Self-Attention -- Every Word Looks at Every Other Word

Okay, this is the big one. Self-attention is the heart of the Transformer. For every word, the model computes three things: a Query, a Key, and a Value. Once you understand this, everything else clicks. Ready? Tera

For every word (technically, every token) in a sentence, the model computes three things:

Query (Q): "What am I looking for?" -- like raising your hand with a question.

Key (K): "What do I have to offer?" -- like wearing a name tag describing your expertise.

Value (V): "Here is my actual information." -- the content itself.

Each word compares its Query against every other word's Key. Words whose Keys match the Query well get high attention scores. Those scores are used to weight the Values.

Self-Attention Step by Step

Figure: Multi-Head Attention -- different heads, different relationships. Head 1 (coreference): "it" → "cat". Head 2 (adjective-noun): "tired" → "cat". Head 3 (subject-verb): "sat" → "cat". Combined: the full picture, with 8 heads running in parallel. The formula: Attention(Q, K, V) = softmax(Q · K^T / sqrt(d_k)) · V
The Key Insight: Self-attention lets every word in a sentence look at every other word simultaneously. Instead of reading left to right and hoping to remember, the model builds a complete web of relationships in a single step. This is why Transformers can be trained in parallel on modern GPUs.
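To make this concrete, here is a minimal sketch of scaled dot-product attention in plain NumPy. The projection matrices are random stand-ins for learned weights (a real model learns them during training and processes batches of sequences):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of token vectors X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v      # what I seek / what I offer / what I carry
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # every Query scored against every Key
    weights = softmax(scores)                # each row: an attention distribution over all tokens
    return weights @ V                       # blend the Values by attention weight

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                  # 5 tokens, embedding size 8 (stand-in embeddings)
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (5, 8): one updated vector per token
```

Multi-head attention simply runs several copies of this in parallel with smaller per-head dimensions and concatenates the results -- which is what lets each head specialize in one kind of relationship.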

Tokenization -- How "Unhappiness" Becomes [un][happi][ness]

Computers don't understand words. They understand numbers. So we cut words into smart little pieces called tokens. The trick is balancing vocabulary size with meaning. Tera

How many words are there in English? Somewhere between 170,000 and over a million. Add every language, typos, slang, URLs, emoji. The number is effectively infinite.

The solution is tokenization: splitting text into subword pieces using Byte-Pair Encoding (BPE), originally a data compression algorithm from 1994, adapted for NLP by Rico Sennrich, Barry Haddow, and Alexandra Birch in 2016.

Byte-Pair Encoding in Action

How Words Get Tokenized

Word            Tokens                 Count
"the"           [the]                  1
"computer"      [comput][er]           2
"unhappiness"   [un][happi][ness]      3
"sdkjfhsk"      [s][dk][j][f][hs][k]   6

Common words = fewer tokens. Rare/gibberish = more tokens.
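To show the idea behind BPE, here is a hedged sketch of its core training loop: repeatedly find the most frequent adjacent pair of symbols and merge it into a new token. Real tokenizers (following Sennrich et al.) add byte-level handling and special tokens, and train on billions of words; the toy corpus below is just for illustration:

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of words, starting from single characters."""
    vocab = Counter(tuple(w) for w in words)          # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq                   # count adjacent symbol pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)              # most frequent pair wins
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():           # replace the winning pair everywhere
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2                            # the pair becomes one new symbol
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Fragments like "happ" recur across words, so they get merged into subword tokens early
corpus = ["unhappiness", "happy", "happiness", "unhappy"] * 10
print(learn_bpe_merges(corpus, 5))
```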

One more essential ingredient: positional encoding. Since the Transformer processes all tokens in parallel, it has no idea what order they are in. "Dog bites man" and "Man bites dog" would look identical. The original Transformer solved this by adding a unique mathematical pattern (based on sine and cosine waves) to each token's representation -- like stamping an address on each puzzle piece.
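The sinusoidal encoding from the original paper fits in a few lines. This sketch follows the published formulas: even dimensions get a sine, odd dimensions a cosine, at geometrically spaced wavelengths, so every position receives a unique, smoothly varying "address":

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(seq_len)[:, None]              # column of positions 0..seq_len-1
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)  # one wavelength per dimension pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=50, d_model=8)
print(pe[0][:4])  # each row gets added to that position's token embedding
```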

Think About It: Try tokenizing a word yourself! The word "unbreakable" would probably become [un][break][able]. The model handles ANY word, even one it has never seen before, by breaking it into familiar fragments.

The Transformer Architecture -- A Visual Walkthrough

Here it is -- the architecture that powers almost every major AI system today. Let me walk you through it, floor by floor. Tera
Figure: The Transformer Blueprint.
ENCODER ("Le chat est sur le tapis"): Token Embedding + Position → [Self-Attention → Feed-Forward, each with Add & Norm] × 6 → Encoder Output.
DECODER ("The cat is on"): Token Embedding + Position → [Masked Self-Attention → Cross-Attention (reading the Encoder Output) → Feed-Forward, each with Add & Norm] × 6 → Softmax → next word: "the".
Every component can be computed in parallel. This is why Transformers train so much faster than RNNs.

Later models would use the architecture differently. BERT (2018, Google) used only the encoder. Led by Jacob Devlin, BERT achieved state-of-the-art results on 11 benchmarks simultaneously, pioneering the "pre-train then fine-tune" approach. GPT (2018, OpenAI) used only the decoder. But all of them are descendants of this paper.

The Breakthrough Is Combination. The Transformer stacks a small set of ideas -- attention, feed-forward networks, residual connections, normalization -- into a repeating pattern. No single component is revolutionary. But combined in this specific way, they produce something far greater than the sum of their parts.
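To see how those pieces snap together, here is a minimal sketch of a single encoder layer in NumPy. All weights are random stand-ins, there is only one attention head, and dropout is omitted; the point is the repeating pattern of attention, feed-forward, and Add & Norm:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean, unit variance."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def encoder_layer(X, W):
    # Sub-layer 1: self-attention, then residual connection and normalization
    Q, K, V = X @ W["q"], X @ W["k"], X @ W["v"]
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V
    X = layer_norm(X + attn)                        # "Add & Norm"
    # Sub-layer 2: feed-forward network, applied to every token independently
    ff = np.maximum(0, X @ W["ff1"]) @ W["ff2"]     # two linear maps with a ReLU between
    return layer_norm(X + ff)                       # "Add & Norm" again

rng = np.random.default_rng(0)
d, d_ff = 8, 32
W = {name: rng.normal(size=shape) * 0.1
     for name, shape in [("q", (d, d)), ("k", (d, d)), ("v", (d, d)),
                         ("ff1", (d, d_ff)), ("ff2", (d_ff, d))]}
X = rng.normal(size=(5, d))                         # 5 token vectors in...
print(encoder_layer(X, W).shape)                    # ...(5, 8) out; stack this x6 for the encoder
```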

Scaling -- GPT-1, GPT-2, GPT-3, and the Surprise

This is where things get wild. A quiet researcher at OpenAI takes the decoder half, trains it to predict the next word, then asks: what happens if we make it 10x bigger? 100x? 1,000x? The answer shook the world. Tera

In June 2018, OpenAI researcher Alec Radford published GPT-1 with co-authors including Ilya Sutskever. The idea: take the Transformer's decoder, train it to predict the next word on about 7,000 unpublished books. 117 million parameters, 12 layers.

In February 2019, GPT-2 arrived: 1.5 billion parameters, trained on 40GB of web text. OpenAI withheld the full model, calling it "too dangerous to release." The controversy made headlines.

Then the earthquake. In June 2020, GPT-3: 175 billion parameters, trained on 570GB of text. Training cost an estimated $4.6 million (per a widely-cited Lambda Labs analysis -- OpenAI has not disclosed the actual figure).

Figure: The Scaling Ladder. GPT-1: 117M parameters (Jun 2018) → GPT-2: 1.5B parameters (Feb 2019) → GPT-3: 175B parameters (Jun 2020). 1,500x larger in two years.
Prompt given to GPT-3:

    Translate English to French:
    sea otter => loutre de mer
    cheese => fromage
    hello =>

GPT-3's completion: bonjour

Nobody trained GPT-3 on translation. It learned the PATTERN from the examples.
Emergent Abilities: Make a model 10x bigger, and it sometimes learns entirely new skills that the smaller version could not do at all. Nobody programmed these abilities in. They emerged from the sheer scale of pattern-learning. This is the scaling hypothesis -- and it is still debated. Some researchers argue these "emergent" jumps may be artifacts of how we measure performance rather than genuine phase transitions (Schaeffer et al., 2023). The debate is unresolved. In 2022, DeepMind's Chinchilla paper showed that many models, including GPT-3, were likely undertrained -- a smaller model trained on more data could match their performance. Size alone is not the whole story.
Think About It: GPT-3 was trained only to predict the next word. That is its ENTIRE training objective. Yet from this simple task, it learned grammar, facts, translation, coding, and basic reasoning. Is "predict the next word" really a simple task? Or is it secretly the hardest task there is?
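That objective is simple enough to state in code. Here is a minimal sketch of the next-token loss; random logits stand in for a real model's outputs, but everything in GPT training ultimately minimizes this one number:

```python
import numpy as np

def next_token_loss(logits, next_ids):
    """Average cross-entropy: how surprised was the model by each actual next token?"""
    probs = np.exp(logits - logits.max(-1, keepdims=True))  # softmax over the vocabulary
    probs /= probs.sum(-1, keepdims=True)
    return -np.mean(np.log(probs[np.arange(len(next_ids)), next_ids]))

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))      # 4 positions, vocabulary of 10 (stand-in values)
print(next_token_loss(logits, np.array([3, 1, 4, 1])))
```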

RLHF -- Teaching AI to Be Helpful, Honest, and Harmless

A base language model is like a brilliant student who has read the entire internet -- the good, the bad, and the terrible. RLHF is like giving that student a thoughtful tutor who teaches a BETTER way to answer. Tera

A raw language model trained only on next-word prediction is a mirror of its training data. The internet contains medical advice and misinformation, poetry and hate speech, helpful tutorials and scam emails. A model trained to predict "what comes next" will produce all of it.

The answer was Reinforcement Learning from Human Feedback, or RLHF. A key precursor came in 2020, when OpenAI's Nisan Stiennon and colleagues showed that RLHF could improve text summarization -- proving the concept.

Figure: The RLHF Pipeline.
Stage 1 (Supervised Fine-Tuning): humans write ideal responses → the model learns from them.
Stage 2 (Reward Model): the model writes 4 responses; a human ranks them (A > B > C > D); a "judge" model learns the preferences.
Stage 3 (Reinforcement Learning): the model generates a response, the reward model scores it, the model adjusts → repeat.
The InstructGPT revelation (Jan 2022): a 1.3B-parameter model with RLHF was preferred over the raw 175B GPT-3. Values beat raw scale.
ChatGPT growth: 1 million users in 5 days (from Nov 30); ~100 million in ~2 months. TikTok took ~9 months to reach the same milestone.
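A hedged sketch of the Stage 2 idea: the reward model is trained so that responses the human preferred score higher than responses the human rejected. The pairwise loss used in InstructGPT-style setups is -log sigmoid(r_chosen - r_rejected); the scores below are made-up stand-ins for a real reward model's outputs:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Small when the preferred response out-scores the rejected one, large otherwise."""
    return -math.log(1 / (1 + math.exp(-(reward_chosen - reward_rejected))))

print(round(preference_loss(2.0, -1.0), 3))   # ~0.049: the judge agrees with the human
print(round(preference_loss(-1.0, 2.0), 3))   # ~3.049: the judge disagrees, gets corrected
```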

On November 30, 2022, ChatGPT -- built on GPT-3.5 with RLHF -- launched as a free "research preview." It reached 1 million users in 5 days and approximately 100 million monthly active users by January 2023 -- the fastest-growing consumer application in history at that time.

Two Different Problems: RLHF is not a complete solution to AI safety. Models can still hallucinate, be tricked, and produce harmful outputs. But it represents a crucial insight: training a model to be capable and training it to be good are two different problems, and both require deliberate effort.

Behind the elegant diagram lies a human reality: the "human feedback" in RLHF comes from thousands of workers -- often low-paid contractors in developing countries -- who spend hours ranking model outputs. Their labor is invisible but essential.

Constitutional AI and Claude -- Building AI with Principles

How do you teach a machine to be good? RLHF uses human judges, but humans are expensive and can disagree. What if you gave the AI a written set of rules and said "judge yourself"? Tera

In 2021, Dario Amodei (PhD in computational neuroscience from Princeton, formerly VP of Research at OpenAI) and Daniela Amodei (Stanford International Relations, formerly VP of Operations at OpenAI) left to found Anthropic, bringing roughly ten former OpenAI colleagues with them. Anthropic was structured as a public benefit corporation prioritizing safety alongside capability.

In December 2022, Anthropic published Constitutional AI (CAI). The core idea: instead of relying on case-by-case human judgments, you write a constitution -- a set of explicit principles -- and use it to guide behavior.
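In outline, the supervised phase of CAI is a critique-and-revise loop. The sketch below assumes a hypothetical generate(prompt) function standing in for a real language-model call; the loop structure, not the model, is the point. In the full method, the revised answers train a new model, and a second phase swaps human preference labels for AI preference labels:

```python
# A toy constitution: real ones contain many carefully worded principles
CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid responses that are toxic, dangerous, or deceptive.",
]

def constitutional_revision(prompt, generate, n_rounds=2):
    """Have the model critique and revise its own answer against each principle."""
    response = generate(prompt)                  # generate() is a hypothetical LLM call
    for _ in range(n_rounds):
        for principle in CONSTITUTION:
            critique = generate(
                f"Principle: {principle}\nResponse: {response}\n"
                "Critique: does the response violate the principle?")
            response = generate(
                f"Response: {response}\nCritique: {critique}\n"
                "Rewrite the response to fix the problems identified.")
    return response  # revised answers become training data for the next model
```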

RLHF vs. Constitutional AI

                RLHF                     Constitutional AI
Who judges?     Human evaluators         AI + written principles
Scalability     Limited by human labor   Highly scalable
Transparency    Implicit preferences     Principles written down
Consistency     Varies between judges    Same rules every time
In short        Expensive but proven     Auditable and reproducible

Both approaches have tradeoffs. The field uses elements of both.
Figure: The LLM Landscape (2018-2023). 2018: GPT-1, BERT. 2019: GPT-2. 2020: GPT-3. 2022: ChatGPT, PaLM. 2023: Claude 1, GPT-4; (Jul 2023) Claude 2, Llama 2. From a single translation paper to a global industry in six years.
Think About It: Human societies use constitutions to encode values that outlast any individual leader. Could AI constitutions serve the same purpose? If you could write the rules that govern an AI's behavior, what principles would YOU include?

The Longest Thread -- and What Comes Next

We have come so far. Turing imagined a machine that could follow any instruction. Neural networks learned to see. And now Transformers have learned to read, write, and something that looks remarkably like reasoning. But they are still trapped in a text box. They can write code but cannot run it. What happens when we set them free? Tera
Figure: The Thread -- from Turing's tape to the attention mechanism. (1) Turing Machine → (2) Real Computers → (3) Programming Languages → (4) Unix & Networks → (5) The Web & Data → (6) Neural Networks → (7) Transformers & LLMs. Key connections: backprop (Issue 6) still trains every Transformer; GPUs (Issue 6) are the essential hardware; the Web (Issue 5) provides the training data; open source (Issues 4-5) accelerates it all.

The connections run deep. Backpropagation (Issue 6) is still the training algorithm. GPUs went from video game hardware to the most valuable computing resource on Earth. The World Wide Web (Issue 5) created the data -- Common Crawl is the backbone of every major LLM's training set. Open source (Issues 4-5) accelerated everything.

But the story also carries difficult questions. In their 2021 paper "On the Dangers of Stochastic Parrots," Emily Bender, Timnit Gebru, Angelina McMillan-Major, and Margaret Mitchell raised critical concerns: LLMs may be sophisticated pattern-matchers rather than genuine understanders, and scaling them has real environmental and social costs -- from the energy consumed in training to the biased data they absorb. Gebru and Mitchell were both fired from Google in connection with this research, and the controversy remains a flashpoint about corporate control of AI ethics research. Their core argument -- that we should ask not just "can we build it?" but "should we, and at what cost?" -- remains unresolved.

While this story follows the GPT lineage, the landscape is far broader. Meta released the weights of its Llama models (first under a research license, then more permissively with Llama 2), enabling thousands of researchers worldwide. The BigScience project brought together 1,000+ researchers from 60 countries to build BLOOM. Labs worldwide built their own frontier models. The Transformer's power comes partly from the fact that it is an architecture anyone can build on.

The Full Stack: Every revolution in computing follows the same pattern: someone builds a new layer of abstraction on top of everything that came before. The Transformer did not replace neural networks, backpropagation, or GPUs. It stood on all of them. Understanding the full stack -- from Turing's tape to the attention mechanism -- is what separates people who use AI from people who understand it.
Next Issue: The text box shatters. LLMs learn to use tools -- to run code, read files, browse the web, and take real-world actions. The age of AI agents begins. Issue 8: "The Agent Awakens" →

References & Further Reading