From Turing to LLMs and Beyond · Issue 7 of 10
Issue 7 · 2017–2023

Attention Is All You Need

โ† Previous Issue: Machines That Learn

Imagine trying to understand a novel, but you can only see ONE word at a time, and you have to memorize everything on a single sticky note. That was language AI for twenty years. Then eight researchers changed everything. Tera

In Issue 6, we watched neural networks learn to see. They could look at a photograph and tell you "that's a cat." They could scan an X-ray and spot a tumor. Vision was conquered.

But language? Language was a different beast.

By 2016, the best language AI -- systems that translated text, answered questions, or tried to summarize documents -- all worked the same way. They read text one word at a time, left to right. Each word updated a small bundle of numbers called a hidden state -- the model's attempt to remember everything it had read so far.

This approach had a name: Recurrent Neural Networks, or RNNs. And it had a fatal flaw.

Figure: The RNN Bottleneck. Reading "The cat sat on the mat because it was tired," the hidden state is a fading memory: by the time the model reaches "it," the memory of "cat" has faded. What if every word could look at every other word, all at once?
The Problem: RNNs read language like someone peering through a keyhole -- one word at a time, trying to remember everything on a tiny scrap of memory. The model could not go back. It could not look ahead. And the longer the text, the more it forgot.

An improvement called LSTM (Long Short-Term Memory, invented by Sepp Hochreiter and Jürgen Schmidhuber in 1997) gave the model a slightly bigger, better-organized sticky note. It helped, but the fundamental problem remained: one word at a time, everything compressed into a fixed-size memory.

The field needed a breakthrough. Not an incremental improvement. A completely different way of reading.

The Problem Was Real -- and Getting Worse

People had been patching the RNN problem for years. One important fix -- called "attention" -- let the model peek back at earlier words. It was a band-aid on a broken leg. But it hinted at something much bigger. Tera

In November 2016, Google deployed a neural machine translation system for Google Translate, replacing a decades-old phrase-based approach. Millions of people used it daily.

But the new system was built on RNNs with a bolt-on fix called attention. This idea, published in 2014 by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, was a genuine breakthrough. Instead of forcing the decoder to rely on a single compressed vector, attention let it "look back" at different parts of the input while generating each output word.

Figure: The Attention Breakthrough (2014). Before: a basic RNN squeezes the entire sentence into a single tiny vector, like liquid through a straw. After: an RNN with attention lets the decoder "look back" at each word in the source sentence (Bahdanau, Cho & Bengio, 2014). The question that followed: "Attention was the best part. The RNN was the bottleneck. What if attention was ALL you needed?"

Several researchers at Google Brain and Google Research were growing frustrated. The attention mechanism was the best part of these models. The RNN backbone was the bottleneck. What if you kept the attention and threw away everything else?

It was a radical thought. RNNs had been the foundation of sequence processing for years. But the idea would not go away.

Think About It: Think about how you read a complicated sentence. Do you read it one word at a time, left to right, never going back? Or do your eyes jump around -- rereading the beginning, skipping ahead, connecting words that are far apart? Which approach sounds more like how understanding actually works?

"Attention Is All You Need" -- The Paper That Rewired AI

Eight authors. Twelve pages. One audacious claim: you could throw away the ENTIRE recurrent framework and replace it with a single mechanism. They titled the paper like a mic drop -- and they were right. Tera

On June 12, 2017, the paper appeared on arXiv. Its title -- "Attention Is All You Need" -- was deliberately provocative. Some of the authors reportedly worried it was too bold.

The core claim was stunning in its simplicity: a model built entirely from attention mechanisms -- no recurrence, no convolutions -- could achieve state-of-the-art results on machine translation. And it could be trained in a fraction of the time.

They called the architecture the Transformer.

The team was led by Ashish Vaswani from India (PhD from USC), joined by Noam Shazeer (a legendary Google engineer), and six others: Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez (a University of Toronto intern), Łukasz Kaiser, and Illia Polosukhin.

The Paper at a Glance

Model                       BLEU (EN-DE)   Training Time
Previous best (RNN-based)   ~26            Days to weeks
Transformer (base)          27.3           12 hours (8 GPUs)
Transformer (big)           28.4           3.5 days (8 GPUs)

Better results. A fraction of the time.

The paper was presented at NeurIPS 2017 in Long Beach, California. The reception was electric. As of 2024, it has been cited over 130,000 times -- one of the most cited papers in computer science history.

Figure: The Diaspora -- one paper, eight authors, six-plus companies. Vaswani → Essential AI. Shazeer → Character.AI. Gomez → Cohere. Polosukhin → NEAR Protocol. Parmar → Adept AI. Jones → Sakana AI. Kaiser → research. Uszkoreit → research. The paper's greatest product was not the Transformer -- it was the startup founders.
Great inventions outgrow their inventors' intentions. The Transformer paper was written to solve machine translation. Its authors did not set out to create the foundation of ChatGPT, Claude, or the entire modern AI industry. But the architecture they built was so powerful and so general that it became exactly that.

Self-Attention -- Every Word Looks at Every Other Word

Okay, this is the big one. Self-attention is the heart of the Transformer. For every word, the model computes three things: a Query, a Key, and a Value. Once you understand this, everything else clicks. Ready? Tera

For every word (technically, every token) in a sentence, the model computes three things:

Query (Q): "What am I looking for?" -- like raising your hand with a question.

Key (K): "What do I have to offer?" -- like wearing a name tag describing your expertise.

Value (V): "Here is my actual information." -- the content itself.

Each word compares its Query against every other word's Key. Words whose Keys match the Query well get high attention scores. Those scores are used to weight the Values.

Self-Attention Step by Step

Figure: Multi-Head Attention -- different heads, different relationships. Head 1 (coreference): "it" → "cat". Head 2 (adjective-noun): "tired" → "cat". Head 3 (subject-verb): "sat" → "cat". Combined: the full picture, with 8 heads running in parallel. The formula: Attention(Q, K, V) = softmax(Q · K^T / sqrt(d_k)) · V
The Key Insight: Self-attention lets every word in a sentence look at every other word simultaneously. Instead of reading left to right and hoping to remember, the model builds a complete web of relationships in a single step. This is why Transformers can be trained in parallel on modern GPUs.
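To make this concrete, here is a minimal sketch of scaled dot-product attention in plain NumPy. The projection matrices are random stand-ins for learned weights (a real model learns them during training and processes batches of sequences):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of token vectors X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v      # what I seek / what I offer / what I carry
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # every Query scored against every Key
    weights = softmax(scores)                # each row: an attention distribution over all tokens
    return weights @ V                       # blend the Values by attention weight

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                  # 5 tokens, embedding size 8 (stand-in embeddings)
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (5, 8): one updated vector per token
```

Multi-head attention simply runs several copies of this in parallel with smaller per-head dimensions and concatenates the results -- which is what lets each head specialize in one kind of relationship.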

Tokenization -- How "Unhappiness" Becomes [un][happi][ness]

Computers don't understand words. They understand numbers. So we cut words into smart little pieces called tokens. The trick is balancing vocabulary size with meaning. Tera

How many words are there in English? Somewhere between 170,000 and over a million. Add every language, typos, slang, URLs, emoji. The number is effectively infinite.

The solution is tokenization: splitting text into subword pieces using Byte-Pair Encoding (BPE), originally a data compression algorithm from 1994, adapted for NLP by Rico Sennrich, Barry Haddow, and Alexandra Birch in 2016.

Byte-Pair Encoding in Action

How Words Get Tokenized

Word            Tokens                 Count
"the"           [the]                  1
"computer"      [comput][er]           2
"unhappiness"   [un][happi][ness]      3
"sdkjfhsk"      [s][dk][j][f][hs][k]   6

Common words = fewer tokens. Rare/gibberish = more tokens.
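To show the idea behind BPE, here is a hedged sketch of its core training loop: repeatedly find the most frequent adjacent pair of symbols and merge it into a new token. Real tokenizers (following Sennrich et al.) add byte-level handling and special tokens, and train on billions of words; the toy corpus below is just for illustration:

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of words, starting from single characters."""
    vocab = Counter(tuple(w) for w in words)          # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq                   # count adjacent symbol pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)              # most frequent pair wins
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():           # replace the winning pair everywhere
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2                            # the pair becomes one new symbol
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

# Fragments like "happ" recur across words, so they get merged into subword tokens early
corpus = ["unhappiness", "happy", "happiness", "unhappy"] * 10
print(learn_bpe_merges(corpus, 5))
```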

One more essential ingredient: positional encoding. Since the Transformer processes all tokens in parallel, it has no idea what order they are in. "Dog bites man" and "Man bites dog" would look identical. The original Transformer solved this by adding a unique mathematical pattern (based on sine and cosine waves) to each token's representation -- like stamping an address on each puzzle piece.
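The sinusoidal encoding from the original paper fits in a few lines. This sketch follows the published formulas: even dimensions get a sine, odd dimensions a cosine, at geometrically spaced wavelengths, so every position receives a unique, smoothly varying "address":

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(seq_len)[:, None]              # column of positions 0..seq_len-1
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)  # one wavelength per dimension pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=50, d_model=8)
print(pe[0][:4])  # each row gets added to that position's token embedding
```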

Think About It: Try tokenizing a word yourself! The word "unbreakable" would probably become [un][break][able]. The model handles ANY word, even one it has never seen before, by breaking it into familiar fragments.

The Transformer Architecture -- A Visual Walkthrough

Here it is -- the architecture that powers almost every major AI system today. Let me walk you through it, floor by floor. Tera
Figure: The Transformer Blueprint.
ENCODER ("Le chat est sur le tapis"): Token Embedding + Position → [Self-Attention → Feed-Forward, each with Add & Norm] × 6 → Encoder Output.
DECODER ("The cat is on"): Token Embedding + Position → [Masked Self-Attention → Cross-Attention (reading the Encoder Output) → Feed-Forward, each with Add & Norm] × 6 → Softmax → next word: "the".
Every component can be computed in parallel. This is why Transformers train so much faster than RNNs.

Later models would use the architecture differently. BERT (2018, Google) used only the encoder. Led by Jacob Devlin, BERT achieved state-of-the-art results on 11 benchmarks simultaneously, pioneering the "pre-train then fine-tune" approach. GPT (2018, OpenAI) used only the decoder. But all of them are descendants of this paper.

The Breakthrough Is Combination. The Transformer stacks a small set of ideas -- attention, feed-forward networks, residual connections, normalization -- into a repeating pattern. No single component is revolutionary. But combined in this specific way, they produce something far greater than the sum of their parts.
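To see how those pieces snap together, here is a minimal sketch of a single encoder layer in NumPy. All weights are random stand-ins, there is only one attention head, and dropout is omitted; the point is the repeating pattern of attention, feed-forward, and Add & Norm:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean, unit variance."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def encoder_layer(X, W):
    # Sub-layer 1: self-attention, then residual connection and normalization
    Q, K, V = X @ W["q"], X @ W["k"], X @ W["v"]
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V
    X = layer_norm(X + attn)                        # "Add & Norm"
    # Sub-layer 2: feed-forward network, applied to every token independently
    ff = np.maximum(0, X @ W["ff1"]) @ W["ff2"]     # two linear maps with a ReLU between
    return layer_norm(X + ff)                       # "Add & Norm" again

rng = np.random.default_rng(0)
d, d_ff = 8, 32
W = {name: rng.normal(size=shape) * 0.1
     for name, shape in [("q", (d, d)), ("k", (d, d)), ("v", (d, d)),
                         ("ff1", (d, d_ff)), ("ff2", (d_ff, d))]}
X = rng.normal(size=(5, d))                         # 5 token vectors in...
print(encoder_layer(X, W).shape)                    # ...(5, 8) out; stack this x6 for the encoder
```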

Scaling -- GPT-1, GPT-2, GPT-3, and the Surprise

This is where things get wild. A quiet researcher at OpenAI takes the decoder half, trains it to predict the next word, then asks: what happens if we make it 10x bigger? 100x? 1,000x? The answer shook the world. Tera

In June 2018, OpenAI researcher Alec Radford published GPT-1 with co-authors including Ilya Sutskever. The idea: take the Transformer's decoder, train it to predict the next word on about 7,000 unpublished books. 117 million parameters, 12 layers.

In February 2019, GPT-2 arrived: 1.5 billion parameters, trained on 40GB of web text. OpenAI withheld the full model, calling it "too dangerous to release." The controversy made headlines.

Then the earthquake. In June 2020, GPT-3: 175 billion parameters, trained on 570GB of text. Training cost an estimated $4.6 million (per a widely-cited Lambda Labs analysis -- OpenAI has not disclosed the actual figure).

Figure: The Scaling Ladder. GPT-1: 117M parameters (Jun 2018) → GPT-2: 1.5B parameters (Feb 2019) → GPT-3: 175B parameters (Jun 2020). 1,500x larger in two years.
Prompt given to GPT-3:

    Translate English to French:
    sea otter => loutre de mer
    cheese => fromage
    hello =>

GPT-3's completion: bonjour

Nobody trained GPT-3 on translation. It learned the PATTERN from the examples.
Emergent Abilities: Make a model 10x bigger, and it sometimes learns entirely new skills that the smaller version could not do at all. Nobody programmed these abilities in. They emerged from the sheer scale of pattern-learning. This is the scaling hypothesis -- and it is still debated. Some researchers argue these "emergent" jumps may be artifacts of how we measure performance rather than genuine phase transitions (Schaeffer et al., 2023). The debate is unresolved. In 2022, DeepMind's Chinchilla paper showed that many models, including GPT-3, were likely undertrained -- a smaller model trained on more data could match their performance. Size alone is not the whole story.
Think About It: GPT-3 was trained only to predict the next word. That is its ENTIRE training objective. Yet from this simple task, it learned grammar, facts, translation, coding, and basic reasoning. Is "predict the next word" really a simple task? Or is it secretly the hardest task there is?
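That objective is simple enough to state in code. Here is a minimal sketch of the next-token loss; random logits stand in for a real model's outputs, but everything in GPT training ultimately minimizes this one number:

```python
import numpy as np

def next_token_loss(logits, next_ids):
    """Average cross-entropy: how surprised was the model by each actual next token?"""
    probs = np.exp(logits - logits.max(-1, keepdims=True))  # softmax over the vocabulary
    probs /= probs.sum(-1, keepdims=True)
    return -np.mean(np.log(probs[np.arange(len(next_ids)), next_ids]))

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))      # 4 positions, vocabulary of 10 (stand-in values)
print(next_token_loss(logits, np.array([3, 1, 4, 1])))
```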

RLHF -- Teaching AI to Be Helpful, Honest, and Harmless

A base language model is like a brilliant student who has read the entire internet -- the good, the bad, and the terrible. RLHF is like giving that student a thoughtful tutor who teaches a BETTER way to answer. Tera

A raw language model trained only on next-word prediction is a mirror of its training data. The internet contains medical advice and misinformation, poetry and hate speech, helpful tutorials and scam emails. A model trained to predict "what comes next" will produce all of it.

The answer was Reinforcement Learning from Human Feedback, or RLHF. A key precursor came in 2020, when OpenAI's Nisan Stiennon and colleagues showed that RLHF could improve text summarization -- proving the concept.

Figure: The RLHF Pipeline.
Stage 1 (Supervised Fine-Tuning): humans write ideal responses → the model learns from them.
Stage 2 (Reward Model): the model writes 4 responses; a human ranks them (A > B > C > D); a "judge" model learns the preferences.
Stage 3 (Reinforcement Learning): the model generates a response, the reward model scores it, the model adjusts → repeat.
The InstructGPT revelation (Jan 2022): a 1.3B-parameter model with RLHF was preferred over the raw 175B GPT-3. Values beat raw scale.
ChatGPT growth: 1 million users in 5 days (from Nov 30); ~100 million in ~2 months. TikTok took ~9 months to reach the same milestone.
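A hedged sketch of the Stage 2 idea: the reward model is trained so that responses the human preferred score higher than responses the human rejected. The pairwise loss used in InstructGPT-style setups is -log sigmoid(r_chosen - r_rejected); the scores below are made-up stand-ins for a real reward model's outputs:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Small when the preferred response out-scores the rejected one, large otherwise."""
    return -math.log(1 / (1 + math.exp(-(reward_chosen - reward_rejected))))

print(round(preference_loss(2.0, -1.0), 3))   # ~0.049: the judge agrees with the human
print(round(preference_loss(-1.0, 2.0), 3))   # ~3.049: the judge disagrees, gets corrected
```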

On November 30, 2022, ChatGPT -- built on GPT-3.5 with RLHF -- launched as a free "research preview." It reached 1 million users in 5 days and approximately 100 million monthly active users by January 2023 -- the fastest-growing consumer application in history at that time.

Two Different Problems: RLHF is not a complete solution to AI safety. Models can still hallucinate, be tricked, and produce harmful outputs. But it represents a crucial insight: training a model to be capable and training it to be good are two different problems, and both require deliberate effort.

Behind the elegant diagram lies a human reality: the "human feedback" in RLHF comes from thousands of workers -- often low-paid contractors in developing countries -- who spend hours ranking model outputs. Their labor is invisible but essential.

Constitutional AI and Claude -- Building AI with Principles

How do you teach a machine to be good? RLHF uses human judges, but humans are expensive and can disagree. What if you gave the AI a written set of rules and said "judge yourself"? Tera

In 2021, Dario Amodei (PhD in computational neuroscience from Princeton, formerly VP of Research at OpenAI) and Daniela Amodei (Stanford International Relations, formerly VP of Operations at OpenAI) left to found Anthropic, bringing roughly ten former OpenAI colleagues with them. Anthropic was structured as a public benefit corporation prioritizing safety alongside capability.

In December 2022, Anthropic published Constitutional AI (CAI). The core idea: instead of relying on case-by-case human judgments, you write a constitution -- a set of explicit principles -- and use it to guide behavior.
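In outline, the supervised phase of CAI is a critique-and-revise loop. The sketch below assumes a hypothetical generate(prompt) function standing in for a real language-model call; the loop structure, not the model, is the point. In the full method, the revised answers train a new model, and a second phase swaps human preference labels for AI preference labels:

```python
# A toy constitution: real ones contain many carefully worded principles
CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid responses that are toxic, dangerous, or deceptive.",
]

def constitutional_revision(prompt, generate, n_rounds=2):
    """Have the model critique and revise its own answer against each principle."""
    response = generate(prompt)                  # generate() is a hypothetical LLM call
    for _ in range(n_rounds):
        for principle in CONSTITUTION:
            critique = generate(
                f"Principle: {principle}\nResponse: {response}\n"
                "Critique: does the response violate the principle?")
            response = generate(
                f"Response: {response}\nCritique: {critique}\n"
                "Rewrite the response to fix the problems identified.")
    return response  # revised answers become training data for the next model
```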

RLHF vs. Constitutional AI

                RLHF                     Constitutional AI
Who judges?     Human evaluators         AI + written principles
Scalability     Limited by human labor   Highly scalable
Transparency    Implicit preferences     Principles written down
Consistency     Varies between judges    Same rules every time
In short        Expensive but proven     Auditable and reproducible

Both approaches have tradeoffs. The field uses elements of both.
Figure: The LLM Landscape (2018-2023). 2018: GPT-1, BERT. 2019: GPT-2. 2020: GPT-3. 2022: ChatGPT, PaLM. 2023: Claude 1, GPT-4; (Jul 2023) Claude 2, Llama 2. From a single translation paper to a global industry in six years.
Think About It: Human societies use constitutions to encode values that outlast any individual leader. Could AI constitutions serve the same purpose? If you could write the rules that govern an AI's behavior, what principles would YOU include?

The Longest Thread -- and What Comes Next

We have come so far. Turing imagined a machine that could follow any instruction. Neural networks learned to see. And now Transformers have learned to read, write, and something that looks remarkably like reasoning. But they are still trapped in a text box. They can write code but cannot run it. What happens when we set them free? Tera
Figure: The Thread -- from Turing's tape to the attention mechanism. (1) Turing Machine → (2) Real Computers → (3) Programming Languages → (4) Unix & Networks → (5) The Web & Data → (6) Neural Networks → (7) Transformers & LLMs. Key connections: backprop (Issue 6) still trains every Transformer; GPUs (Issue 6) are the essential hardware; the Web (Issue 5) provides the training data; open source (Issues 4-5) accelerates it all.

The connections run deep. Backpropagation (Issue 6) is still the training algorithm. GPUs went from video game hardware to the most valuable computing resource on Earth. The World Wide Web (Issue 5) created the data -- Common Crawl is the backbone of every major LLM's training set. Open source (Issues 4-5) accelerated everything.

But the story also carries difficult questions. In their 2021 paper "On the Dangers of Stochastic Parrots," Emily Bender, Timnit Gebru, Angelina McMillan-Major, and Margaret Mitchell raised critical concerns: LLMs may be sophisticated pattern-matchers rather than genuine understanders, and scaling them has real environmental and social costs -- from the energy consumed in training to the biased data they absorb. Gebru and Mitchell were both fired from Google in connection with this research, and the controversy remains a flashpoint about corporate control of AI ethics research. Their core argument -- that we should ask not just "can we build it?" but "should we, and at what cost?" -- remains unresolved.

While this story follows the GPT lineage, the landscape is far broader. Meta released the weights of its Llama models (first under a research license, then more permissively with Llama 2), enabling thousands of researchers worldwide. The BigScience project brought together 1,000+ researchers from 60 countries to build BLOOM. Labs worldwide built their own frontier models. The Transformer's power comes partly from the fact that it is an architecture anyone can build on.

The Full Stack: Every revolution in computing follows the same pattern: someone builds a new layer of abstraction on top of everything that came before. The Transformer did not replace neural networks, backpropagation, or GPUs. It stood on all of them. Understanding the full stack -- from Turing's tape to the attention mechanism -- is what separates people who use AI from people who understand it.
Next Issue: The text box shatters. LLMs learn to use tools -- to run code, read files, browse the web, and take real-world actions. The age of AI agents begins. Issue 8: "The Agent Awakens" →

References & Further Reading