p1_01_cat_wall
TINBOT: Quick. Describe ALL cats.

In Issue 5, we watched the world get connected. The web, open source, Wikipedia, GitHub -- by the mid-2000s, humanity had assembled the largest collection of knowledge, text, and code ever created.

But computers still could not do something a toddler does effortlessly: look at a photograph and say "cat."

p1_04_edge_detection

For decades, researchers tried the obvious approach: write rules. Detect edges. Measure angles. Count features. Describe what a cat "looks like" in mathematical terms.

The problem? "What a cat looks like" is almost infinitely variable. Has pointy ears? Some cats have floppy ears. Has fur? Sphynx cats are hairless. Has four legs? So does a dog.

SPROUT: A toddler just sees thousands of cats.

A three-year-old identifies every cat instantly. How? Not by following rules. Children learn by seeing thousands of examples. Nobody tells a toddler: "triangular ears at a 47-degree angle." The concept just emerges.

What if machines could learn the same way?

Stop writing rules. Show the machine millions of examples. Let the patterns emerge on their own.

p1_09_researchers_silhouette

This is the story of the people who spent decades proving they could. An idea abandoned by the mainstream, kept alive by a handful of stubborn researchers, and vindicated in a single stunning afternoon in 2012.

For sixty years, AI researchers tried to write rules by hand. They called it "symbolic AI." It worked for chess and logic. It failed at vision, language, and common sense. But there was always another approach -- one inspired by the human brain itself. ▶
p2_01_rosenblatt_machine
BRASSBOT: 1958. Rosenblatt built a machine that learns.

Frank Rosenblatt, a psychologist and computer scientist at Cornell, built the Perceptron -- the first machine that could genuinely learn from data.

The concept was deceptively simple: take inputs, multiply each by a weight, add them up, and check if the sum exceeds a threshold. The breakthrough? The weights were not set by a programmer. The machine adjusted them itself.

[Diagram: The Perceptron (1958). Pixel inputs x1, x2, x3 are multiplied by weights w1, w2, w3, summed (Σ), and compared against a threshold (sum > T? yes/no). KEY INSIGHT: the weights are LEARNED, not programmed. Show it examples. Adjust weights. Repeat.]

A perceptron: inputs multiplied by learned weights, summed, and thresholded.
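That loop -- weighted sum, threshold, adjust -- fits in a few lines. Here is a minimal sketch in Python; the AND task, learning rate, and epoch count are our own toy choices for illustration, not Rosenblatt's implementation:

```python
def predict(weights, bias, inputs):
    # The perceptron's whole decision rule: weighted sum, then threshold.
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0

def train(examples, epochs=20, lr=0.1):
    # Weights start at zero; the learning rule nudges them after every mistake.
    n = len(examples[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - predict(weights, bias, inputs)  # -1, 0, or +1
            # Rosenblatt's update: shift each weight toward the correct answer.
            for i, x in enumerate(inputs):
                weights[i] += lr * error * x
            bias += lr * error
    return weights, bias

# A linearly separable toy task: fire only when both "pixels" are on (AND).
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train(data)
```

After training, the learned weights classify all four examples correctly -- show it examples, adjust weights, repeat.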

p2_05_nyt_headline

July 8, 1958. The New York Times: "New Navy Device Learns by Doing." The hype was enormous: the Navy expected the machine would one day "walk, talk, see, write, reproduce itself and be conscious of its existence."

The hype was also dangerous. Because the Perceptron had real limits.

p2_07_minsky_critique

In 1969, Marvin Minsky and Seymour Papert at MIT published Perceptrons, proving mathematically that a single-layer perceptron could not compute XOR (exclusive or). Their proofs were correct. But their conclusion went further: they expressed deep skepticism that multi-layer networks could overcome these limits.

Funding evaporated almost overnight. A striking detail: Minsky and Rosenblatt had both attended the same high school in the Bronx, then championed opposing visions of AI for their entire careers.

The Perceptron was the seed of modern AI: a machine that learns from examples instead of following hand-written rules. Minsky and Papert proved its limits were real -- but they were the limits of a single layer, not of the idea itself. The field confused "this simple version doesn't work" with "the whole approach is hopeless." That confusion cost two decades.

JELLYBOT: Rosenblatt died in 1971. Age 43.

Frank Rosenblatt died in a boating accident on July 11, 1971. He was 43. He never had the chance to respond to Minsky's critique. The idea he pioneered would not be vindicated for another forty years.

Neural network research went dark. Funding vanished. Researchers who believed in the idea had to hide their work under different names. But a few stubborn scientists refused to give up. And in 1986, they published a paper that changed everything. ▶
p3_01_gradient_landscape
PIXEL: 1986. Three researchers proved Minsky wrong.

The math behind backpropagation had appeared before -- Paul Werbos in 1974, Linnainmaa in 1970, earlier work in control theory. But the 1986 Nature paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams proved something new: backprop could train multi-layer networks to learn meaningful internal representations -- the very thing Minsky doubted.

p3_04_mountain_analogy
Lost in fog. Can't see the valley. But you CAN feel the slope under your feet.

Step 1: Forward pass. Feed an input through the network. Each neuron multiplies inputs by weights, sums them, and passes the result through an activation function.

Step 2: Measure the error. Compare the guess to the correct answer. The difference is the loss.

Step 3: Backward pass. Work backwards using the chain rule. For every weight, ask: would nudging this up make the error go up or down?

[Diagram: Backpropagation -- Step by Step. INPUT → HIDDEN 1 → HIDDEN 2 → OUTPUT "dog" (wrong! correct: cat) → LOSS. Forward pass: signal flows right. Backward pass: gradients flow left. Each weight is nudged to reduce the error. Repeat millions of times.]

Forward pass computes a guess; backward pass traces which weights caused the error.

Step 4: Update weights. Nudge each weight a tiny step in the direction that reduces the error. This is gradient descent -- rolling the ball downhill.

Step 5: Repeat. Do this millions of times, with millions of examples. Gradually, the weights converge on values that produce correct answers.
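The five steps above can be sketched in pure Python. This toy network learns XOR with a single hidden layer; the layer size, random seed, learning rate, and iteration count are arbitrary illustrative choices, not any historical implementation:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# XOR: four examples a single-layer perceptron can never separate.
DATA = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
        ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]

random.seed(42)
H = 4                                                   # hidden neurons
w1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
b1 = [0.0] * H
w2 = [random.uniform(-1, 1) for _ in range(H)]
b2 = 0.0

def forward(x):
    # Step 1: forward pass -- weighted sums through an activation function.
    h = [sigmoid(b1[j] + sum(w1[j][i] * x[i] for i in range(2))) for j in range(H)]
    y = sigmoid(b2 + sum(w2[j] * h[j] for j in range(H)))
    return h, y

def mean_loss():
    return sum((forward(x)[1] - t) ** 2 for x, t in DATA) / len(DATA)

lr = 0.5
before = mean_loss()
for _ in range(10000):                                  # Step 5: repeat
    for x, t in DATA:
        h, y = forward(x)
        err = y - t                                     # Step 2: measure the error
        dy = err * y * (1 - y)                          # Step 3: chain rule, output layer
        for j in range(H):
            dh = dy * w2[j] * h[j] * (1 - h[j])         # Step 3: chain rule, hidden layer
            w2[j] -= lr * dy * h[j]                     # Step 4: nudge weights downhill
            b1[j] -= lr * dh
            for i in range(2):
                w1[j][i] -= lr * dh * x[i]
        b2 -= lr * dy
after = mean_loss()
```

Each pass nudges every weight a small step in the direction that lowers the squared error; over thousands of repetitions the loss falls steadily toward zero.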

SPECTACLES: 60 million knobs. Calculus turns each one.

A neural network with 60 million weights is like a radio with 60 million knobs. Backpropagation tells you which direction to turn each knob to make the music sound a little better. No human could tune 60 million knobs by hand. But calculus can.

Language Alert: Throughout this issue, we say the network "learns" or "sees" -- but these are metaphors. Neural networks find statistical patterns in data. They do not understand meaning the way you do. When a network classifies an image as "cat," it has found pixel patterns that correlate with that label. It has never touched fur or heard a purr.

NEURONBOT: Multiply, add, decide. Repeat millions of times.
Backpropagation worked. Multi-layer networks could learn. But the excitement was about to collide with a harsh reality: the second AI winter. For nearly twenty years, only a handful of researchers kept the flame alive. ▶
p4_01_desert_march
NAVIGATOR: AI winter hit. Three researchers refused to quit.

The second AI winter (roughly 1987-1993) was triggered by the collapse of expert systems. Companies had spent millions on rule-based AI that proved brittle and expensive. Japan's Fifth Generation Computer Project failed. DARPA wound down its AI funding. The word "AI" itself became toxic in grant applications.

Neural network researchers had it even worse. The dominant approach became Support Vector Machines -- mathematically elegant, with strong guarantees. Next to SVMs, neural networks looked sloppy.

p4_04_hinton
Geoffrey Hinton
b. 1947, London
p4_05_lecun
Yann LeCun
b. 1960, Paris

Hinton -- a descendant of logician George Boole -- spent decades at the University of Toronto. He works standing up due to chronic back problems. "The brain doesn't use logical rules. Why should AI?"

LeCun developed convolutional neural networks (CNNs) at Bell Labs. By the late 1990s, his LeNet-5 was reading millions of handwritten checks at banks across America. A working, deployed neural network. And still, the field didn't notice.

p4_07_bengio
Yoshua Bengio
b. 1964, Paris

Bengio, at the Université de Montréal, published "A Neural Probabilistic Language Model" in 2003 -- foundational for GPT and all later language models. He chose to stay in academia: "fundamental AI research should stay in public institutions."

Many others contributed: Schmidhuber & Hochreiter (LSTM, 1997), Fukushima (Neocognitron, 1980), Hopfield (1982), Werbos (1974). The field was built by a community.

The "overnight breakthrough" of 2012 was built on decades of persistence. Hinton started in the 1970s. LeCun had working CNNs in the 1980s. Bengio published foundational language work in 2003. In 2018, all three received the ACM Turing Award. In 2024, Hinton (with John Hopfield) received the Nobel Prize in Physics.

Three researchers kept the algorithm alive. But an algorithm needs data -- vast amounts of it. The kind of data deep learning needed did not exist yet. It would take a junior professor with an unconventional idea. ▶
p5_01_feifei_office
LEAFBOT: Fei-Fei Li bet everything on a dataset.

Fei-Fei Li, then a new professor at Princeton, looked at the problem differently. She was inspired by how children learn: not by rules, but through massive exposure to examples.

Her insight: the field was spending too much effort on algorithms and not enough on data. Colleagues told her it was "a waste of time." Real research meant new algorithms, not data collection.

p5_04_immigration_story

Li had immigrated from Beijing as a teenager. Her parents ran a dry cleaning business in New Jersey. She worked multiple jobs while studying at Princeton. The dataset idea was dismissed by colleagues -- but she pushed forward anyway.

[Diagram: ImageNet -- The Dataset That Changed Everything. 14+ million images in 21,000+ categories, labeled by workers worldwide via Amazon Mechanical Turk: tabby cat, golden retriever, school bus, acoustic guitar (1,300+ images each) ... and 20,996 more categories. FEI-FEI LI'S CORE INSIGHT: "Stop writing rules for what a cat looks like. Show the machine MILLIONS of cats. Let it figure out the rules on its own."]

ImageNet: 14 million images, 21,000 categories -- the fuel for deep learning.

The result was ImageNet: over 14 million labeled images in 21,000+ categories. To label at this scale, Li's team used Amazon Mechanical Turk -- paying workers fractions of a cent per label. The project took three years.

In 2010, she launched the ILSVRC competition: 1,000 categories, 1.2 million training images. The first two years saw modest progress. Traditional methods hit ~25-28% top-5 error.
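"Top-5 error" is the competition's metric: the model makes five guesses per image and is scored wrong only if none of the five is the true label. A minimal scorer in Python -- the category names and scores below are invented for illustration:

```python
def top5_correct(scores, true_label):
    # ILSVRC scoring: correct if the true label appears among the
    # five highest-scoring categories.
    top5 = sorted(scores, key=scores.get, reverse=True)[:5]
    return true_label in top5

# Invented scores over a handful of the 1,000 ILSVRC categories.
scores = {"tabby cat": 0.31, "tiger cat": 0.24, "lynx": 0.14,
          "fox": 0.11, "dog": 0.09, "school bus": 0.02}
```

Here "lynx" counts as correct (it is among the top five) while "school bus" does not -- a deliberately forgiving metric for a 1,000-way problem full of near-identical categories.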

[Chart: ILSVRC Competition -- Top-5 Error Rate. 2010: ~28%. 2011: ~26%. 2012: ~15.3% -- a ~10.8-point gap (unprecedented).]

Then came 2012. Something extraordinary happened.

The ILSVRC competition: steady progress, then AlexNet shattered every record.

Fei-Fei Li's ImageNet proved that in machine learning, data can be more important than algorithms. A good algorithm with insufficient data learns nothing. A reasonable algorithm with millions of labeled examples can learn to see.

Think About It: ImageNet was not perfect. The dataset reflected biases of its web sources -- certain regions were underrepresented, and some labels carried cultural assumptions. Li has since spoken openly about these issues. Even "letting the data speak" requires careful attention to whose voices the data represents. Bias in training data becomes bias in the model.

WATCHBOT: Bad data in, bad decisions out. Bias is real.
The dataset was ready. The algorithms existed. But what happened in 2012 was not a gradual improvement. It was an earthquake. A professor, a postdoc, and a grad student -- with two video game graphics cards -- were about to obliterate the competition. ▶
p6_01_conference_reveal
September 2012. The ImageNet results are in.
BRONZEBOT: Toronto's entry: 15.3% error. Everyone else: ~26%.

AlexNet was a deep convolutional neural network designed by Alex Krizhevsky, with his supervisor Geoffrey Hinton and fellow student Ilya Sutskever at the University of Toronto. The architecture was not entirely new — it was a CNN, the same basic idea LeCun had pioneered. But AlexNet was dramatically larger.

p6_04_relu_scene

AlexNet's secret weapons:

ReLU activation — if the input is positive, pass it through; if negative, output zero. This mitigated the vanishing gradient problem — in deep networks, the learning signal weakened as it passed back through many layers, like a message garbled in a game of telephone. ReLU kept the signal strong.

Dropout — randomly switching off neurons during training, forcing robust learning.
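Both tricks fit in a few lines. A sketch in Python -- the activation values and the 50% drop rate are illustrative, and the rescaling is the "inverted dropout" form commonly used today rather than the exact 2012 recipe:

```python
import random

def relu(z):
    # ReLU: positive inputs pass through unchanged; negatives become zero.
    return z if z > 0 else 0.0

def dropout(activations, p, rng):
    # Each neuron is switched off with probability p during training;
    # survivors are scaled by 1/(1-p) so the expected sum is unchanged.
    return [0.0 if rng.random() < p else a / (1 - p) for a in activations]

acts = [relu(z) for z in [-2.0, -0.5, 0.0, 0.5, 3.0]]  # -> [0.0, 0.0, 0.0, 0.5, 3.0]
dropped = dropout(acts, p=0.5, rng=random.Random(0))
```

At inference time dropout is simply switched off; the 1/(1-p) scaling during training is what makes that valid.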

NARRATOR: Two GPUs. Cost: $500 each. Changed the world.

Here is one of the great ironies of technology history. AlexNet was trained on two NVIDIA GTX 580 graphics cards, each with only 3 GB of memory. Graphics cards designed for rendering explosions in video games turned out to be perfectly suited for neural networks. The thousands of parallel cores that computed pixel colors simultaneously could just as easily compute neuron activations.

[Chart: ILSVRC Results — The Deep Learning Takeover. 2010: ~28%. 2011: ~26%. 2012: 15.3% (AlexNet!). 2014: 6.7%. 2015: 3.6%, beating the human baseline of ~5.1%. Every winner after 2012 used deep neural networks. By 2015, machines beat humans.]

The deep learning takeover: from 28% error (2010) to beating humans (2015).

AlexNet was not one breakthrough. It was the convergence of three things: an algorithm that existed since the 1980s (CNNs and backpropagation), a dataset nobody had bothered to build until Fei-Fei Li did (ImageNet), and hardware never designed for AI (GPUs for video games). Breakthroughs often happen not when something new is invented, but when existing pieces finally come together.

AlexNet proved deep learning worked. But WHY did it need GPUs? What is it about neural network training that makes regular CPUs too slow? The answer involves an accidental gift from the video game industry. ▶
p7_01_gpu_split
Same hardware. Two completely different purposes.
TAPEBOT: AI's best hardware? Built for video games.

A CPU is like one brilliant mathematician — it solves almost any problem, but works through tasks mostly one at a time. A GPU is like an army of thousands of simpler calculators, all working simultaneously. Rendering a video game frame means computing millions of pixels in parallel — and neural network training involves the exact same kind of work.

[Diagram: CPU vs GPU — Why GPUs Revolutionized AI. CPU (Central Processing Unit): 4-16 powerful cores, like ONE brilliant mathematician. GPU (Graphics Processing Unit): thousands of simple cores, like an ARMY of calculators. Neural network training: each layer is one giant matrix multiplication, millions of multiply-and-add operations, perfect for GPU parallelism. CPU: weeks to train. GPU: ~5 days. Speedup: 10x to 70x depending on the task.]

CPUs: a few powerful cores. GPUs: thousands of simple cores working in parallel.
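The connection between graphics and neural networks is concrete: one dense layer applied to a whole batch of inputs is a single matrix multiplication. In plain Python (the numbers are arbitrary), every cell of the result is independent of every other -- exactly the work a GPU spreads across its cores:

```python
def matmul(A, B):
    # One neural-network layer is exactly this: a batch of inputs (rows of A)
    # times a weight matrix (B). Every output cell is an independent
    # multiply-and-add -- a CPU computes them one by one, a GPU all at once.
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

X = [[1.0, 2.0],          # a "batch" of two inputs
     [3.0, 4.0]]
W = [[0.5, -1.0, 0.0],    # weights: 2 inputs -> 3 neurons
     [0.25, 0.0, 1.0]]
Y = matmul(X, W)          # activations for every input and every neuron at once
```

A CPU walks through these cells mostly one at a time; a GPU hands each cell to its own core, which is where the 10x-70x speedups come from.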

p7_05_cuda_scene

NVIDIA released CUDA in 2007 — a programming platform for general-purpose computing on GPUs. Andrew Ng's group at Stanford showed GPUs could accelerate deep learning by 10 to 70 times. Ng co-founded Google Brain in 2011, which famously trained a network on YouTube thumbnails — it learned to detect cat-like patterns without ever being labeled "cat."

Without cheap, powerful GPUs, AlexNet could not have been trained in a reasonable time. NVIDIA CEO Jensen Huang, recognizing the opportunity, pivoted the company's entire strategy toward AI. A graphics card company became one of the most valuable companies on Earth.

GPUBOT: GPUs: thousands of workers computing in parallel.

AlexNet trained in less than a week on two GPUs. But as models grew, so did their energy demands. Training GPT-3 consumed an estimated 1,287 MWh of electricity — roughly what 120 US homes use in a year. The deep learning revolution brought incredible capabilities, but also real environmental costs.

SPROUT: Power brings responsibility.
GPUs provided the compute. ImageNet provided the data. Backpropagation provided the algorithm. But what exactly does a neural network learn inside those hidden layers? The answer is one of the most beautiful results in computer science. ▶
p8_01_feature_hierarchy
What does a neural network see?
PIXEL: Nobody programmed this. It found it alone.

Feature visualization reveals what each layer of a trained network has learned to respond to. In a convolutional neural network, the layers form a natural hierarchy — simple features combining into complex ones, all discovered automatically through backpropagation.

[Diagram: What Each Layer of a CNN Learns. INPUT (raw pixels of a photo) → LAYER 1 (edges, lines, color blobs) → LAYER 2 (corners, textures, fur) → LAYER 3+ (eyes, ears, paws, noses) → OUTPUT (cat faces, whole cats: "cat: 97%"). Simple → complex. Nobody programmed this hierarchy. The network DISCOVERED it through backpropagation. This is remarkably similar to how neuroscientists believe the visual cortex works.]

Edges become textures become parts become objects — each layer builds on the last.

p8_05_conv_filter_scene

A CNN uses two elegant tricks. Convolutional filters — small sliding windows that scan across the image, so the same edge detector works everywhere. Pooling — periodically shrinking the image to focus on "what" is there rather than "exactly where." This architecture was inspired by Fukushima's Neocognitron (1980) and refined by LeCun at Bell Labs.
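Both tricks can be sketched in a dozen lines of Python. The vertical-edge filter below is hand-written for illustration -- in a real CNN, the filter weights are learned by backpropagation:

```python
def convolve2d(image, kernel):
    # Slide a small filter across the image; the SAME weights scan every
    # position, so an edge detector learned anywhere works everywhere.
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)] for i in range(out_h)]

def max_pool(fmap, size=2):
    # Keep the strongest response in each block: "what" is there,
    # not "exactly where".
    return [[max(fmap[i + di][j + dj] for di in range(size) for dj in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

# A 4x4 image with a vertical dark-to-bright boundary, and an edge filter.
image = [[0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9]]
edge = [[-1, 1],
        [-1, 1]]
fmap = convolve2d(image, edge)   # strong response only along the boundary
pooled = max_pool(fmap)
```

The filter responds strongly only at the boundary between dark and bright columns, and pooling keeps that strong response while discarding its exact position.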

NARRATOR: The network arrived at the same solution as the brain.

A deep neural network builds understanding from the bottom up: edges become textures, textures become parts, parts become objects. Each layer is an abstraction built on the layer below — the same principle we have seen throughout this entire series, from transistors to operating systems to programming languages. Deep learning discovered the power of abstraction on its own.

Remember our Language Alert from Page 3: when a CNN classifies an image as "cat" with 97% confidence, it has found statistical patterns in pixels that correlate with the label. It has never touched fur or heard a purr. Whether that counts as "seeing" is a question for Page 10.

p8_10_closing_scene
Neural networks could see. They could classify images better than humans. But vision is just one sense. The really hard problem was language. Could a machine learn to read, write, and reason with words? ▶
p9_01_convergence_timeline
SCHOLAR: Every piece built on what came before.

The deep learning revolution of the 2010s was not a single invention. It was a convergence — and every piece connects to the story we have been telling since Issue 1.

[Diagram: The Convergence — Every Layer Enabled the Next. Issue 1, THEORY: Turing, "Can machines think?" Issue 2, HARDWARE: von Neumann, transistors. Issue 3, LANGUAGES: LISP, compilers, abstraction. Issue 4, SYSTEMS: Unix, Bell Labs, CNNs. Issue 5, DATA: web, open source, ImageNet. Issue 6, LEARNING: neural nets + data + GPUs. Issue 7: ??? Deep learning + language. Milestones: Turing (1950): "Why not produce a programme that simulates the child's mind?" McCulloch & Pitts (1943): first mathematical neuron model. LeCun at Bell Labs (1990s): practical CNNs reading bank checks. ImageNet from the web (2009): 14M images from the connected world. Every breakthrough stood on the shoulders of everything that came before.]

From Turing's question to deep learning — each issue built the foundations for the next.

The connections extend forward too. Ilya Sutskever, co-author of the AlexNet paper, went on to co-found OpenAI. Bengio's 2003 neural language model was a direct precursor to GPT. The path from "a network that can recognize cats" to "a network that can write poetry" was shorter than anyone expected.

p9_06_paradigm_shift

Machine learning inverts the fundamental relationship between humans and computers. For sixty years, humans told machines exactly what to do, step by step. Machine learning says: here are a million examples — figure out the pattern yourself. This is not just a new technique. It is a new paradigm.

JELLYBOT: From rules to learning. The biggest shift since Turing.
Neural networks learned to see. They surpassed humans at image recognition. But vision is only one kind of intelligence. Could a machine learn to read, write, and reason with words? The answer would come from a paper with one of the boldest titles in scientific history... ▶
p10_01_vision_to_language
By 2015, deep learning had conquered vision. But language was a different beast.

ResNet — a network with 152 layers using "skip connections" — achieved approximately 3.57% top-5 error on ImageNet, surpassing trained human evaluators at ~5.1%. Vision was solved. But language was sequential, contextual, and full of ambiguity. The word "bank" means something different in "river bank" and "bank account."

p10_03_rnn_struggle

Recurrent neural networks (RNNs) and LSTMs could process text one word at a time, maintaining a kind of memory. But they were slow, and they struggled with long passages. By the end of a paragraph, the network had often "forgotten" the beginning. Language AI needed something fundamentally new.

CLOCKWORK: They learned to see. Next: what about language?

In 2017, a team of eight researchers at Google published a paper with one of the most confident titles in scientific history: "Attention Is All You Need." What they described — the Transformer — would prove to be exactly as important as the title claimed.

[Diagram: The Scoreboard After Issue 6. CONQUERED (2012-2015): image recognition (beats humans), handwriting recognition, speech recognition (improved), game playing (Atari, 2013), medical image analysis. UNSOLVED: language understanding, text generation, reasoning and conversation, code generation, translation at human quality. NEXT: Issue 7 — "Attention Is All You Need." A single paper. Eight authors. The Transformer. The birth of the large language model.]

Deep learning conquered vision. Language remained the frontier.

Machine learning is not just a technique. It is a philosophical revolution. For all of computing history, humans translated their understanding into explicit instructions. Machine learning asks: what if the machine could develop its own understanding, directly from experience? This question — first asked by Turing in 1950, pursued by Rosenblatt in 1958, kept alive through decades of doubt, and vindicated by AlexNet in 2012 — defines the future of computing.

Remember the Language Alert from Page 3? When a neural network classifies an image as "cat" with 97% confidence, does it understand what a cat is? It has never touched fur, heard a purr, or been scratched. It has found statistical patterns in pixels. Is that understanding — or something else entirely? This question matters more than ever as we enter the age of language models.

p10_10_closing_scene
Next issue: a paper called "Attention Is All You Need" introduces the Transformer — an architecture that lets neural networks process entire passages at once. Within six years, machines go from sentence completion to writing essays, generating code, and holding conversations. Issue 7: "Attention Is All You Need." ▶