In Issue 5, we watched the world get connected. The web, open source, Wikipedia, GitHub — by the mid-2000s, humanity had assembled the largest collection of knowledge ever created.
But computers still could not do something a toddler does effortlessly: look at a photograph and say “cat.”
For decades, researchers tried the obvious approach: write rules. Detect edges. Measure angles. Describe what a cat “looks like” in mathematical terms. The problem was that “what a cat looks like” is almost infinitely variable.
Children learn to recognize cats the same way they learn everything: by seeing thousands of examples. Nobody teaches a toddler that “a cat has triangular ears at a 47-degree angle.” The child just sees cats — hundreds of them — and the concept emerges.
What if machines could learn the same way?
The Rise and Fall of the Perceptron
Frank Rosenblatt was a psychologist and computer scientist at Cornell. In 1957–1958, funded by the US Navy, he built the Perceptron — the first machine that could genuinely learn from data.
The concept was deceptively simple: take inputs, multiply each by a weight, add them up, and check whether the sum exceeds a threshold. The breakthrough was that the weights were not set by a programmer. The machine adjusted them itself through learning.
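The whole mechanism fits in a few lines. This is a minimal sketch, not Rosenblatt's actual hardware: the AND task, the learning rate, and the starting weights are illustrative choices.

```python
# A minimal perceptron learning the logical AND function.
# Rosenblatt's rule: when the prediction is wrong, nudge each weight
# by (error * input). No programmer ever sets the weights directly.

def predict(weights, bias, inputs):
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0

# Training data: (inputs, target) pairs for AND
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias, rate = [0, 0], 0, 1
for epoch in range(20):
    for inputs, target in data:
        error = target - predict(weights, bias, inputs)
        # The machine adjusts its own weights through learning.
        weights = [w + rate * error * x for w, x in zip(weights, inputs)]
        bias += rate * error

print([predict(weights, bias, x) for x, _ in data])  # [0, 0, 0, 1]
```

Because AND is linearly separable, this loop is guaranteed to converge. Swap the targets for XOR ([0, 1, 1, 0]) and it never will, which is exactly the limit Minsky and Papert proved.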
On July 8, 1958, the New York Times reported: “New Navy Device Learns by Doing.” The hype was enormous.
Then in 1969, Marvin Minsky and Seymour Papert — both at MIT — published Perceptrons, proving that a single-layer perceptron could not compute certain important functions like XOR. The field confused “this simple version does not work” with “the whole approach is hopeless.” Funding evaporated. Rosenblatt died in a boating accident in 1971, at 43, never having the chance to respond.
The Perceptron was the seed of modern AI: a machine that learns from examples instead of following hand-written rules. Minsky and Papert proved its limits were real — but they were the limits of a single layer, not of the idea itself. That confusion cost two decades.
Learning by Flowing Downhill
The 1986 Nature paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams demonstrated that backpropagation could train multi-layer neural networks — the very thing Minsky and Papert had been skeptical about.
The mathematical idea had appeared earlier — Paul Werbos described it in his 1974 PhD thesis, and Seppo Linnainmaa published the reverse mode of automatic differentiation in 1970. But Rumelhart, Hinton, and Williams showed convincingly that it worked.
Think About It: A neural network with 60 million weights is like a radio with 60 million knobs. Backpropagation tells you which direction to turn each knob to make the music sound a little better. No human could tune 60 million knobs by hand. But calculus can — one tiny adjustment at a time, repeated millions of times.
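The knob-turning idea can be shown in miniature. This sketch assumes a toy two-knob model (fitting the line y = 3x + 2) rather than a real network; backpropagation is this same downhill recipe, extended through many layers of knobs by the chain rule.

```python
# Two "knobs" (slope w, offset b), a loss measuring how wrong we are,
# and calculus telling us which way to nudge each knob.
# Data, learning rate, and step count are illustrative.

xs = [0.0, 1.0, 2.0, 3.0]
ys = [3 * x + 2 for x in xs]          # the "music" we want: y = 3x + 2

w, b, rate = 0.0, 0.0, 0.05
for step in range(2000):
    # Gradients of the mean squared error L = (1/n) * sum((w*x + b - y)^2)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= rate * grad_w                # turn each knob slightly downhill
    b -= rate * grad_b

print(round(w, 2), round(b, 2))       # approaches 3.0 and 2.0
```

Each step makes the output a little less wrong; repeated thousands of times, the knobs settle at the right values. A 60-million-weight network differs only in scale.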
Crossing the Desert
The second AI winter (roughly 1987–1993) was triggered by the collapse of expert systems. Japan’s Fifth Generation Project failed. DARPA wound down its AI funding. The word “AI” became toxic in grant applications.
The dominant approach in the 1990s was Support Vector Machines — mathematically elegant, with guaranteed solutions. Next to SVMs, neural networks looked unprincipled.
Through all of this, three researchers refused to abandon neural networks:
Geoffrey Hinton, reportedly a descendant of logician George Boole, has chronic back problems and works standing up. He kept publishing on neural networks when almost nobody would cite them.
Yann LeCun developed convolutional neural networks at Bell Labs. His LeNet-5 read handwritten checks for US banks — a working, deployed neural network. The broader field barely noticed.
Yoshua Bengio published a foundational 2003 paper on neural language models — a direct precursor to GPT. He chose to stay in academia, saying fundamental AI research should remain in public institutions.
They were not alone — Jürgen Schmidhuber and Sepp Hochreiter invented the LSTM in 1997, and John Hopfield had reignited interest in neural networks in 1982. But these three, through Canada’s CIFAR program, built the community that led the revival.
Tragically, Rumelhart developed a neurodegenerative disease in 1998 and died in 2011 — he never lived to see the deep learning revolution his work made possible.
The deep learning revolution was not an overnight success. It was a thirty-five-year vigil. The “overnight breakthrough” of 2012 was built on decades of persistence by researchers the establishment dismissed.
Fourteen Million Images
Fei-Fei Li looked at the computer vision problem differently. She was inspired by how children learn: through massive exposure to visual examples. Her insight: the field was spending too much effort on algorithms and not enough on data.
Colleagues told her it was “a waste of time.” She pushed forward anyway.
The result was ImageNet: over 14 million labeled images organized into more than 21,000 categories. Li’s team used Amazon Mechanical Turk, paying workers around the world fractions of a cent per label. That labeling work would later raise important questions about the invisible human labor behind AI.
ImageNet also taught a harder lesson: massive datasets reflect the biases of the world they come from. If training images underrepresent certain regions, the AI inherits those biases.
In 2010, she launched the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an annual competition. The first two years saw modest progress with traditional methods: top-5 error (the rate at which none of a system’s five best guesses is correct) hovered around 25–28%.
Then came 2012.
Fei-Fei Li’s ImageNet proved that in machine learning, data can be more important than algorithms. A good algorithm with insufficient data learns nothing. A reasonable algorithm with millions of labeled examples can learn to see.
The AlexNet Earthquake
AlexNet was a deep convolutional neural network designed by Alex Krizhevsky, with his supervisor Geoffrey Hinton and fellow student Ilya Sutskever at the University of Toronto.
Deep networks had been hard to train because of the vanishing gradient problem: error signals weakened as they traveled backward through many layers. AlexNet used ReLU activation, which kept the signal strong, along with dropout (preventing memorization), and — crucially — was trained on two NVIDIA GTX 580 GPUs designed for video games.
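The vanishing gradient problem reduces to a few lines of arithmetic. This sketch assumes an illustrative 20-layer depth and compares the best-case slope each activation passes backward.

```python
# Backpropagated error is multiplied by each layer's activation slope.
# Sigmoid's slope peaks at 0.25, so the product shrinks exponentially
# with depth; ReLU's slope is exactly 1 for active units.
import math

def sigmoid_slope(x):
    s = 1 / (1 + math.exp(-x))
    return s * (1 - s)              # maximum value 0.25, at x = 0

relu_slope = 1.0                    # for any active (positive) input

layers = 20
sigmoid_signal = sigmoid_slope(0.0) ** layers   # sigmoid's best case
relu_signal = relu_slope ** layers

print(sigmoid_signal)   # ~9.1e-13: the error signal has vanished
print(relu_signal)      # 1.0: the signal is intact
```

Even in the best case, twenty sigmoid layers shrink the error signal by a factor of a trillion; ReLU passes it through untouched.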
AlexNet had approximately 60 million parameters and trained for about 5–6 days on two GPUs. It won the 2012 ILSVRC with a top-5 error of 15.3%; the runner-up scored 26.2%. After AlexNet, every competitive ILSVRC entry used deep neural networks. By 2015, deep learning had surpassed human-level performance on the benchmark.
AlexNet was not one breakthrough. It was the convergence of three things: an algorithm from the 1980s (CNNs and backpropagation), a dataset nobody had bothered to build until Fei-Fei Li did (ImageNet), and hardware never designed for AI (GPUs for video games). Breakthroughs often happen when existing pieces finally come together.
The Accidental AI Hardware
NVIDIA released CUDA in 2007, enabling general-purpose computing on GPUs. Andrew Ng’s group at Stanford (2009) showed GPUs could accelerate deep learning by 10–70x. The Google Brain project (2011) trained a network on YouTube frames that spontaneously learned to detect cats through unsupervised learning — it was not specifically trained to find cats, but the ability emerged on its own.
NVIDIA CEO Jensen Huang pivoted the company’s strategy toward AI. A graphics card company became one of the most valuable companies on Earth.
Think About It: The GPU was never designed for AI. It was designed for video games. A technology built for entertainment accidentally enabled one of the most important scientific revolutions of the century. Can you think of other technologies invented for one purpose but transformative for a completely different one?
What Neural Networks See
A CNN recognizes objects wherever they appear in an image through two elegant design choices: convolutional filters — small sliding windows that scan across the image — and pooling — periodically shrinking the feature map to focus on “what” is there rather than “exactly where.”
This architecture was inspired by Kunihiko Fukushima’s Neocognitron (1980) and by David Hubel and Torsten Wiesel’s Nobel Prize-winning research on how the visual cortex processes information.
A deep neural network builds understanding from the bottom up: edges become textures, textures become parts, parts become objects. Each layer is an abstraction built on the layer below — the same principle we have seen throughout this series, from transistors to operating systems to programming languages. Deep learning discovered the power of abstraction on its own.
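The two building blocks above can be sketched directly. The tiny image, the filter values, and the sizes here are illustrative, not taken from any real network.

```python
# A sliding filter (convolution) and max pooling, in plain Python.

def convolve(image, kernel):
    """Slide a small kernel over the image ("valid" positions only)."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

def max_pool(image, size=2):
    """Keep the strongest response in each size-by-size block:
    'what' is there matters more than 'exactly where'."""
    out = []
    for i in range(0, len(image) - size + 1, size):
        out.append([max(image[i + a][j + b]
                        for a in range(size) for b in range(size))
                    for j in range(0, len(image[0]) - size + 1, size)])
    return out

# A 6x6 "image" with a vertical bright stripe, and a vertical-edge filter.
img = [[0, 0, 1, 1, 0, 0] for _ in range(6)]
edge = [[-1, 0, 1],
        [-1, 0, 1],
        [-1, 0, 1]]

feature_map = convolve(img, edge)   # strong responses at the stripe's edges
pooled = max_pool(feature_map)      # same pattern, half the resolution
```

The filter fires wherever brightness changes from left to right, and pooling keeps that detection while discarding its exact position. Stack many such layers and the edge detectors below feed texture and part detectors above.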
The Foundations Beneath the Revolution
Machine learning inverts the fundamental relationship between humans and computers. For sixty years, humans told machines exactly what to do. Machine learning says: here are a million examples of the right answer — figure out the pattern yourself. This is not just a new technique. It is a new paradigm.
Can They Read? Can They Write?
By 2015, deep learning had conquered computer vision. ResNet — 152 layers with “skip connections” — achieved ~3.57% top-5 error on ImageNet, surpassing trained human evaluators at ~5.1%.
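The skip-connection idea can be sketched in a line. The stand-in transform and values here are illustrative, not ResNet's actual convolutional block.

```python
# A residual block adds its input back to its output, so a layer only
# has to learn a correction, and the error signal always has a direct
# path backward through the addition.

def residual_block(x, transform):
    return x + transform(x)     # output = input + learned correction

# Even if every learned transform does nothing yet (returns 0), the
# signal passes through all 152 layers unchanged - depth can't hurt it.
x = 1.0
for _ in range(152):
    x = residual_block(x, lambda v: 0.0)
print(x)                        # still 1.0
```

This is why 152 layers became trainable: a plain stack of layers must learn to preserve its input; a residual stack preserves it for free.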
But language was a different beast. Images are grids of numbers. Language is sequential, contextual, and ambiguous. The word “bank” means something different in “river bank” and “bank account.” The best tool for sequential data was LSTM (Long Short-Term Memory), invented by Sepp Hochreiter and Jürgen Schmidhuber in 1997. LSTMs could remember context over long sequences — but they processed words one at a time, which was slow.
In 2017, a team of eight researchers at Google published a paper with one of the most confident titles in scientific history: “Attention Is All You Need.” What they described — the Transformer — was an architecture for machine translation. But it would soon be adapted to power systems that write essays, generate code, and hold conversations that feel almost human.
Machine learning is not just a technique. It is a philosophical revolution. For all of computing history, humans translated their understanding into explicit instructions. Machine learning asks: what if the machine could develop its own understanding, directly from experience? This question — first asked by Turing in 1950, kept alive through decades of doubt, and vindicated by AlexNet in 2012 — defines the future of computing.
Think About It: When a neural network classifies an image as “cat” with 97% confidence, does it “understand” what a cat is? It has never touched fur, heard a purr, or been scratched. It has found statistical patterns in pixels. Is that understanding — or something else entirely? Where is the line between pattern matching and knowledge?
Next Issue: A paper called “Attention Is All You Need” introduces the Transformer — an architecture that lets neural networks process entire passages at once. Within six years, machines go from stuttering sentence completion to writing essays, generating code, and holding conversations. The age of the large language model begins.
Issue 7: “Attention Is All You Need” →