From Turing to LLMs and Beyond · Issue 6 of 10
Issue 6 · 2000s–2015

Machines That Learn

← Previous Issue: Connecting Everything

Quick. Describe a cat. Not one specific cat — ALL cats. Every breed, every color, every angle. Now write that as precise rules a computer can follow. "Has pointy ears"? Some cats have floppy ears. "Has fur"? Sphynx cats are hairless. For sixty years, nobody could solve this. — Tera

In Issue 5, we watched the world get connected. The web, open source, Wikipedia, GitHub — by the mid-2000s, humanity had assembled the largest collection of knowledge ever created.

But computers still could not do something a toddler does effortlessly: look at a photograph and say “cat.”

For decades, researchers tried the obvious approach: write rules. Detect edges. Measure angles. Describe what a cat “looks like” in mathematical terms. The problem was that “what a cat looks like” is almost infinitely variable.

Traditional Programming: Writing Rules by Hand
Photo of animal → IF pointy_ears AND whiskers AND fur AND tail AND ... → "cat" or "not cat"
But what about...
- a hairless cat (no fur!)
- a Scottish Fold (floppy ears!)
- a cat in a box (only the head visible!)
- a blurry night photo (barely visible!)
Every rule you write has exceptions. The world is too variable.

Children learn to recognize cats the same way they learn everything: by seeing thousands of examples. Nobody teaches a toddler that “a cat has triangular ears at a 47-degree angle.” The child just sees cats — hundreds of them — and the concept emerges.

What if machines could learn the same way?

The Rise and Fall of the Perceptron

Meet Frank Rosenblatt. In 1958, he built a machine that could learn from examples. The Navy predicted it would one day "walk, talk, see, and be conscious." And then a single book nearly killed the entire idea for a generation. — Tera

Frank Rosenblatt was a psychologist and computer scientist at Cornell. In 1957–1958, funded by the US Navy, he built the Perceptron — the first machine that could genuinely learn from data.

The concept was deceptively simple: take inputs, multiply each by a weight, add them up, and check whether the sum exceeds a threshold. The breakthrough was that the weights were not set by a programmer. The machine adjusted them itself through learning.

How a Perceptron Works (1958): inputs x₁, x₂, x₃ are multiplied by weights w₁, w₂, w₃ and summed (Σ); if the sum exceeds threshold T, the output is YES, otherwise NO. Key insight: the weights are LEARNED, not programmed.
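The loop of "predict, compare, nudge the weights" can be sketched in a few lines. This is a minimal illustration of Rosenblatt's learning rule, not his hardware; the AND function and the learning rate are illustrative choices, not from the issue.

```python
# A minimal perceptron: weighted sum, threshold, and weight updates
# driven by mistakes. No rule for the task is ever written down.

def train_perceptron(examples, epochs=10, lr=0.1):
    w = [0.0, 0.0]   # learned weights
    b = 0.0          # learned threshold, expressed as a bias
    for _ in range(epochs):
        for x, target in examples:
            # Predict: does the weighted sum exceed the threshold?
            pred = 1 if x[0] * w[0] + x[1] * w[1] + b > 0 else 0
            # Learn: nudge each weight in proportion to the error.
            err = target - pred
            w[0] += lr * err * x[0]
            w[1] += lr * err * x[1]
            b += lr * err
    return w, b

# Learn the AND function purely from labeled examples.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(data)
print([1 if x[0] * w[0] + x[1] * w[1] + b > 0 else 0
       for x, _ in data])  # → [0, 0, 0, 1]
```

The programmer supplies only examples; the weights that encode "AND" emerge from the updates.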

On July 8, 1958, the New York Times reported: “New Navy Device Learns by Doing.” The hype was enormous.

Then in 1969, Marvin Minsky and Seymour Papert — both at MIT — published Perceptrons, proving that a single-layer perceptron could not compute certain important functions like XOR. The field confused “this simple version does not work” with “the whole approach is hopeless.” Funding evaporated. Rosenblatt died in a boating accident in 1971, at 43, never having the chance to respond.

The Perceptron was the seed of modern AI: a machine that learns from examples instead of following hand-written rules. Minsky and Papert proved its limits were real — but they were the limits of a single layer, not of the idea itself. That confusion cost two decades.
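Minsky and Papert's XOR result can be checked by brute force: a single-layer perceptron draws one straight line through the input plane, and no line puts XOR's two 1-outputs on one side and its two 0-outputs on the other. The grid search below is a sketch, not their proof, and the grid bounds are an arbitrary choice.

```python
# No choice of two weights and a threshold computes XOR with a single
# weighted sum — the four points are not linearly separable.
import itertools

xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def separates(w1, w2, t):
    # True only if "sum > threshold" matches XOR on all four inputs.
    return all((w1 * x1 + w2 * x2 > t) == bool(y)
               for (x1, x2), y in xor.items())

grid = [i / 4 for i in range(-8, 9)]  # weights/thresholds in [-2, 2]
found = any(separates(w1, w2, t)
            for w1, w2, t in itertools.product(grid, repeat=3))
print(found)  # → False
```

Stack a second layer, though, and XOR becomes easy — which is exactly the multi-layer escape route the field would rediscover in 1986.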

Learning by Flowing Downhill

Minsky and Papert proved single-layer networks were limited, and most people gave up on ALL neural networks. In 1986, three researchers proved the skeptics wrong. Imagine you are lost in mountains in thick fog. You CAN feel which way the ground slopes. So you step downhill. That is backpropagation. — Tera

The 1986 Nature paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams demonstrated that backpropagation could train multi-layer neural networks — the very thing Minsky and Papert had been skeptical about.

The mathematical idea had appeared earlier — Paul Werbos described it in his 1974 PhD thesis, and Seppo Linnainmaa published automatic differentiation in 1970. But Rumelhart, Hinton, and Williams showed convincingly that it worked.

Think About It: A neural network with 60 million weights is like a radio with 60 million knobs. Backpropagation tells you which direction to turn each knob to make the music sound a little better. No human could tune 60 million knobs by hand. But calculus can — one tiny adjustment at a time, repeated millions of times.

Crossing the Desert

The second AI winter hit in the late 1980s. Expert systems crashed. Neural networks were still tainted by the 1969 critique. Working on neural nets in the 1990s was a career risk. Most researchers abandoned the field. Three did not. — Tera

The second AI winter (roughly 1987–1993) was triggered by the collapse of expert systems. Japan’s Fifth Generation Project failed. DARPA wound down its AI funding. The word “AI” became toxic in grant applications.

The dominant approach in the 1990s was Support Vector Machines — mathematically elegant, with guaranteed solutions. Next to SVMs, neural networks looked unprincipled.

Through all of this, three researchers refused to abandon neural networks:

The Three Who Kept the Faith

Geoffrey Hinton (b. 1947, London) — U. of Toronto → Google. Backpropagation, Boltzmann machines, deep belief nets. “The brain doesn’t use logical rules. Why should AI?” Nobel Prize 2024.

Yann LeCun (b. 1960, Paris) — Bell Labs → NYU → Meta. Convolutional neural networks (CNNs), LeNet. LeNet-5 read 10%+ of US bank checks by the late 1990s. Name: Le Cun → LeCun.

Yoshua Bengio (b. 1964, Paris) — U. de Montréal → Mila. Neural language models, GANs (via student Goodfellow). Stayed in academia when industry came calling with millions. AI safety advocate.

2018: joint recipients of the ACM Turing Award. ~2004: rebranded “neural networks” as “deep learning” via the CIFAR program.

Hinton, reportedly a descendant of logician George Boole, has chronic back problems and works standing up. He kept publishing on neural networks when almost nobody would cite them.

LeCun developed convolutional neural networks at Bell Labs. His LeNet-5 read handwritten checks for US banks — a working, deployed neural network. The broader field barely noticed.

Bengio published a foundational 2003 paper on neural language models — a direct precursor to GPT. He chose to stay in academia, saying fundamental AI research should remain in public institutions.

They were not alone — Schmidhuber and Hochreiter invented LSTM in 1997, and John Hopfield had reignited interest in neural networks in 1982. But these three, through the CIFAR program, built the community that led the revival.

Tragically, Rumelhart developed a neurodegenerative disease in 1998 and died in 2011 — he never lived to see the deep learning revolution his work made possible.

The deep learning revolution was not an overnight success. It was a thirty-five-year vigil. The “overnight breakthrough” of 2012 was built on decades of persistence by researchers the establishment dismissed.

Fourteen Million Images

Meet Fei-Fei Li. Born in Beijing in 1976. Her family immigrated to the US when she was a teenager. She cleaned houses and worked at a restaurant while attending Princeton on scholarship. And she had an insight that some of the most powerful people in AI told her was a waste of time. — Tera

Fei-Fei Li looked at the computer vision problem differently. She was inspired by how children learn: through massive exposure to visual examples. Her insight: the field was spending too much effort on algorithms and not enough on data.

Colleagues told her it was “a waste of time.” She pushed forward anyway.

ImageNet — The Dataset That Changed Everything
- 14+ million labeled images
- 21,000+ categories
- 3 years to build
Example categories: tabby cat (1,300+ images), golden retriever, school bus, acoustic guitar ... and 20,996 more.
“We decided to let the data speak for itself.” — Fei-Fei Li

The result was ImageNet: over 14 million labeled images organized into more than 21,000 categories. Li’s team used Amazon Mechanical Turk, paying workers around the world fractions of a cent per label — invisible human labor that would later raise important questions about how AI systems get built.

ImageNet also taught a harder lesson: massive datasets reflect the biases of the world they come from. If training images underrepresent certain regions, the AI inherits those biases.

In 2010, she launched the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) — an annual competition. The first two years saw modest progress with traditional methods: top-5 error around 25–28%.
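"Top-5 error" — the ILSVRC metric — counts a prediction as correct if the true label appears anywhere in the model's five highest-scoring guesses. A hedged sketch with made-up class scores:

```python
# Top-5 error: fraction of images whose true label is NOT among
# the model's five top-scoring categories.

def top5_error(predictions, truths):
    """predictions: list of {label: score} dicts; truths: true labels."""
    misses = 0
    for scores, truth in zip(predictions, truths):
        top5 = sorted(scores, key=scores.get, reverse=True)[:5]
        if truth not in top5:
            misses += 1
    return misses / len(truths)

# Hypothetical scores over seven classes for one image:
scores = {"cat": 0.40, "dog": 0.20, "fox": 0.15, "car": 0.10,
          "bus": 0.08, "cup": 0.05, "pen": 0.02}
print(top5_error([scores], ["bus"]))  # → 0.0 ("bus" ranks 5th: a hit)
print(top5_error([scores], ["pen"]))  # → 1.0 ("pen" ranks 7th: a miss)
```

With 21,000+ categories, giving the model five guesses is a reasonable allowance — many categories are nearly indistinguishable breeds or models.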

Then came 2012.

Fei-Fei Li’s ImageNet proved that in machine learning, data can be more important than algorithms. A good algorithm with insufficient data learns nothing. A reasonable algorithm with millions of labeled examples can learn to see.

The AlexNet Earthquake

September 30, 2012. Every team using traditional methods scores around 25–26% error. Then the University of Toronto entry is revealed: 15.3%. A TEN-POINT gap. Like someone showing up to a bicycle race in a jet. It has been called a ‘Sputnik moment’ for AI. The revolution had begun. — Neuron

AlexNet was a deep convolutional neural network designed by Alex Krizhevsky, with his supervisor Geoffrey Hinton and fellow student Ilya Sutskever at the University of Toronto.

Deep networks had been hard to train because of the vanishing gradient problem: error signals weakened as they traveled backward through many layers. AlexNet used ReLU activation, which kept the signal strong, along with dropout (preventing memorization), and — crucially — was trained on two NVIDIA GTX 580 GPUs designed for video games.
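The vanishing-gradient problem, and why ReLU helped, can be seen with two derivatives. A sigmoid's slope never exceeds 0.25, so an error signal multiplied through many sigmoid layers shrinks toward zero; ReLU's slope is exactly 1 for positive inputs, so the signal survives. A toy sketch (20 layers, best-case slopes at each layer):

```python
# Multiply a gradient backward through 20 layers of each activation.
import math

def sigmoid_slope(x):
    s = 1 / (1 + math.exp(-x))
    return s * (1 - s)            # derivative of the sigmoid; max 0.25

def relu_slope(x):
    return 1.0 if x > 0 else 0.0  # derivative of max(0, x)

grad_sigmoid = 1.0
grad_relu = 1.0
for _ in range(20):
    grad_sigmoid *= sigmoid_slope(0.0)  # sigmoid's best case: 0.25
    grad_relu *= relu_slope(1.0)        # active ReLU: exactly 1.0

print(grad_sigmoid)  # ~9e-13 — the error signal has all but vanished
print(grad_relu)     # 1.0 — the signal arrives at layer 1 intact
```

0.25 to the 20th power is about 10⁻¹², so early layers in a deep sigmoid network received essentially no learning signal; with ReLU they still do.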

ILSVRC 2012 Results — Top-5 Error Rate 30% 25% 20% 15% 10% 5% 15.3% AlexNet Traditional methods (~25–26%) ~10.8 point gap! After AlexNet: 2013 → 11.7% | 2014 → 6.7% | 2015 → 3.6% (beats humans at ~5.1%)

AlexNet had approximately 60 million parameters. It trained for about 5–6 days on two GPUs. After AlexNet, every competitive ILSVRC entry used deep neural networks. By 2015, deep learning surpassed human-level performance.

AlexNet was not one breakthrough. It was the convergence of three things: an algorithm from the 1980s (CNNs and backpropagation), a dataset nobody had bothered to build until Fei-Fei Li did (ImageNet), and hardware never designed for AI (GPUs for video games). Breakthroughs often happen when existing pieces finally come together.

The Accidental AI Hardware

Here is one of the great ironies of technology history. The hardware that made the AI revolution possible was designed to make video game explosions look cooler. NVIDIA spent years perfecting chips for millions of pixels. Turns out, pixel math and neural network math are the same thing. — Tera
CPU vs. GPU — Why GPUs Revolutionized AI
CPU (Central Processing Unit): a few powerful cores (4–16) — one brilliant mathematician.
GPU (Graphics Processing Unit): thousands of simple cores — an army of calculators.
CPU: weeks to train AlexNet → GPU: ~5 days → speedup: 10x to 70x.
Neural network math = millions of multiply-and-add operations = perfect for GPU parallelism.
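"Millions of multiply-and-add operations" is not a metaphor: the core of a neural network layer is a matrix multiply, and every output element is an independent multiply-and-add over the inputs. A pure-Python sketch with made-up weights:

```python
# One neural-network layer as a matrix-vector multiply.
# Every output (each iteration of the outer loop) is independent of
# the others — which is exactly what thousands of GPU cores exploit.

def layer(weights, inputs):
    return [sum(w * x for w, x in zip(row, inputs))
            for row in weights]

W = [[0.5, -1.0, 2.0],   # hypothetical learned weights
     [1.0,  0.0, 0.5]]
x = [1.0, 2.0, 3.0]      # hypothetical input activations
print(layer(W, x))  # → [4.5, 2.5]
```

Rendering a frame of pixels has the same shape — huge numbers of identical, independent arithmetic operations — which is why graphics chips turned out to be accidental AI hardware.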

NVIDIA released CUDA in 2007, enabling general-purpose computing on GPUs. Andrew Ng’s group at Stanford (2009) showed GPUs could accelerate deep learning by 10–70x. The Google Brain project (2011) trained a network on YouTube frames that spontaneously learned to detect cats through unsupervised learning — it was not specifically trained to find cats, but the ability emerged on its own.

NVIDIA CEO Jensen Huang pivoted the company’s strategy toward AI. A graphics card company became one of the most valuable companies on Earth.

Think About It: The GPU was never designed for AI. It was designed for video games. A technology built for entertainment accidentally enabled one of the most important scientific revolutions of the century. Can you think of other technologies invented for one purpose but transformative for a completely different one?

What Neural Networks See

Nobody TOLD the network to look for edges first, then textures, then parts, then objects. It discovered this hierarchy on its own. And here is what is remarkable: this echoes how neuroscientists believe your visual cortex begins processing. But real brains use feedback loops we don’t fully understand. The neural network is a simplified echo, not a copy. — Tera


A CNN builds this hierarchy through two elegant design choices: convolutional filters — small sliding windows that scan across the image — and pooling — periodically shrinking the image to focus on “what” is there rather than “exactly where.”
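Both building blocks fit in a few lines of pure Python. The sketch below slides a 2×2 vertical-edge filter over a toy 4×4 image (dark on the left, bright on the right) and then max-pools the result; the filter values and image are illustrative, not from any real network.

```python
# Toy convolution and max pooling — the two CNN design choices.

def convolve(image, kernel):
    """Slide a 2x2 kernel over a 2-D image (no padding, stride 1)."""
    h, w = len(image), len(image[0])
    out = []
    for i in range(h - 1):
        row = []
        for j in range(w - 1):
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(2) for dj in range(2)))
        out.append(row)
    return out

def max_pool(image):
    """Keep the strongest response in each 2x2 block: 'what', not 'where'."""
    return [[max(image[i][j], image[i][j + 1],
                 image[i + 1][j], image[i + 1][j + 1])
             for j in range(0, len(image[0]) - 1, 2)]
            for i in range(0, len(image) - 1, 2)]

img = [[0, 0, 9, 9]] * 4      # dark left half, bright right half
edge = [[-1, 1],
        [-1, 1]]              # responds where brightness jumps left→right
fmap = convolve(img, edge)    # strong response exactly at the edge
print(max_pool(fmap))  # → [[18]]
```

The feature map lights up only where the brightness jumps, and pooling keeps that strong response while discarding its exact position — the "what, not where" trade the text describes.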

This architecture was inspired by Kunihiko Fukushima’s Neocognitron (1980) and by David Hubel and Torsten Wiesel’s Nobel Prize-winning research on how the visual cortex processes information.

A deep neural network builds understanding from the bottom up: edges become textures, textures become parts, parts become objects. Each layer is an abstraction built on the layer below — the same principle we have seen throughout this series, from transistors to operating systems to programming languages. Deep learning discovered the power of abstraction on its own.

The Foundations Beneath the Revolution

Let me show you something. None of this happened in isolation. Turing’s math gave us computable functions. Von Neumann gave us hardware. Languages gave us tools. The internet gave us data. Stubborn researchers gave us the algorithms. This is a story about layers. — Tera
The Convergence — Every Layer Enabled the Next
- Issue 1, THEORY — “Can machines think?” Turing (1936, 1950)
- Issue 2, HARDWARE — Von Neumann architecture, transistors, chips
- Issue 3, LANGUAGES — LISP, compilers, abstraction
- Issue 4, SYSTEMS — Unix, C, Bell Labs (where LeCun built CNNs)
- Issue 5, DATA — The web, open source, billions of images & texts
- Issue 6, LEARNING — Neural networks + data + GPUs = deep learning
The Paradigm Shift
Traditional programming: human writes RULES → machine follows.
Machine learning: human provides EXAMPLES → machine discovers the rules.
This inversion is the most fundamental change in computing since the stored-program concept.
Machine learning inverts the fundamental relationship between humans and computers. For sixty years, humans told machines exactly what to do. Machine learning says: here are a million examples of the right answer — figure out the pattern yourself. This is not just a new technique. It is a new paradigm.

Can They Read? Can They Write?

Look at what happened. Three stubborn researchers, a visionary professor, a massive dataset, and two video game graphics cards proved that machines could discover their own rules. Neural nets learned to see, recognize faces, identify diseases, drive cars. And every breakthrough raised the same question: if machines can understand images, what about language? What about thought? — Tera

By 2015, deep learning had conquered computer vision. ResNet — 152 layers with “skip connections” — achieved ~3.57% error on ImageNet, surpassing trained human evaluators at ~5.1%.

But language was a different beast. Images are grids of numbers. Language is sequential, contextual, and ambiguous. The word “bank” means something different in “river bank” and “bank account.” The best tool for sequential data was LSTM (Long Short-Term Memory), invented by Sepp Hochreiter and Jürgen Schmidhuber in 1997. LSTMs could remember context over long sequences — but they processed words one at a time, which was slow.
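Why was one-word-at-a-time processing slow? Because each step depends on the previous step's result, the work cannot be spread across parallel cores the way image math can. A toy sketch of that sequential dependency — the mixing rule here is hypothetical, far simpler than a real LSTM's learned gates:

```python
# Recurrent processing: a "memory" carried forward one word at a time.
# Word N cannot be processed until word N-1 is done — a chain, not a batch.

def step(state, word_vector):
    # Hypothetical mixing rule: blend old memory with the new word.
    return [0.5 * s + 0.5 * w for s, w in zip(state, word_vector)]

sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy word vectors
state = [0.0, 0.0]
for word in sentence:        # strictly one after another
    state = step(state, word)
print(state)  # → [0.625, 0.75] — a summary of everything seen so far
```

The final state summarizes the whole sequence, but building it took one full pass per word — the bottleneck the Transformer's attention mechanism would remove.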

In 2017, a team of eight researchers at Google published a paper with one of the most confident titles in scientific history: “Attention Is All You Need.” What they described — the Transformer — was an architecture for machine translation. But it would soon be adapted to power systems that write essays, generate code, and hold conversations that feel almost human.

Conquered by 2015:
✓ Image recognition (better than humans)
✓ Handwriting recognition
✓ Speech recognition
✓ Game playing (Atari, 2013)
✓ Medical image analysis

Still unsolved:
? Language understanding
? Text generation
? Reasoning and conversation
? Code generation
? Translation at human quality

The gap between SEEING and READING was about to close.
Machine learning is not just a technique. It is a philosophical revolution. For all of computing history, humans translated their understanding into explicit instructions. Machine learning asks: what if the machine could develop its own understanding, directly from experience? This question — first asked by Turing in 1950, kept alive through decades of doubt, and vindicated by AlexNet in 2012 — defines the future of computing.
Think About It: When a neural network classifies an image as “cat” with 97% confidence, does it “understand” what a cat is? It has never touched fur, heard a purr, or been scratched. It has found statistical patterns in pixels. Is that understanding — or something else entirely? Where is the line between pattern matching and knowledge?
Next Issue: A paper called “Attention Is All You Need” introduces the Transformer — an architecture that lets neural networks process entire passages at once. Within six years, machines go from stuttering sentence completion to writing essays, generating code, and holding conversations. The age of the large language model begins. Issue 7: “Attention Is All You Need” →
