← Previous Issue: Attention Is All You Need
In Issue 7, we watched language models learn to write code. You could type a question into a chat window, and an LLM would hand you a function, a class, an entire module. It was extraordinary.
But there was a catch.
You still had to copy the code into your project. You still had to run it yourself. When it failed (and it often did), you had to read the error, go back to the chat, paste the error message, and ask for a fix. Then copy THAT code, run it again, hit a different error, go back to the chat...
Between 2021 and 2025, that wall came down. AI went from suggesting code to writing, running, testing, debugging, and fixing code, all on its own.
This is the story of how AI learned to be not just a writer of code, but a doer of code. And it happened in four distinct leaps.
GitHub Copilot: The First Taste of AI Coding
On June 29, 2021, GitHub announced Copilot as a technical preview. It was built on OpenAI Codex, a 12-billion-parameter language model fine-tuned on billions of lines of publicly available code.
Copilot was a VS Code extension. As you typed, it predicted what came next: not just the next word, but often entire functions. Type a comment describing what you wanted, and Copilot would write the implementation.
Under GitHub CEO Nat Friedman, an open-source advocate who co-founded a Linux desktop company before joining Microsoft, GitHub bet that AI-assisted coding was the future.
Copilot became generally available on June 21, 2022, at $10/month. By February 2023, it had over one million paying subscribers. An internal study claimed developers completed tasks 55% faster, though critics noted the study used a simple task and was conducted by GitHub itself.
But Copilot also sparked controversy. In November 2022, programmer Matthew Butterick filed a class-action lawsuit against GitHub, Microsoft, and OpenAI, arguing that training on copyleft-licensed code without attribution amounted to "software piracy at an unprecedented scale." The legal question (where does "learning from" end and "reproducing" begin?) remains unresolved.
Chat-Based Coding: "Just Describe What You Want"
On November 30, 2022, OpenAI launched ChatGPT. It reached 100 million monthly active users by January 2023 (roughly two months), the fastest-growing consumer application in history at the time.
Developers discovered it was shockingly good at coding. Describe a problem in plain English, and ChatGPT would hand you working code. Paste in an error message and ask "What went wrong?", and it would explain and fix it.
When GPT-4 arrived on March 14, 2023, the leap in code quality was dramatic. Claude, from Anthropic, entered the space in March 2023, followed by Claude 2 in July with a 100,000-token context window.
But there was a maddening problem. The AI could write code, but it could not run code. Every interaction followed the same exhausting loop: describe the task, copy the code into your project, run it, hit an error, paste the error back into the chat, and start again.
The Tool-Use Revolution: AI Learns to Use a Computer
In June 2023, OpenAI introduced function calling for GPT-3.5 and GPT-4. Instead of just generating text, a model could now output structured JSON saying: "Call this function with these arguments."
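A minimal sketch of the idea: the model emits structured JSON naming a function and its arguments, and the host application (not the model) executes the call. The JSON shape and the `get_weather` tool here are illustrative, not any vendor's exact schema.

```python
import json

# Hypothetical tool the model is allowed to call.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

# The kind of structured JSON a function-calling model emits
# instead of free-form prose (shape is illustrative).
model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'

# The host application parses the call and runs the real function.
call = json.loads(model_output)
result = TOOLS[call["name"]](**call["arguments"])
print(result)  # Sunny in Paris
```

The key design point is the separation of concerns: the model only decides *which* function to call and with *what* arguments; the surrounding code validates and executes it.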
The concept traced back to the Toolformer paper (Schick et al., February 2023), which showed LLMs could learn to use tools (calculators, search engines, translators) by themselves.
In July 2023, ChatGPT's Code Interpreter was released. For the first time, a mainstream AI could write Python code AND run it in a sandbox, see the output, and iterate. The AI could finally touch the real world.
Anthropic introduced tool use for Claude models in early 2024, making it generally available with the Claude 3 family. Google's Gemini and Meta's LLaMA-based models followed; tool use became a universal pattern.
In March 2023, Auto-GPT went explosively viral on GitHub as the first attempt at a fully autonomous AI agent. It frequently got stuck in loops and burned through API credits, but it captured the public imagination about what AI agents could become.
The ReAct Loop: Think, Act, Observe, Repeat
In October 2022, Princeton PhD student Shunyu Yao and colleagues published the ReAct paper, "Synergizing Reasoning and Acting in Language Models." It established the foundational pattern for all LLM-based agents.
Before ReAct, there were two separate approaches: chain-of-thought (reasoning without actions) and action-only (tool use without reasoning). ReAct combined them, and outperformed both.
The pattern is deceptively simple: think about what to do; act (use a tool); observe the result; repeat until done. It is essentially the scientific method applied to AI.
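In code, the think-act-observe loop might look like this minimal sketch, where `llm_think` and the tool functions are stand-ins for a real model and real tools:

```python
def react_agent(task, llm_think, tools, max_steps=10):
    """Minimal ReAct-style loop: think, act, observe, repeat."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        # Think: the model reasons over everything seen so far
        # and proposes the next action.
        thought, action, arg = llm_think(history)
        history.append(f"Thought: {thought}")
        if action == "finish":          # the model decides it is done
            return arg
        # Act: run the chosen tool.  Observe: record its output.
        observation = tools[action](arg)
        history.append(f"Action: {action}({arg})")
        history.append(f"Observation: {observation}")
    return None  # step budget exhausted without finishing

# A scripted stand-in for the model, just to show the flow:
def scripted_llm(history):
    if "Observation: Paris" in history[-1]:
        return "I have the answer", "finish", "Paris"
    return "I should search", "search", "capital of France"

answer = react_agent("What is the capital of France?",
                     scripted_llm, {"search": lambda q: "Paris"})
print(answer)  # Paris
```

Real agents differ mainly in scale (the history is a token-limited prompt, the tools are shells and editors), but the control flow is exactly this loop.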
AI Moves Into the Terminal
Paul Gauthier's Aider pioneered the open-source terminal-agent approach, demonstrating that an AI could work directly in a developer's command line. Cursor (by Anysphere) offered an AI-native code editor. In early 2025, Anthropic (co-founded by Dario Amodei and Daniela Amodei, its President) released Claude Code, an agentic coding tool that operates in the terminal. The terminal was becoming the new frontier for AI-assisted development.
Unlike chat-based coding (where you copy-paste between windows), Claude Code acts autonomously within a plan-execute-observe loop. It can read your codebase, edit files, run shell commands, execute tests, interact with git, and iterate on errors, all without human intervention.
The shift was fundamental: from AI as oracle (you ask questions, it gives answers) to AI as worker (you describe a task, it performs the task). You can see every action the agent takes and interrupt or redirect it at any time.
The competitive landscape was exploding. Windsurf (by Codeium) combined chat with agentic "Cascade" flows, and Amazon, Google, and others all launched their own AI coding tools. The roots of this wave stretched back to Auto-GPT, the crude but visionary early-2023 experiment in chaining GPT-4 into an autonomous loop, which became the fastest-growing GitHub repository of its time.
The market evolved from autocomplete (2021) to chat-in-IDE (2023) to AI-native IDEs (2024) to autonomous agents (2025), each wave building on the last.
The Context Window Problem: Why Agents "Forget"
A context window is the maximum amount of text an LLM can process at once β its working memory. Everything the agent has read, written, and thought about in a session must fit inside this window.
Context windows have grown dramatically: from roughly 2,048 tokens in GPT-3 (2020), to 8,000 and 32,000 in GPT-4 (2023), to 100,000 in Claude 2 (2023), 200,000 in the Claude 3 family (2024), and over a million in Gemini 1.5 (2024).
But bigger is not a cure-all. Research by Liu et al. (2023) revealed the "Lost in the Middle" phenomenon: models pay more attention to information at the beginning and end of the context, with degraded recall of information in the middle.
For coding agents, this means: over long sessions, early decisions and constraints get effectively "forgotten." Critical project structure information fades. The agent's reasoning quality degrades as the conversation grows.
Developers have found workarounds (summarizing earlier context, pulling in relevant code on demand via retrieval-augmented generation, and breaking tasks into smaller sub-tasks), but none fully solve the problem.
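A crude sketch of the first workaround: keep the start and end of the session verbatim and compress the middle, which is exactly where "Lost in the Middle" says recall is weakest anyway. Whitespace splitting stands in for a real tokenizer, and `summarize` for an LLM summarization call.

```python
def fit_context(messages, max_tokens, summarize):
    """Keep the head and tail of a session; compress the middle."""
    def tokens(msgs):
        return sum(len(m.split()) for m in msgs)  # crude token count
    if len(messages) <= 6 or tokens(messages) <= max_tokens:
        return messages                           # already fits
    head, middle, tail = messages[:2], messages[2:-4], messages[-4:]
    summary = "Summary of earlier steps: " + summarize(middle)
    return head + [summary] + tail

# Toy session: ten verbose messages against a budget of 100 "tokens".
session = [f"message {i}: " + "details " * 30 for i in range(10)]
trimmed = fit_context(session, 100,
                      lambda msgs: f"{len(msgs)} steps elided")
print(len(session), "->", len(trimmed))  # 10 -> 7
```

Production agents do something similar but smarter: the summary is itself model-generated, and the "tail" is chosen by relevance rather than recency alone.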
Error Recovery: Watching an Agent Debug Itself
A key capability that separates agents from simple code generators is error recovery. When a solution fails (a test breaks, code does not compile, a command returns an error), the agent can read the error message, reason about the cause, and attempt a fix.
Good error recovery requires:
1. Error classification: is it a syntax error, a logic error, or an environment issue?
2. Root cause analysis: the error says line 15, but maybe the real problem is on line 8.
3. Targeted fix: change only what is necessary; do not rewrite everything.
4. Loop detection: recognize when repeated attempts are not converging on a solution.
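The steps above can be sketched as a repair loop, where `run_tests` and `propose_fix` stand in for the agent's real tooling:

```python
def repair_loop(code, run_tests, propose_fix, max_attempts=5):
    """Run tests, read the error, attempt a targeted fix, and stop
    when tests pass or repeated attempts stop converging."""
    seen_errors = set()
    for _ in range(max_attempts):
        ok, error = run_tests(code)
        if ok:
            return code          # tests pass: done
        if error in seen_errors:
            return None          # same error twice: loop detected, give up
        seen_errors.add(error)   # remember what we have already seen
        code = propose_fix(code, error)
    return None                  # attempt budget exhausted

# Toy example: a single fix attempt repairs the code.
fixed = repair_loop(
    "broken",
    run_tests=lambda c: (c == "repaired", "SyntaxError on line 15"),
    propose_fix=lambda c, e: "repaired",
)
print(fixed)  # repaired
```

The `seen_errors` set is the simplest possible loop detector; real agents also track whether the *set* of failing tests is shrinking, not just whether the error text repeats.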
The Limits of a Single Agent: When One Brain Isn't Enough
By 2025, single-agent coding was genuinely impressive. But honest observers, including the teams building these agents, recognized hard limits.
The SWE-bench benchmark, created by researchers at Princeton to evaluate coding agents on 2,294 real GitHub issues from 12 popular Python repositories, provided a sobering reality check. Even the best agents could solve only a fraction of these issues fully autonomously.
In March 2024, Cognition Labs (founded by Scott Wu, an IOI gold medalist) announced Devin, marketed as "the first AI software engineer." The demo generated enormous excitement, and then backlash when independent developers found some claims were overstated. The episode taught the field a critical lesson: agents need rigorous benchmarks, not curated demos.
Bridge: One Agent Is Powerful. But What About a Team?
In just four years, from Copilot's autocomplete in 2021 to autonomous agents in 2025, AI went from whispering suggestions to doing the work. The progression was breathtaking:
Autocomplete finished your sentences. Chat wrote whole solutions. Tool use gave AI hands to interact with the world. Agents combined thinking and doing into a self-directed loop.
But the story is not over. A single agent hits walls: context fills up, errors compound, parallelization is impossible. These are the same walls that limit a single person working alone on a big project.
And humans solved that problem thousands of years ago. We do not tackle big projects alone. We form teams. We specialize. We divide the work. We check each other.
What if AI could do the same?