← Previous Issue: Attention Is All You Need
In Issue 7, we watched language models learn to write code. You could type a question into a chat window, and an LLM would hand you a function, a class, an entire module. It was extraordinary.
But there was a catch.
You still had to copy the code into your project. You still had to run it yourself. When it failed (and it often did), you had to read the error, go back to the chat, paste the error message, and ask for a fix. Then copy THAT code, run it again, hit a different error, go back to the chat...
Between 2021 and 2025, that wall came down. AI went from suggesting code to writing, running, testing, debugging, and fixing code, all on its own.
This is the story of how AI learned to be not just a writer of code, but a doer of code. And it happened in four distinct leaps.
GitHub Copilot: The First Taste of AI Coding
On June 29, 2021, GitHub announced Copilot as a technical preview. It was built on OpenAI Codex, a 12-billion-parameter language model fine-tuned on billions of lines of publicly available code.
Copilot was a VS Code extension. As you typed, it predicted what came next: not just the next word, but often entire functions. Type a comment describing what you wanted, and Copilot would write the implementation.
Under GitHub CEO Nat Friedman, an open-source advocate who co-founded a Linux desktop company before joining Microsoft, GitHub bet that AI-assisted coding was the future.
Copilot became generally available on June 21, 2022, at $10/month. By February 2023, it had over one million paying subscribers. An internal study claimed developers completed tasks 55% faster, though critics noted the study used a simple task and was conducted by GitHub itself.
But Copilot also sparked controversy. In November 2022, programmer Matthew Butterick filed a class-action lawsuit against GitHub, Microsoft, and OpenAI, arguing that training on copyleft-licensed code without attribution amounted to "software piracy at an unprecedented scale." The legal question (where does "learning from" end and "reproducing" begin?) remains unresolved.
Chat-Based Coding: "Just Describe What You Want"
On November 30, 2022, OpenAI launched ChatGPT. It reached 100 million monthly active users by January 2023 (roughly two months), the fastest-growing consumer application in history at the time.
Developers discovered it was shockingly good at coding. Describe a problem in plain English, and ChatGPT would hand you working code. Paste in an error message and ask "What went wrong?", and it would explain and fix it.
When GPT-4 arrived on March 14, 2023, the leap in code quality was dramatic. Claude, from Anthropic, entered the space in March 2023, followed by Claude 2 in July with a 100,000-token context window.
But there was a maddening problem. The AI could write code, but it could not run code. Every interaction followed the same exhausting loop: describe the task, copy the code into your project, run it, hit an error, paste the error back into the chat, and start again.
The Tool-Use Revolution: AI Learns to Use a Computer
In June 2023, OpenAI introduced function calling for GPT-3.5 and GPT-4. Instead of just generating text, a model could now output structured JSON saying: "Call this function with these arguments."
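A minimal sketch of the idea: the model emits structured JSON naming a function and its arguments, and the host application (not the model) executes the call. The JSON shape and the `get_weather` tool here are illustrative, not any vendor's exact schema.

```python
import json

# Hypothetical tool the model is allowed to call.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

# The kind of structured JSON a function-calling model emits
# instead of free-form prose (shape is illustrative).
model_output = '{"name": "get_weather", "arguments": {"city": "Paris"}}'

# The host application parses the call and runs the real function.
call = json.loads(model_output)
result = TOOLS[call["name"]](**call["arguments"])
print(result)  # Sunny in Paris
```

The key design point is the separation of concerns: the model only decides *which* function to call and with *what* arguments; the surrounding code validates and executes it.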
The concept traced back to the Toolformer paper (Schick et al., February 2023), which showed LLMs could learn to use tools (calculators, search engines, translators) by themselves.
In July 2023, ChatGPT's Code Interpreter was released. For the first time, a mainstream AI could write Python code AND run it in a sandbox, see the output, and iterate. The AI could finally touch the real world.
Anthropic introduced tool use for Claude models in early 2024, making it generally available with the Claude 3 family. Google's Gemini and Meta's LLaMA-based models followed; tool use became a universal pattern.
In March 2023, Auto-GPT went explosively viral on GitHub as the first attempt at a fully autonomous AI agent. It frequently got stuck in loops and burned through API credits, but it captured the public imagination about what AI agents could become.
The ReAct Loop: Think, Act, Observe, Repeat
In October 2022, Princeton PhD student Shunyu Yao and colleagues published the ReAct paper, "Synergizing Reasoning and Acting in Language Models." It established the foundational pattern for all LLM-based agents.
Before ReAct, there were two separate approaches: chain-of-thought (reasoning without actions) and action-only (tool use without reasoning). ReAct combined them, and outperformed both.
The pattern is deceptively simple: think about what to do; act (use a tool); observe the result; repeat until done. It is essentially the scientific method applied to AI.
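In code, the think-act-observe loop might look like this minimal sketch, where `llm_think` and the tool functions are stand-ins for a real model and real tools:

```python
def react_agent(task, llm_think, tools, max_steps=10):
    """Minimal ReAct-style loop: think, act, observe, repeat."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        # Think: the model reasons over everything seen so far
        # and proposes the next action.
        thought, action, arg = llm_think(history)
        history.append(f"Thought: {thought}")
        if action == "finish":          # the model decides it is done
            return arg
        # Act: run the chosen tool.  Observe: record its output.
        observation = tools[action](arg)
        history.append(f"Action: {action}({arg})")
        history.append(f"Observation: {observation}")
    return None  # step budget exhausted without finishing

# A scripted stand-in for the model, just to show the flow:
def scripted_llm(history):
    if "Observation: Paris" in history[-1]:
        return "I have the answer", "finish", "Paris"
    return "I should search", "search", "capital of France"

answer = react_agent("What is the capital of France?",
                     scripted_llm, {"search": lambda q: "Paris"})
print(answer)  # Paris
```

Real agents differ mainly in scale (the history is a token-limited prompt, the tools are shells and editors), but the control flow is exactly this loop.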
AI Moves Into the Terminal
Paul Gauthier's Aider pioneered the open-source terminal-agent approach, demonstrating that an AI could work directly in a developer's command line. Cursor (by Anysphere) offered an AI-native code editor. In early 2025, Anthropic (co-founded by Dario Amodei and Daniela Amodei, its President) released Claude Code, an agentic coding tool that operates in the terminal. The terminal was becoming the new frontier for AI-assisted development.
Unlike chat-based coding (where you copy-paste between windows), Claude Code acts autonomously within a plan-execute-observe loop. It can read your codebase, edit files, run shell commands, execute tests, interact with git, and iterate on errors, all without human intervention.
The shift was fundamental: from AI as oracle (you ask questions, it gives answers) to AI as worker (you describe a task, it performs the task). You can see every action the agent takes and interrupt or redirect it at any time.
The competitive landscape was exploding. Windsurf (by Codeium) combined chat with agentic "Cascade" flows, and Amazon, Google, and others all launched their own AI coding tools. The roots of this wave stretched back to Auto-GPT, the crude but visionary early-2023 experiment in chaining GPT-4 into an autonomous loop, which became the fastest-growing GitHub repository of its time.
The market evolved from autocomplete (2021) to chat-in-IDE (2023) to AI-native IDEs (2024) to autonomous agents (2025), each wave building on the last.
The Context Window Problem: Why Agents "Forget"
A context window is the maximum amount of text an LLM can process at once β its working memory. Everything the agent has read, written, and thought about in a session must fit inside this window.
Context windows have grown dramatically: from roughly 2,048 tokens in GPT-3 (2020), to 8,000 and 32,000 in GPT-4 (2023), to 100,000 in Claude 2 (2023), 200,000 in the Claude 3 family (2024), and over a million in Gemini 1.5 (2024).
But bigger is not a cure-all. Research by Liu et al. (2023) revealed the "Lost in the Middle" phenomenon: models pay more attention to information at the beginning and end of the context, with degraded recall of information in the middle.
For coding agents, this means: over long sessions, early decisions and constraints get effectively "forgotten." Critical project structure information fades. The agent's reasoning quality degrades as the conversation grows.
Developers have found workarounds (summarizing earlier context, pulling in relevant code on demand via retrieval-augmented generation, and breaking tasks into smaller sub-tasks), but none fully solve the problem.
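A crude sketch of the first workaround: keep the start and end of the session verbatim and compress the middle, which is exactly where "Lost in the Middle" says recall is weakest anyway. Whitespace splitting stands in for a real tokenizer, and `summarize` for an LLM summarization call.

```python
def fit_context(messages, max_tokens, summarize):
    """Keep the head and tail of a session; compress the middle."""
    def tokens(msgs):
        return sum(len(m.split()) for m in msgs)  # crude token count
    if len(messages) <= 6 or tokens(messages) <= max_tokens:
        return messages                           # already fits
    head, middle, tail = messages[:2], messages[2:-4], messages[-4:]
    summary = "Summary of earlier steps: " + summarize(middle)
    return head + [summary] + tail

# Toy session: ten verbose messages against a budget of 100 "tokens".
session = [f"message {i}: " + "details " * 30 for i in range(10)]
trimmed = fit_context(session, 100,
                      lambda msgs: f"{len(msgs)} steps elided")
print(len(session), "->", len(trimmed))  # 10 -> 7
```

Production agents do something similar but smarter: the summary is itself model-generated, and the "tail" is chosen by relevance rather than recency alone.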
Error Recovery: Watching an Agent Debug Itself
A key capability that separates agents from simple code generators is error recovery. When a solution fails (a test breaks, code does not compile, a command returns an error), the agent can read the error message, reason about the cause, and attempt a fix.
Good error recovery requires:
1. Error classification: is it a syntax error, a logic error, or an environment issue?
2. Root cause analysis: the error says line 15, but maybe the real problem is on line 8.
3. Targeted fix: change only what is necessary; do not rewrite everything.
4. Loop detection: recognize when repeated attempts are not converging on a solution.
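The steps above can be sketched as a repair loop, where `run_tests` and `propose_fix` stand in for the agent's real tooling:

```python
def repair_loop(code, run_tests, propose_fix, max_attempts=5):
    """Run tests, read the error, attempt a targeted fix, and stop
    when tests pass or repeated attempts stop converging."""
    seen_errors = set()
    for _ in range(max_attempts):
        ok, error = run_tests(code)
        if ok:
            return code          # tests pass: done
        if error in seen_errors:
            return None          # same error twice: loop detected, give up
        seen_errors.add(error)   # remember what we have already seen
        code = propose_fix(code, error)
    return None                  # attempt budget exhausted

# Toy example: a single fix attempt repairs the code.
fixed = repair_loop(
    "broken",
    run_tests=lambda c: (c == "repaired", "SyntaxError on line 15"),
    propose_fix=lambda c, e: "repaired",
)
print(fixed)  # repaired
```

The `seen_errors` set is the simplest possible loop detector; real agents also track whether the *set* of failing tests is shrinking, not just whether the error text repeats.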
The Limits of a Single Agent: When One Brain Isn't Enough
By 2025, single-agent coding was genuinely impressive. But honest observers, including the teams building these agents, recognized hard limits.
The SWE-bench benchmark, created by researchers at Princeton to evaluate coding agents on 2,294 real GitHub issues from 12 popular Python repositories, provided a sobering reality check. Even the best agents could solve only a fraction of these issues fully autonomously.
In March 2024, Cognition Labs (founded by Scott Wu, an IOI gold medalist) announced Devin, marketed as "the first AI software engineer." The demo generated enormous excitement, and then backlash when independent developers found some claims were overstated. The episode taught the field a critical lesson: agents need rigorous benchmarks, not curated demos.
Bridge: One Agent Is Powerful. But What About a Team?
In just four years, from Copilot's autocomplete in 2021 to autonomous agents in 2025, AI went from whispering suggestions to doing the work. The progression was breathtaking:
Autocomplete finished your sentences. Chat wrote whole solutions. Tool use gave AI hands to interact with the world. Agents combined thinking and doing into a self-directed loop.
But the story is not over. A single agent hits walls: context fills up, errors compound, parallelization is impossible. These are the same walls that limit a single person working alone on a big project.
And humans solved that problem thousands of years ago. We do not tackle big projects alone. We form teams. We specialize. We divide the work. We check each other.
What if AI could do the same?