Illustration companion to Issue 7: Attention Is All You Need
Every component of the architecture that changed AI -- labeled and explained
Bottom to top = input to output. This is what GPT and Claude look like inside.
Click "Step" or "Play" to walk through the transformer data flow
"cat" becomes [0.91, 0.42, -0.33, ...] -- a point in meaning-space. Words with similar meanings land near each other. "King" minus "man" plus "woman" equals something close to "queen."
"GPT-4 class" model (estimated):
~96 Transformer blocks stacked
~768--12,288 embedding dimensions per token
~50,000 token vocabulary (BPE subwords)
~175 billion+ parameters (learned weights)
Training GPT-3 (175B params) reportedly cost ~$4.6M in compute. Current frontier models likely cost $50M--$100M+. Most of the cost is matrix multiplication -- the same operation, trillions of times.
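The parameter count above can be roughly reproduced with back-of-envelope arithmetic. This is a sketch using the publicly reported GPT-3 sizes; it ignores biases, layer norms, and positional embeddings.

```python
# Back-of-envelope parameter arithmetic at GPT-3 scale (a sketch, not
# an exact accounting -- biases and layer norms are omitted).
d_model = 12288         # embedding dimensions per token
n_layers = 96           # stacked Transformer blocks
vocab = 50257           # BPE vocabulary size (GPT-2/3 tokenizer)

# Per block: attention projections (Q, K, V, output) + a 4x-wide MLP
attn = 4 * d_model * d_model
mlp = 2 * d_model * (4 * d_model)
per_block = attn + mlp

embedding = vocab * d_model
total = n_layers * per_block + embedding
print(f"{total / 1e9:.0f}B parameters")  # lands near the reported 175B
```

Nearly all of those parameters live in weight matrices, which is why training and inference cost is dominated by matrix multiplication.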
How one attention head processes three tokens: "The", "cat", "sat"
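The head's computation over those three tokens can be sketched as scaled dot-product attention. The 4-dim vectors and projection weights below are random stand-ins, not values from a trained model.

```python
# Minimal sketch of one attention head over "The", "cat", "sat".
# Vectors and weights are random stand-ins, not real model values.
import numpy as np

np.random.seed(0)
d = 4                                # toy head dimension
x = np.random.randn(3, d)            # one row per token: The, cat, sat

# Learned projections (random here) map each token to query, key, value
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

# Scaled dot-product attention: scores -> softmax -> weighted sum of values
scores = Q @ K.T / np.sqrt(d)        # (3, 3): how much each token attends to each
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
out = weights @ V                    # (3, d): context-mixed token vectors

print(weights.round(2))              # each row sums to 1
```

Each row of `weights` is one token's attention distribution over all three tokens; `out` is the new representation of each token after mixing in context.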
From Turing to LLMs and Beyond -- Illustration Series