Illustration companion to Issue 7: Attention Is All You Need

The Transformer Blueprint

Every component of the architecture that changed AI -- labeled and explained

The Full Architecture (Decoder-Only)

Bottom to top = input to output. This is what GPT and Claude look like inside.


INPUT: "The cat sat"

Tokenizer: "The" --> ID 464 | "cat" --> ID 9246 | "sat" --> ID 3532

(1) Token Embedding
    "The" --> [0.23, -0.15, 0.87, ...]
    "cat" --> [0.91, 0.42, -0.33, ...]
    "sat" --> [-0.56, 0.78, 0.14, ...]
    Each token becomes a point in 768-dimensional space.

(2) Positional Encoding (added)
    embed + position = combined, built from sine and cosine waves.
    Without this, "dog bites man" = "man bites dog".

Transformer Block -- repeated N times (12 to 96 layers)
    Layer Norm 1 -- stabilizes the numbers
    (3) Multi-Head Self-Attention
        The input is split into three projections:
            Q (Query): "What am I looking for?"
            K (Key): "What do I contain?"
            V (Value): "What information do I pass along?"
        The projections are split across attention heads, e.g.:
            Head 1: syntax | Head 2: coreference | Head 3: semantics | Head 4: position
        Each head computes: softmax(Q x K^T / sqrt(d_k)) x V
        In the attention matrix, darker = stronger attention, and each row sums to 1.0.
        All heads are concatenated, then passed through a linear projection.
    (4) Residual Connection 1 + Layer Norm 2
    (5) Feed-Forward Network (FFN)
        Linear 768 --> 3072, GELU, Linear 3072 --> 768 (expand, then compress)
    Residual Connection 2

OUTPUT HEAD
    Final Layer Norm
    Linear projection: 768 dims --> 50,257 (the vocabulary size)
    Softmax --> probability distribution
    (6) Next-token prediction: "on" 23% | "down" 15% | "up" 12% | "in" 8% | ...
    The model assigns a probability to every token in its ~50,000-word vocabulary.

Flow summary: Text --> Tokens --> Embeddings + Position --> (LayerNorm --> Attention --> Residual --> LayerNorm --> FFN --> Residual) x N --> Next Token

Decoder-only architecture (GPT, Claude, LLaMA, etc.)
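The whole pipeline can be sketched in a few dozen lines of NumPy. This is a toy, not GPT: the dimensions are shrunk, all weights are random (so the output distribution is meaningless), and only one block is applied -- but the data flow (embed, add position, attend, FFN, project to vocabulary) is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, vocab_size, seq_len = 64, 256, 1000, 3  # toy sizes, not GPT's 768/3072/50257

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    mask = np.triu(np.ones((len(x), len(x)), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)   # causal mask: no attending to future tokens
    return softmax(scores) @ V

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Random weights stand in for trained parameters
p = {name: rng.normal(size=shape) * 0.02 for name, shape in [
    ("Wq", (d_model, d_model)), ("Wk", (d_model, d_model)), ("Wv", (d_model, d_model)),
    ("W1", (d_model, d_ff)), ("W2", (d_ff, d_model)), ("Wout", (d_model, vocab_size))]}
E = rng.normal(size=(vocab_size, d_model)) * 0.02   # embedding table

ids = np.array([464, 246, 532]) % vocab_size        # tokenizer output (toy IDs)
x = E[ids]                                          # token embedding lookup
pos = np.arange(seq_len)[:, None] / np.arange(1, d_model + 1)[None, :]
x = x + np.sin(pos)                                 # crude positional signal (sketch only)

# One transformer block; a real model repeats this N times
h = x + attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"])  # attention sub-layer + residual
h = h + gelu(layer_norm(h) @ p["W1"]) @ p["W2"]              # FFN sub-layer + residual

logits = layer_norm(h)[-1] @ p["Wout"]              # output head, last position only
probs = softmax(logits)                             # distribution over the next token
print(probs.shape)                                  # (1000,)
```

Stacking more blocks is just repeating the two `h = h + ...` lines with fresh weights; everything else stays the same.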

Numbered Callouts

1 Embedding: "cat" becomes [0.91, 0.42, -0.33, ...] -- a point in meaning-space. Words with similar meanings land near each other. "King" minus "man" plus "woman" equals something close to "queen."
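The analogy arithmetic can be demonstrated with a hand-built toy embedding space. The 2-D vectors below are invented for illustration (axis 0 for gender, axis 1 for royalty); real models learn hundreds of dimensions from data.

```python
import numpy as np

# Toy 2-D "meaning space", hand-picked so the analogy works
emb = {
    "man":   np.array([ 1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "king":  np.array([ 1.0, 1.0]),
    "queen": np.array([-1.0, 1.0]),
}

result = emb["king"] - emb["man"] + emb["woman"]

# Find the nearest word to the resulting point
nearest = min(emb, key=lambda w: np.linalg.norm(emb[w] - result))
print(nearest)  # -> queen
```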
2 Positional encoding uses sine waves of different frequencies so the model knows word ORDER. Without it, "dog bites man" and "man bites dog" would be identical to the model.
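A minimal sketch of the sinusoidal encoding from the original paper -- even-numbered dimensions get sines, odd-numbered ones get cosines, at geometrically spaced frequencies:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = positional_encoding(50, 768)
# Each position gets a unique fingerprint; the model just adds it: x = embeddings + pe
```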
3 Self-attention lets every token "look at" every other token and decide how much to pay attention to each one. The word "it" can attend to "cat" three sentences back -- solving a problem that stumped AI for decades.
4 Residual connections (skip connections) add the original input back to the output at each sub-layer. They prevent the signal from fading in deep networks -- without them, a 96-layer model would forget its own input by layer 10.
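The fading-signal effect is easy to demonstrate: stack the same random transformation 96 times with and without skip connections. The `sublayer` function below is a stand-in for attention or an FFN, not the real thing.

```python
import numpy as np

rng = np.random.default_rng(0)

def sublayer(x, W):
    return np.tanh(x @ W)  # stand-in for an attention/FFN sub-layer

x = rng.normal(size=(1, 64))
Ws = [rng.normal(size=(64, 64)) * 0.05 for _ in range(96)]

plain, resid = x, x
for W in Ws:
    plain = sublayer(plain, W)          # no skip: the signal shrinks layer by layer
    resid = resid + sublayer(resid, W)  # skip connection: the input is always carried forward

print(np.abs(plain).mean(), np.abs(resid).mean())
# Without residuals the activations collapse toward zero; with them the signal survives all 96 layers.
```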
5 The FFN is where individual token "thinking" happens. Each token is processed independently through two linear layers. The expansion to 4x width (768 to 3072) gives the network room to compute many intermediate features; projecting back down to 768 forces it to compress those features into the representation it passes on.
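A sketch of the FFN sub-layer with GPT-2-style sizes and the tanh approximation of GELU; the weights here are random placeholders for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 768, 3072  # 4x expansion

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

W1 = rng.normal(size=(d_model, d_ff)) * 0.02   # expand: 768 -> 3072
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02   # compress: 3072 -> 768
b2 = np.zeros(d_model)

def ffn(x):
    # Applied to each token row independently -- no mixing between tokens here
    return gelu(x @ W1 + b1) @ W2 + b2

tokens = rng.normal(size=(3, d_model))   # "The", "cat", "sat"
out = ffn(tokens)
print(out.shape)  # (3, 768)
```

Because the FFN acts row by row, running it on one token alone gives the same answer as running it on the whole batch -- token mixing only ever happens in attention.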
6 After 96 layers of this, "The cat sat" produces "on" with 23% confidence. The entire model -- billions of parameters -- exists to convert a sequence of words into a probability distribution over the next word. Generation happens one token at a time: predict, append, repeat.
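The predict-append-repeat loop in miniature, with a stand-in `model` function in place of the real transformer stack. The toy vocabulary and the deterministic fake logits are invented for illustration.

```python
import numpy as np

vocab = ["on", "down", "up", "in", "the", "cat", "sat", "The"]  # toy vocabulary

def model(tokens):
    """Stand-in for the full transformer: returns logits over the vocabulary.
    A real model runs the whole stack; here we derive fake logits from the input."""
    seed = sum(ord(c) for t in tokens for c in t)
    return np.random.default_rng(seed).normal(size=len(vocab))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

tokens = ["The", "cat", "sat"]
for _ in range(3):                               # predict, append, repeat
    probs = softmax(model(tokens))               # probability over every vocab entry
    tokens.append(vocab[int(np.argmax(probs))])  # greedy: take the most likely token

print(tokens)
```

Real systems usually sample from `probs` (with temperature, top-p, etc.) instead of always taking the argmax, which is why the same prompt can produce different continuations.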

Scale Reference

Training GPT-3 (175B parameters) reportedly cost ~$4.6M in compute; current "GPT-4 class" frontier models likely cost $50M--$100M+ (estimated). Most of the cost is matrix multiplication -- the same operation, trillions of times.
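A back-of-envelope check using the common "compute ≈ 6 x parameters x training tokens" rule of thumb, with GPT-3's reported 300B training tokens:

```python
# Rule of thumb: training FLOPs ~ 6 * N * D (forward + backward pass per token)
params = 175e9           # GPT-3 parameter count
tokens = 300e9           # GPT-3's reported training token count
flops = 6 * params * tokens
print(f"{flops:.2e}")    # ~3.15e+23 floating-point operations, mostly matrix multiplies
```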

The QKV Mechanism -- Zoomed In

How one attention head processes three tokens: "The", "cat", "sat"

Step 1: Input token vectors (each d_model = 768)
    x_1 "The" [0.23, -0.15, 0.87, ...]
    x_2 "cat" [0.91, 0.42, -0.33, ...]
    x_3 "sat" [-0.56, 0.78, 0.14, ...]

Step 2: Multiply by learned weight matrices
    W_Q (Query weights), W_K (Key weights), W_V (Value weights) -- each 768 x 64 (d_model x d_k)
    Result: Q, K, V vectors for each token (d_k = 64):
    q_The, q_cat, q_sat | k_The, k_cat, k_sat | v_The, v_cat, v_sat

Step 3: Compute raw attention scores: Q x K^T
             k_The   k_cat   k_sat
    q_The     8.2     2.1     0.5
    q_cat     1.8     7.6     1.2
    q_sat     0.3     2.4     9.1

Step 4: Scale -- divide each score by sqrt(d_k) = sqrt(64) = 8
    This prevents the scores from growing too large for softmax.
    "The": [1.03, 0.26, 0.06]
    "cat": [0.23, 0.95, 0.15]
    "sat": [0.04, 0.30, 1.14]

Step 5: Softmax (each row sums to 1.0)
    "The": [0.54, 0.25, 0.21] -- "The" mostly attends to itself
    "cat": [0.25, 0.52, 0.23] -- "cat" mostly attends to itself
    "sat": [0.19, 0.25, 0.57] -- "sat" mostly attends to itself
    (A decoder-only model would also apply a causal mask so tokens cannot attend to later positions; it is omitted here for simplicity.)

Step 6: Multiply attention weights x V --> output
    Each token's output is a weighted combination of all Value vectors.

Q = Query, K = Key, V = Value
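The scoring steps above can be executed directly. The raw scores are those from Step 3; the 4-dimensional Value vectors are invented, since the walk-through does not list V's contents.

```python
import numpy as np

# Raw Q.K^T scores for "The", "cat", "sat" (Step 3)
scores = np.array([[8.2, 2.1, 0.5],
                   [1.8, 7.6, 1.2],
                   [0.3, 2.4, 9.1]])

d_k = 64
scaled = scores / np.sqrt(d_k)          # Step 4: divide by sqrt(64) = 8

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

weights = softmax(scaled)               # Step 5: each row sums to 1.0
# rows ~ [0.54, 0.25, 0.21], [0.25, 0.52, 0.23], [0.19, 0.25, 0.57]

# Step 6: output = weights @ V (toy 4-dim Value vectors, invented for the demo)
V = np.array([[0.1, 0.2, 0.3, 0.4],
              [0.5, 0.6, 0.7, 0.8],
              [0.9, 1.0, 1.1, 1.2]])
output = weights @ V                    # each row: weighted mix of all Value vectors

print(np.round(weights, 2))
```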

From Turing to LLMs and Beyond -- Illustration Series
