Illustration companion to Issue 7: Attention Is All You Need

The Transformer Blueprint

Every component of the architecture that changed AI -- labeled and explained

The Full Architecture (Decoder-Only)

Bottom to top = input to output. This is what GPT and Claude look like inside.


INPUT: "The cat sat"

Tokenizer: "The" --> ID 464 | "cat" --> ID 9246 | "sat" --> ID 3532

(1) Token Embedding
    "The" --> [0.23, -0.15, 0.87, ...]
    "cat" --> [0.91, 0.42, -0.33, ...]
    "sat" --> [-0.56, 0.78, 0.14, ...]
    Each token becomes a point in 768-dimensional space.

(2) Positional Encoding (added)
    embed + position = combined, built from sine and cosine waves.
    Without this, "dog bites man" = "man bites dog".

Transformer Block -- repeated N times (12 to 96 layers)
    Layer Norm 1 -- stabilizes the numbers
    (3) Multi-Head Self-Attention
        The input is split into three projections:
            Q (Query): "What am I looking for?"
            K (Key): "What do I contain?"
            V (Value): "What information do I pass along?"
        The projections are split across attention heads, e.g.:
            Head 1: syntax | Head 2: coreference | Head 3: semantics | Head 4: position
        Each head computes: softmax(Q x K^T / sqrt(d_k)) x V
        In the attention matrix, darker = stronger attention, and each row sums to 1.0.
        All heads are concatenated, then passed through a linear projection.
    (4) Residual Connection 1 + Layer Norm 2
    (5) Feed-Forward Network (FFN)
        Linear 768 --> 3072, GELU, Linear 3072 --> 768 (expand, then compress)
    Residual Connection 2

OUTPUT HEAD
    Final Layer Norm
    Linear projection: 768 dims --> 50,257 (the vocabulary size)
    Softmax --> probability distribution
    (6) Next-token prediction: "on" 23% | "down" 15% | "up" 12% | "in" 8% | ...
    The model assigns a probability to every token in its ~50,000-word vocabulary.

Flow summary: Text --> Tokens --> Embeddings + Position --> (LayerNorm --> Attention --> Residual --> LayerNorm --> FFN --> Residual) x N --> Next Token

Decoder-only architecture (GPT, Claude, LLaMA, etc.)
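The whole pipeline can be sketched in a few dozen lines of NumPy. This is a toy, not GPT: the dimensions are shrunk, all weights are random (so the output distribution is meaningless), and only one block is applied -- but the data flow (embed, add position, attend, FFN, project to vocabulary) is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, vocab_size, seq_len = 64, 256, 1000, 3  # toy sizes, not GPT's 768/3072/50257

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    mask = np.triu(np.ones((len(x), len(x)), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)   # causal mask: no attending to future tokens
    return softmax(scores) @ V

def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Random weights stand in for trained parameters
p = {name: rng.normal(size=shape) * 0.02 for name, shape in [
    ("Wq", (d_model, d_model)), ("Wk", (d_model, d_model)), ("Wv", (d_model, d_model)),
    ("W1", (d_model, d_ff)), ("W2", (d_ff, d_model)), ("Wout", (d_model, vocab_size))]}
E = rng.normal(size=(vocab_size, d_model)) * 0.02   # embedding table

ids = np.array([464, 246, 532]) % vocab_size        # tokenizer output (toy IDs)
x = E[ids]                                          # token embedding lookup
pos = np.arange(seq_len)[:, None] / np.arange(1, d_model + 1)[None, :]
x = x + np.sin(pos)                                 # crude positional signal (sketch only)

# One transformer block; a real model repeats this N times
h = x + attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"])  # attention sub-layer + residual
h = h + gelu(layer_norm(h) @ p["W1"]) @ p["W2"]              # FFN sub-layer + residual

logits = layer_norm(h)[-1] @ p["Wout"]              # output head, last position only
probs = softmax(logits)                             # distribution over the next token
print(probs.shape)                                  # (1000,)
```

Stacking more blocks is just repeating the two `h = h + ...` lines with fresh weights; everything else stays the same.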

Numbered Callouts

1 Embedding: "cat" becomes [0.91, 0.42, -0.33, ...] -- a point in meaning-space. Words with similar meanings land near each other. "King" minus "man" plus "woman" equals something close to "queen."
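The analogy arithmetic can be demonstrated with a hand-built toy embedding space. The 2-D vectors below are invented for illustration (axis 0 for gender, axis 1 for royalty); real models learn hundreds of dimensions from data.

```python
import numpy as np

# Toy 2-D "meaning space", hand-picked so the analogy works
emb = {
    "man":   np.array([ 1.0, 0.0]),
    "woman": np.array([-1.0, 0.0]),
    "king":  np.array([ 1.0, 1.0]),
    "queen": np.array([-1.0, 1.0]),
}

result = emb["king"] - emb["man"] + emb["woman"]

# Find the nearest word to the resulting point
nearest = min(emb, key=lambda w: np.linalg.norm(emb[w] - result))
print(nearest)  # -> queen
```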
2 Positional encoding uses sine waves of different frequencies so the model knows word ORDER. Without it, "dog bites man" and "man bites dog" would be identical to the model.
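A minimal sketch of the sinusoidal encoding from the original paper -- even-numbered dimensions get sines, odd-numbered ones get cosines, at geometrically spaced frequencies:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = positional_encoding(50, 768)
# Each position gets a unique fingerprint; the model just adds it: x = embeddings + pe
```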
3 Self-attention lets every token "look at" every other token and decide how much to pay attention to each one. The word "it" can attend to "cat" three sentences back -- solving a problem that stumped AI for decades.
4 Residual connections (skip connections) add the original input back to the output at each sub-layer. They prevent the signal from fading in deep networks -- without them, a 96-layer model would forget its own input by layer 10.
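The fading-signal effect is easy to demonstrate: stack the same random transformation 96 times with and without skip connections. The `sublayer` function below is a stand-in for attention or an FFN, not the real thing.

```python
import numpy as np

rng = np.random.default_rng(0)

def sublayer(x, W):
    return np.tanh(x @ W)  # stand-in for an attention/FFN sub-layer

x = rng.normal(size=(1, 64))
Ws = [rng.normal(size=(64, 64)) * 0.05 for _ in range(96)]

plain, resid = x, x
for W in Ws:
    plain = sublayer(plain, W)          # no skip: the signal shrinks layer by layer
    resid = resid + sublayer(resid, W)  # skip connection: the input is always carried forward

print(np.abs(plain).mean(), np.abs(resid).mean())
# Without residuals the activations collapse toward zero; with them the signal survives all 96 layers.
```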
5 The FFN is where individual token "thinking" happens. Each token is processed independently through two linear layers. The expansion to 4x width (768 to 3072) gives the network room to compute many intermediate features; projecting back down to 768 forces it to compress those features into the representation it passes on.
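A sketch of the FFN sub-layer with GPT-2-style sizes and the tanh approximation of GELU; the weights here are random placeholders for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 768, 3072  # 4x expansion

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

W1 = rng.normal(size=(d_model, d_ff)) * 0.02   # expand: 768 -> 3072
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02   # compress: 3072 -> 768
b2 = np.zeros(d_model)

def ffn(x):
    # Applied to each token row independently -- no mixing between tokens here
    return gelu(x @ W1 + b1) @ W2 + b2

tokens = rng.normal(size=(3, d_model))   # "The", "cat", "sat"
out = ffn(tokens)
print(out.shape)  # (3, 768)
```

Because the FFN acts row by row, running it on one token alone gives the same answer as running it on the whole batch -- token mixing only ever happens in attention.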
6 After 96 layers of this, "The cat sat" produces "on" with 23% confidence. The entire model -- billions of parameters -- exists to convert a sequence of words into a probability distribution over the next word. Generation happens one token at a time: predict, append, repeat.
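The predict-append-repeat loop in miniature, with a stand-in `model` function in place of the real transformer stack. The toy vocabulary and the deterministic fake logits are invented for illustration.

```python
import numpy as np

vocab = ["on", "down", "up", "in", "the", "cat", "sat", "The"]  # toy vocabulary

def model(tokens):
    """Stand-in for the full transformer: returns logits over the vocabulary.
    A real model runs the whole stack; here we derive fake logits from the input."""
    seed = sum(ord(c) for t in tokens for c in t)
    return np.random.default_rng(seed).normal(size=len(vocab))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

tokens = ["The", "cat", "sat"]
for _ in range(3):                               # predict, append, repeat
    probs = softmax(model(tokens))               # probability over every vocab entry
    tokens.append(vocab[int(np.argmax(probs))])  # greedy: take the most likely token

print(tokens)
```

Real systems usually sample from `probs` (with temperature, top-p, etc.) instead of always taking the argmax, which is why the same prompt can produce different continuations.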

Scale Reference

Training GPT-3 (175B parameters) reportedly cost ~$4.6M in compute; current "GPT-4 class" frontier models likely cost $50M--$100M+ (estimated). Most of the cost is matrix multiplication -- the same operation, trillions of times.
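A back-of-envelope check using the common "compute ≈ 6 x parameters x training tokens" rule of thumb, with GPT-3's reported 300B training tokens:

```python
# Rule of thumb: training FLOPs ~ 6 * N * D (forward + backward pass per token)
params = 175e9           # GPT-3 parameter count
tokens = 300e9           # GPT-3's reported training token count
flops = 6 * params * tokens
print(f"{flops:.2e}")    # ~3.15e+23 floating-point operations, mostly matrix multiplies
```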

The QKV Mechanism -- Zoomed In

How one attention head processes three tokens: "The", "cat", "sat"

Step 1: Input token vectors (each d_model = 768)
    x_1 "The" [0.23, -0.15, 0.87, ...]
    x_2 "cat" [0.91, 0.42, -0.33, ...]
    x_3 "sat" [-0.56, 0.78, 0.14, ...]

Step 2: Multiply by learned weight matrices
    W_Q (Query weights), W_K (Key weights), W_V (Value weights) -- each 768 x 64 (d_model x d_k)
    Result: Q, K, V vectors for each token (d_k = 64):
    q_The, q_cat, q_sat | k_The, k_cat, k_sat | v_The, v_cat, v_sat

Step 3: Compute raw attention scores: Q x K^T
             k_The   k_cat   k_sat
    q_The     8.2     2.1     0.5
    q_cat     1.8     7.6     1.2
    q_sat     0.3     2.4     9.1

Step 4: Scale -- divide each score by sqrt(d_k) = sqrt(64) = 8
    This prevents the scores from growing too large for softmax.
    "The": [1.03, 0.26, 0.06]
    "cat": [0.23, 0.95, 0.15]
    "sat": [0.04, 0.30, 1.14]

Step 5: Softmax (each row sums to 1.0)
    "The": [0.54, 0.25, 0.21] -- "The" mostly attends to itself
    "cat": [0.25, 0.52, 0.23] -- "cat" mostly attends to itself
    "sat": [0.19, 0.25, 0.57] -- "sat" mostly attends to itself
    (A decoder-only model would also apply a causal mask so tokens cannot attend to later positions; it is omitted here for simplicity.)

Step 6: Multiply attention weights x V --> output
    Each token's output is a weighted combination of all Value vectors.

Q = Query, K = Key, V = Value
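The scoring steps above can be executed directly. The raw scores are those from Step 3; the 4-dimensional Value vectors are invented, since the walk-through does not list V's contents.

```python
import numpy as np

# Raw Q.K^T scores for "The", "cat", "sat" (Step 3)
scores = np.array([[8.2, 2.1, 0.5],
                   [1.8, 7.6, 1.2],
                   [0.3, 2.4, 9.1]])

d_k = 64
scaled = scores / np.sqrt(d_k)          # Step 4: divide by sqrt(64) = 8

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

weights = softmax(scaled)               # Step 5: each row sums to 1.0
# rows ~ [0.54, 0.25, 0.21], [0.25, 0.52, 0.23], [0.19, 0.25, 0.57]

# Step 6: output = weights @ V (toy 4-dim Value vectors, invented for the demo)
V = np.array([[0.1, 0.2, 0.3, 0.4],
              [0.5, 0.6, 0.7, 0.8],
              [0.9, 1.0, 1.1, 1.2]])
output = weights @ V                    # each row: weighted mix of all Value vectors

print(np.round(weights, 2))
```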

From Turing to LLMs and Beyond -- Illustration Series
