
The Embedding Galaxy

How words become numbers — and why ‘king - man + woman = queen’

Illustration companion to Issue 7: Attention Is All You Need

1

Words as Points in Space

In an embedding space, every word is a point. Words with similar meanings cluster together — like stars forming constellations. This is a 2D projection of a high-dimensional space where proximity means similarity.

[Figure: a 2D scatter of word embeddings. Related words form tight clusters: Animals (cat, dog, fish, bird, horse, lion, tiger, whale, mouse, rabbit), Countries (France, Germany, Japan, Brazil, India, China, USA, Italy, Spain), Professions (doctor, teacher, engineer, artist, lawyer, chef, pilot, nurse), Emotions (happy, sad, angry, joy, scared, calm, love, excited, hate), Food (pizza, pasta, sushi, bread, rice, apple, banana, chocolate), and Technology (computer, software, internet, code, algorithm, neural, machine), with vast empty space between unrelated concepts.]
Words that appear in similar contexts end up near each other in the embedding space. The model was never told that “cat” and “dog” are both animals — it discovered this structure from patterns in text.
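A toy sketch of this clustering, using tiny hand-made 4-dimensional vectors rather than learned ones (real embeddings have hundreds of dimensions and are trained from text; these vectors and words are purely illustrative):

```python
import math

# Hand-made toy "embeddings": the first two dimensions loosely track
# animal-ness, the last two food-ness. Learned embeddings would not be
# this tidy, but the clustering effect is the same.
embeddings = {
    "cat":   [0.9, 0.8, 0.1, 0.0],
    "dog":   [0.8, 0.9, 0.2, 0.1],
    "pizza": [0.1, 0.0, 0.9, 0.8],
    "pasta": [0.0, 0.1, 0.8, 0.9],
}

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(word):
    # Rank every other word by similarity to `word`; return the closest.
    others = [w for w in embeddings if w != word]
    return max(others, key=lambda w: cosine(embeddings[word], embeddings[w]))

print(nearest("cat"))    # "dog": the animal cluster wins
print(nearest("pizza"))  # "pasta": the food cluster wins
```

Nothing in the code says "cat and dog are animals"; the grouping falls out of the vectors alone, which is the point of the figure above.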
2

Vector Arithmetic

The most stunning property of embeddings: you can do math with meanings. Subtract one concept, add another, and the nearest word to the result is often the right answer — because directions in the space encode relationships.

The famous example: king - man + woman = queen. Subtracting "man" from "king" isolates a royalty concept; adding "woman" lands near "queen" (cosine similarity: 0.86). More examples follow the same pattern: Paris - France + Italy = Rome (the capital-of relationship), and bigger - big + small = smaller (the comparative form).
The model was never taught these relationships. They emerged from reading billions of words. Directions in embedding space encode consistent semantic relationships — gender, geography, grammar, and more.
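The analogy trick can be sketched with toy vectors. Here dimension 0 loosely stands for "royalty", dimension 1 for "male", and dimension 2 for "female"; the vectors and the distractor word are hand-made for illustration, not learned:

```python
# Toy 3-dimensional vectors: dim 0 ~ royalty, dim 1 ~ male, dim 2 ~ female.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "pizza": [0.0, 0.2, 0.2],  # unrelated distractor
}

def add(a, b): return [x + y for x, y in zip(a, b)]
def sub(a, b): return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

# king - man + woman: remove the "male" direction, add the "female" one.
target = add(sub(vectors["king"], vectors["man"]), vectors["woman"])

# Find the closest word to the result, excluding the input words
# (the standard convention in word-analogy evaluations).
candidates = [w for w in vectors if w not in ("king", "man", "woman")]
best = max(candidates, key=lambda w: cosine(target, vectors[w]))
print(best)  # queen
```

The subtraction cancels the "male" component while keeping "royalty", so adding "woman" steers the result toward "queen" rather than the distractor.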
3

How a Word Becomes a Vector

Every word passes through the same pipeline: from text to token ID to a dense vector of numbers. Here is the journey of the word “cat”.

Step 1: "cat" — the raw word in the input text.
Step 2: Token ID 9246 — look up "cat" in the vocabulary table.
Step 3: Embedding lookup — the table has 50,000 rows (one per token) and 768 columns; row 9246 holds the 768 values for "cat".
Step 4: The vector — 768 numbers, e.g. [0.23, -0.15, 0.87, 0.04, -0.62, ...].
Step 5: What the dimensions mean (conceptual) — some dimensions encode "animal-ness", some encode "size", some encode "domesticated vs wild". But most dimensions don't have clean human interpretations; they capture subtle statistical patterns.
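The pipeline above is, at its core, a table lookup. A minimal sketch, shrunk to a hypothetical 5-token vocabulary with 4 dimensions so it prints nicely (real models use ~50,000 tokens by 768 dimensions, and the values are learned, not random):

```python
import random

random.seed(0)  # reproducible "weights" for this demo

# Hypothetical mini-vocabulary; real tokenizers map ~50,000 tokens.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
n_dims = 4  # stand-in for 768

# The embedding matrix: one row per token. Here filled with random
# numbers; during training these would be adjusted by gradient descent.
embedding_matrix = [
    [round(random.uniform(-1, 1), 2) for _ in range(n_dims)]
    for _ in vocab
]

def embed(word):
    token_id = vocab[word]             # Step 2: word -> token ID
    return embedding_matrix[token_id]  # Step 3: token ID -> row of the table

print(embed("cat"))  # Step 4: a list of n_dims numbers
```

Note there is no computation on the word itself: the word buys nothing but an index, and everything the model "knows" about "cat" sits in that one row.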
4

Similarity = Distance

How do we measure whether two words are similar? We compare their vectors. Cosine similarity measures the angle between two vectors: 1.0 means they point in the same direction (identical meaning), 0.0 means they are unrelated.

Comparing vectors (shown as heatmap bars over all 768 dimensions, on a scale from negative through zero to positive):

"cat" vs "kitten" — cosine similarity 0.89 (very similar; vectors close)
"cat" vs "dog" — cosine similarity 0.73 (related)
"cat" vs "computer" — cosine similarity 0.12 (unrelated; vectors far apart)
This is how search engines find relevant results — by finding vectors close to your query. When you type a question, it gets embedded into the same space, and the engine returns documents whose vectors are nearest.
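A toy version of that search loop, with hand-made 3-dimensional document vectors standing in for what a trained model would produce (the document titles and numbers are invented for illustration):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Hypothetical document vectors: in a real engine, each document is
# embedded by a trained model into the same space as queries.
doc_vectors = {
    "Adopting your first kitten": [0.9, 0.7, 0.0],
    "Dog training basics":        [0.7, 0.5, 0.2],
    "Intro to programming":       [0.0, 0.1, 0.9],
}

def search(query_vector, k=2):
    # Rank documents by cosine similarity to the query; return the top k.
    ranked = sorted(doc_vectors,
                    key=lambda d: cosine(query_vector, doc_vectors[d]),
                    reverse=True)
    return ranked[:k]

# A query like "cat care" would embed near the pet-related documents.
query = [0.85, 0.65, 0.05]
print(search(query))
```

Real systems scale this up with approximate nearest-neighbor indexes, but the ranking principle is exactly this: closest vectors first.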
5

The Embedding Matrix

All embeddings live in one giant table: 50,000 rows (one per token in the vocabulary) by 768 columns (one per dimension). That is 38.4 million parameters — just for the first layer.

The matrix: 50,000 rows (tokens) by 768 columns (dimensions). For example, "the" is token 1, "cat" is token 9246, and "quantum" is token 42157, with roughly 49,970 more rows besides. 50,000 x 768 = 38,400,000 parameters, just for the first layer of the Transformer. These numbers are LEARNED during training; nobody set them by hand. The model discovered what each dimension should mean.
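The arithmetic, plus a rough memory estimate (the 4-bytes-per-parameter figure assumes standard 32-bit floats; real deployments often use smaller formats):

```python
vocab_size = 50_000  # rows: one per token in the vocabulary
n_dims = 768         # columns: one per embedding dimension

params = vocab_size * n_dims
print(f"{params:,} parameters")  # 38,400,000 parameters

# At 4 bytes per 32-bit float, the embedding table alone is sizable.
bytes_fp32 = params * 4
print(f"{bytes_fp32 / 1e6:.1f} MB")  # 153.6 MB
```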
6

Beyond Words

The embedding trick is not limited to words. The same idea — turn anything into a vector — works for images, audio, and code. Every modality gets mapped into a similar high-dimensional space.

Every modality feeds into the embedding space as vectors:

Text: "The cat sat on" — words → tokens → vectors
Images: patches → vectors
Audio: spectrograms → vectors
Code: def hello(): print("hi") — tokens → vectors
This is why multimodal models (like those that understand both text and images) are possible: every modality is embedded into the same kind of space, so the model can relate them directly.