
Tokenizer Deep Dive

How "understanding" becomes [under][stand][ing] — and why it matters.

Interactive companion to Issue 7: Attention Is All You Need

1. Your Text

Type anything below, or pick a pre-loaded example. The tokenizer will break your text into tokens using Byte-Pair Encoding (BPE) — the same algorithm behind GPT, Claude, and other LLMs.
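To make the idea concrete before the interactive demo: once a vocabulary exists, one simple way to segment text is greedy longest-match against that vocabulary. This is a toy sketch, not the actual GPT/Claude tokenizer (real BPE applies learned merge rules), and the vocabulary below is hand-picked for illustration:

```python
# Toy greedy longest-match tokenizer over a hand-picked vocabulary.
# Real BPE applies learned merge rules instead, but the resulting
# segmentation looks similar.
VOCAB = {"under", "stand", "ing", "un", "der"}

def tokenize(text):
    tokens = []
    i = 0
    while i < len(text):
        # take the longest vocabulary entry that matches at position i
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # no match: fall back to a single character
            i += 1
    return tokens

print(tokenize("understanding"))  # → ['under', 'stand', 'ing']
```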

2. BPE Vocabulary Builder

Watch Byte-Pair Encoding build a vocabulary from scratch. It starts with individual characters, then repeatedly merges the most frequent adjacent pair into a new token.
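The merge loop the demo animates can be sketched in a few lines. This is a minimal trainer, not a production implementation (real tokenizers work on word-frequency tables and pre-tokenized byte sequences), and the sample text is illustrative:

```python
from collections import Counter

def bpe_train(text, num_merges):
    """Minimal BPE trainer: start from characters, then repeatedly
    merge the most frequent adjacent pair into a new token."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats; no merge is worth learning
        merges.append(a + b)
        # replace every occurrence of the pair with the merged token
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = bpe_train("low lower lowest", 4)
print(merges)  # early merges build up "lo", then "low"
```

Note how the first merges capture the shared stem: frequent substrings become single tokens, which is exactly why common English words end up as one token each in real vocabularies.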



3. Token Visualization

Your text after BPE tokenization. Hover over any token to see its ID and frequency. Each unique token gets a distinct color.


4. Vocabulary Size Comparison

The same text tokenized with different vocabulary sizes. A bigger vocabulary means fewer tokens — but the vocabulary table itself takes more memory.

Key insight: Larger vocabulary = fewer tokens = faster processing. But the vocabulary itself takes more memory! Modern LLMs use 30K–100K token vocabularies as a sweet spot.
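The memory side of the tradeoff is simple arithmetic: the token-embedding matrix alone is vocab_size × d_model parameters. A quick sketch, assuming fp32 weights and the 768-dimension size used in the embedding example later on (real models vary in both dimension and precision):

```python
def embedding_table_mb(vocab_size, d_model=768, bytes_per_param=4):
    """Memory for the token-embedding matrix alone, in MB (fp32).
    d_model=768 is an assumption for illustration; real models vary."""
    return vocab_size * d_model * bytes_per_param / 1024**2

for v in (10_000, 50_000, 100_000):
    print(f"{v:>7}-token vocab -> {embedding_table_mb(v):6.1f} MB")
# a 100K vocabulary costs ~10x the embedding memory of a 10K one
```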

5. Language & Type Comparison

Different kinds of text tokenize very differently. English prose is "cheap" because LLMs train mostly on English. Code, emoji, and other languages cost more tokens for the same meaning.

Real-world impact: Sending code to an LLM uses up your context window faster than plain English! Japanese and Chinese text can cost 2–4x more tokens per word because these writing systems are underrepresented in training data.
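Training data isn't the only factor. Byte-level BPE (used by GPT-2 and its descendants) starts from UTF-8 bytes, and CJK characters each occupy 3 bytes in UTF-8, so these scripts start with a higher byte count before any merges are learned. A quick check (the sample strings are just illustrative greetings):

```python
# Byte-level BPE starts from UTF-8 bytes, so a script's byte cost
# sets a floor on its token cost before any merges are learned.
samples = {
    "English":  "hello",       # 5 characters
    "Japanese": "こんにちは",   # 5 characters ("hello")
    "Chinese":  "你好",         # 2 characters ("hello")
}
for lang, text in samples.items():
    print(f"{lang}: {len(text)} chars -> {len(text.encode('utf-8'))} bytes")
```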

6. Token → Number Mapping

Every token maps to an integer ID, which then maps to an embedding vector — a point in high-dimensional space. Similar words end up near each other.

Token | ID | Embedding (768-dim, first 20 shown)
This is how "the" becomes [0.23, -0.15, 0.87, ...] — a point in 768-dimensional space. Words with similar meanings (cat, kitten, dog) end up as nearby points, enabling the model to understand relationships.
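The two-step lookup (token → ID → vector) is just table indexing. A minimal sketch with a toy three-word vocabulary and random stand-ins for the trained embedding weights (both are illustrative assumptions):

```python
import random

random.seed(0)
vocab = {"the": 0, "cat": 1, "kitten": 2}   # toy token -> ID table
d_model = 768
# one vector per token ID; random stand-ins for learned weights
embeddings = [[random.uniform(-1, 1) for _ in range(d_model)]
              for _ in vocab]

token_id = vocab["the"]        # "the" -> 0
vector = embeddings[token_id]  # 0 -> a point in 768-dim space
print(token_id, [round(x, 2) for x in vector[:3]], "...")
```

In a trained model these vectors are learned parameters, and that training is what pushes "cat" and "kitten" close together.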

7. How Much Does Your Text Cost?

LLM APIs charge per token. Here is what processing your text would cost across different models.

A novel (~100,000 words) is roughly 130,000 tokens — about $0.40 to process with GPT-4o, or $1.95 with Claude Opus.
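The arithmetic behind that estimate: a tokens-per-word ratio times a per-token rate. The rates below are illustrative assumptions (USD per million input tokens, chosen to roughly match the figures quoted above; check current pricing pages for real numbers):

```python
# Illustrative per-1M-input-token rates (assumptions, not live prices)
PRICE_PER_MTOK = {"GPT-4o": 3.00, "Claude Opus": 15.00}

words = 100_000            # a typical novel
tokens = int(words * 1.3)  # ~1.3 tokens per English word is a common rule of thumb
for model, rate in PRICE_PER_MTOK.items():
    print(f"{model}: {tokens:,} tokens -> ${tokens * rate / 1e6:.2f}")
```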