~90s Visual Explainer

Tokenization & Embeddings

How text becomes vectors — the pipeline from raw characters to dense numerical representations that LLMs understand.

[Visual: the full pipeline — raw text → tokens → token IDs → embedding vectors → semantic space. Each stage is explained in the sections below.]

Starting Point: Raw Text

Neural networks work with numbers, not characters. The string "The cat sat on the mat" is just bytes — the model can't process it directly.

We need to convert text to numerical representations.

  • Models need numerical input
  • Text is just character sequences
  • Conversion required before processing
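At its most basic, "text is just bytes" means we can already get numbers out of a string with no tokenizer at all. A minimal sketch — real tokenizers build on exactly this idea:

```python
# The most naive "numericalization": raw UTF-8 byte values.
text = "The cat sat on the mat"
byte_ids = list(text.encode("utf-8"))

print(byte_ids[:7])  # → [84, 104, 101, 32, 99, 97, 116]
```

Byte-level representations are lossless but long — one number per byte — which is why tokenizers group common byte sequences into larger units.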

Step 1: Split into Tokens

A tokenizer splits text into tokens — subwords, words, or characters. Most LLMs use Byte-Pair Encoding (BPE), which learns common subword patterns.

Tokenizers often mark leading spaces with a special symbol (Ġ in GPT-2, ▁ in SentencePiece) — representation varies by tokenizer.

  • Tokens are the atomic units
  • BPE: common substrings become tokens
  • Vocabulary: ~50,000 tokens typical

Step 2: Tokens to IDs

Each token maps to an integer ID via the vocabulary. "The" might be token 464, "_cat" might be 3797.

If a whole word isn't in the vocabulary (rare words, typos, new terms), the tokenizer falls back to smaller known subwords — ultimately single bytes, so nothing is ever truly unrepresentable.

  • Vocabulary is fixed at training time
  • Each token has unique ID
  • Unknown words → subword fallback
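The mapping itself is just a dictionary lookup. A sketch using the IDs from the text (464, 3797, … are GPT-2's actual IDs for these tokens; the extra entries and the greedy longest-prefix fallback are illustrative assumptions — real BPE applies its learned merge rules instead):

```python
# Hypothetical mini-vocabulary; real LLM vocabularies hold ~50K entries.
vocab = {"The": 464, "_cat": 3797, "_sat": 3002, "_on": 319,
         "_the": 262, "_mat": 2917, "at": 265}

def to_ids(tokens, vocab):
    ids = []
    for tok in tokens:
        if tok in vocab:
            ids.append(vocab[tok])
            continue
        # Subword fallback: greedily take the longest known prefix.
        while tok:
            for end in range(len(tok), 0, -1):
                if tok[:end] in vocab:
                    ids.append(vocab[tok[:end]])
                    tok = tok[end:]
                    break
            else:
                raise ValueError(f"untokenizable piece: {tok!r}")
    return ids

print(to_ids(["The", "_cat", "_sat", "_on", "_the", "_mat"], vocab))
# → [464, 3797, 3002, 319, 262, 2917]

print(to_ids(["_matat"], vocab))  # unknown "word" splits into known pieces
# → [2917, 265]
```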

Step 3: IDs to Vectors

An embedding table maps each token ID to a dense vector. These vectors are learned during training — the model discovers what each token "means."

Dimensions vary by model size: GPT-2 uses 768, while the largest GPT-3 uses 12,288.

  • Embedding = learned vector per token
  • High-dimensional (hundreds to thousands)
  • Captures semantic meaning
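The embedding step is a lookup, not a computation: row *i* of the table is token *i*'s vector. A minimal sketch with a deliberately tiny table (real tables are huge — GPT-2's is 50,257 × 768 — and their values are learned, not random):

```python
import random

random.seed(0)
VOCAB_SIZE, DIM = 4000, 8  # toy sizes so the sketch runs instantly

# The embedding table is a (vocab_size x dim) matrix of learned
# parameters; here we initialize it randomly, as training would at step 0.
table = [[random.gauss(0.0, 0.02) for _ in range(DIM)]
         for _ in range(VOCAB_SIZE)]

def embed(token_ids):
    # Pure lookup: each ID selects one row of the table.
    return [table[i] for i in token_ids]

vectors = embed([464, 3797, 3002])
print(len(vectors), len(vectors[0]))  # → 3 8
```

During training, gradients flow back into exactly the rows that were looked up, which is how the model "discovers" what each token means.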

Semantic Meaning in Vectors

In embedding space, semantically similar words sit close together. The classic example (popularized by word2vec-style word embeddings): king - man + woman ≈ queen. Vector arithmetic captures relationships as directions.

This is how LLMs "understand" language — through geometric relationships between embeddings.

  • Similar meaning → similar vectors
  • Relationships captured as directions
  • Foundation of semantic understanding
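The analogy can be checked with vector arithmetic plus cosine similarity. These 3-dimensional vectors are hand-made so the analogy works exactly; real embeddings are learned, high-dimensional, and only approximately behave this way:

```python
import math

# Toy vectors chosen by hand for illustration (not learned embeddings).
vec = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.9, 0.0, 0.9],
}

def cosine(a, b):
    # Cosine similarity: dot product over the product of magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman, computed component-wise
target = [k - m + w for k, m, w in zip(vec["king"], vec["man"], vec["woman"])]

# Which word's vector is closest to the result?
best = max(vec, key=lambda w: cosine(vec[w], target))
print(best)  # → queen
```

Cosine similarity is the standard closeness measure here because it compares directions, ignoring vector length.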

What's Next?

Now that you understand tokenization and embeddings, explore related patterns: Attention Mechanism for how transformers use these embeddings, Backpropagation for how embeddings are learned, and What is a Model for the broader context.