Tokenization & Embeddings
How text becomes vectors — the pipeline from raw characters to dense numerical representations that LLMs understand.
Starting Point: Raw Text
Neural networks work with numbers, not characters. The string "The cat sat on the mat" is just bytes — the model can't process it directly.
We need to convert text to numerical representations.
- Models need numerical input
- Text is just character sequences
- Conversion required before processing
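To see the starting point concretely, here is what the raw bytes of the example string look like — note that these byte values carry no meaning the model can use directly:

```python
text = "The cat sat on the mat"

# UTF-8 encoding yields one integer per byte — just character codes,
# not meaningful units for a model.
raw_bytes = list(text.encode("utf-8"))
print(raw_bytes[:6])   # → [84, 104, 101, 32, 99, 97]  ("T", "h", "e", " ", "c", "a")
```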
Step 1: Split into Tokens
A tokenizer splits text into tokens — subwords, words, or characters. Most LLMs use Byte-Pair Encoding (BPE), which learns common subword patterns.
Tokenizers often mark leading spaces with a special symbol (Ġ in GPT-2, ▁ in SentencePiece) — representation varies by tokenizer.
- Tokens are the atomic units
- BPE: common substrings become tokens
- Vocabulary: ~50,000 tokens typical
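The core of BPE can be sketched in a few lines: start from individual characters and repeatedly merge the most frequent adjacent pair into a new token. This is a minimal illustration of the merge step, not a production tokenizer (real BPE trains on a large corpus and records the merge rules for reuse):

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count adjacent symbol pairs across the token sequence.
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    # Fuse every occurrence of `pair` into a single token.
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from characters; each round merges the most frequent pair.
tokens = list("the cat sat on the mat")
for _ in range(3):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))

print(tokens)  # after 3 merges, "at" and "the" have become single tokens
```

Run on a real corpus for tens of thousands of merges, this process is what produces the ~50,000-token vocabularies mentioned above.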
Step 2: Tokens to IDs
Each token maps to an integer ID via the vocabulary. In GPT-2's vocabulary, for example, "The" might be token 464 and "Ġcat" (cat with a leading space) might be 3797.
If a word isn't in the vocabulary — a rare word, a typo, or a new coinage — it's split into smaller tokens that are.
- Vocabulary is fixed at training time
- Each token has unique ID
- Unknown words → subword fallback
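The ID-mapping and subword-fallback behavior can be sketched with a greedy longest-match encoder. The vocabulary and IDs here are hypothetical toy values, not from any real tokenizer:

```python
# Toy vocabulary (hypothetical IDs; real vocabularies hold ~50,000 entries).
vocab = {"The": 0, " cat": 1, " sat": 2, " un": 3, "happi": 4, "ness": 5}

def encode(text, vocab):
    """Greedy longest-match: at each position, take the longest vocab entry."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try longest substring first
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

print(encode("The cat sat", vocab))    # → [0, 1, 2]
print(encode(" unhappiness", vocab))   # → [3, 4, 5]  (unknown word → subword fallback)
```

Note how " unhappiness", absent from the vocabulary as a whole word, still encodes cleanly as three known subword pieces.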
Step 3: IDs to Vectors
An embedding table maps each token ID to a dense vector. These vectors are learned during training — the model discovers what each token "means."
Dimensions: GPT-2 uses 768, GPT-3 uses up to 12,288.
- Embedding = learned vector per token
- High-dimensional (hundreds to thousands)
- Captures semantic meaning
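Mechanically, the embedding step is just a table lookup: row *i* of a learned matrix is token *i*'s vector. A minimal sketch with toy sizes (randomly initialized here; training is what gives the rows meaning):

```python
import random

vocab_size, d_model = 1000, 8   # toy sizes; GPT-2 uses 50,257 x 768
random.seed(0)

# The embedding table: one vector per token ID. Real models learn these
# values during training; here they are random placeholders.
embedding_table = [[random.gauss(0, 0.02) for _ in range(d_model)]
                   for _ in range(vocab_size)]

def embed(token_ids):
    # A lookup, not a computation: row i is token i's vector.
    return [embedding_table[i] for i in token_ids]

vectors = embed([464, 379])        # arbitrary toy token IDs
print(len(vectors), len(vectors[0]))   # → 2 8
```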
Semantic Meaning in Vectors
In embedding space, semantically similar words are close together. The classic example: king - man + woman ≈ queen. Vector arithmetic captures relationships.
This is how LLMs "understand" language — through geometric relationships between embeddings.
- Similar meaning → similar vectors
- Relationships captured as directions
- Foundation of semantic understanding
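The king/queen analogy can be demonstrated with hand-made toy vectors. These 3-dimensional embeddings are contrived so that one axis encodes a consistent "gender" direction; real learned embeddings have hundreds of dimensions and the relationship is approximate:

```python
import math

# Contrived 3-d toy embeddings; real embeddings are learned, not hand-set.
emb = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.8, 0.9],
    "man":   [0.1, 0.2, 0.1],
    "woman": [0.1, 0.2, 0.9],
}

def add(a, b): return [x + y for x, y in zip(a, b)]
def sub(a, b): return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    # Cosine similarity: 1.0 means the vectors point the same way.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# king - man + woman should land nearest to queen.
target = add(sub(emb["king"], emb["man"]), emb["woman"])
nearest = max(emb, key=lambda w: cosine(emb[w], target))
print(nearest)   # → queen
```

The arithmetic works because the man→woman and king→queen offsets are the same direction — relationships are encoded as directions in the space, which is the point of the bullets above.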