Tokenization & Embeddings

Read this as: How does raw text become model input geometry?

Failure Trap: Assuming words map one-to-one to tokens or that embeddings preserve obvious human categories.
Decision Rule: Inspect tokenization before estimating cost or context, and inspect embeddings before trusting retrieval behavior.


Starting Point: Raw Text

Neural networks work with numbers, not characters. The string "The cat sat" is just bytes — the model can't process it directly.

We need to convert text to numerical representations.

Models need numerical input
Text is just character sequences
Conversion required before processing
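As a minimal sketch of the starting point, the snippet below shows what "just bytes" looks like for the example sentence. The variable names are chosen for illustration. The byte values are numbers, but they only identify characters and say nothing about words or meaning, which is why the later steps are needed.

```python
text = "The cat sat"

# Encode the string as UTF-8: one integer in 0..255 per byte.
raw_bytes = list(text.encode("utf-8"))

print(raw_bytes)
print(len(text), "characters ->", len(raw_bytes), "bytes")
```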

Step 1: Split into Tokens

A tokenizer splits text into tokens: subwords, words, or characters. Many LLMs use Byte-Pair Encoding (BPE) or a close variant, which builds its vocabulary by repeatedly merging the most frequent adjacent symbol pairs in the training text.

Many tokenizers fold a leading space into the token that follows it and mark it with a special symbol (Ġ in GPT-2, ▁ in SentencePiece; shown here as ·); the exact glyph varies by tokenizer. Our 3-word sentence becomes 3 tokens: "The", "·cat", "·sat".

Tokens are the atomic units
BPE: common substrings become tokens
Vocabulary: ~50,000 tokens in GPT-2; newer models often use 100,000 or more
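The merge idea behind BPE can be sketched in a few lines. This is a toy version under stated assumptions: it works on a tiny hand-picked word list, ignores spaces and byte-level details, and breaks ties by first occurrence. The function name learn_bpe_merges and the sample words are hypothetical, not taken from any real tokenizer.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Represent each distinct word as a tuple of symbols, with its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair (on ties, the first one counted wins).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


sample_words = ["low", "low", "lower", "lowest", "newest", "newest"]
merges, corpus = learn_bpe_merges(sample_words, 3)
print(merges)
print(dict(corpus))
```

After a few merges the frequent substring "low" has become a single symbol, while rarer words remain sequences of smaller pieces. Production tokenizers run tens of thousands of merges over far larger corpora.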

Step 2: Tokens to IDs

Each token maps to an integer ID via the vocabulary. "The" might be token 464, "·cat" might be 3797.

If a word isn't in the vocabulary as a single token (typical for rare words, names, and typos), it's split into smaller known tokens, so one word can cost several tokens.

Vocabulary is fixed at training time
Each token has a unique ID
Unknown words → subword fallback
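The lookup and the subword fallback can be sketched with a toy vocabulary. This is a simplification: it uses greedy longest-match rather than the learned merge order a real BPE tokenizer applies, and the VOCAB table and encode function are hypothetical. The IDs for "The" and "·cat" follow the illustrative values in the text; the others are invented.

```python
# Hypothetical toy vocabulary. 464 and 3797 follow the text's illustrative
# values; 9001 and 9002 are invented for this sketch.
VOCAB = {
    "The": 464,
    "·cat": 3797,
    "·sat": 9001,
    "nap": 9002,
}


def encode(text, vocab):
    """Greedy longest-match encoding (a simplification of real BPE)."""
    # Mark spaces with "·" so a leading space travels with the next word.
    marked = text.replace(" ", "·")
    tokens, ids = [], []
    i = 0
    while i < len(marked):
        # Try the longest candidate first, shrinking until the vocab has it.
        for j in range(len(marked), i, -1):
            piece = marked[i:j]
            if piece in vocab:
                tokens.append(piece)
                ids.append(vocab[piece])
                i = j
                break
        else:
            # A real tokenizer avoids this case by keeping every single
            # byte or character in its vocabulary.
            raise ValueError(f"no token covers {marked[i]!r}")
    return tokens, ids


print(encode("The cat sat", VOCAB))
print(encode("The catnap", VOCAB))
```

In this sketch "catnap" has no token of its own, so it falls back to "·cat" plus "nap": two words, three tokens.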

Step 3: IDs to Vectors

An embedding table maps each token ID to a dense vector. These vectors are learned during training — the model discovers what each token "means."

Dimensions: the smallest GPT-2 model uses 768; the largest GPT-3 model uses 12,288.

Embedding = learned vector per token
High-dimensional (hundreds to thousands)
Captures semantic meaning
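An embedding lookup is plain row indexing into a table. The sketch below assumes toy sizes (VOCAB_SIZE and EMBED_DIM are hypothetical values chosen to keep the output readable) and leaves the table at its random initialisation; in a real model, training adjusts these numbers.

```python
import random

VOCAB_SIZE = 10  # toy value; real vocabularies hold tens of thousands of tokens
EMBED_DIM = 4    # toy value; real models use hundreds to thousands

rng = random.Random(0)

# One row per token ID. Training would adjust these values; here they
# stay at their random initialisation.
embedding_table = [
    [rng.gauss(0.0, 0.02) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids):
    """Return one vector per token ID by indexing rows of the table."""
    return [embedding_table[token_id] for token_id in token_ids]


vectors = embed([3, 7, 3])
print(len(vectors), "vectors of dimension", len(vectors[0]))
```

The same ID always yields the same vector, so the first and third vectors here are identical. At this stage the vector does not depend on the surrounding tokens.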

Semantic Meaning in Vectors

In embedding space, semantically similar words tend to lie close together (cat and dog sit near each other). The classic example, from word2vec-style word embeddings: king - man + woman ≈ queen. The arrow from man to king has roughly the same length and direction as the arrow from woman to queen; that near-parallel offset is the "royalty" relationship. The match is approximate and does not hold for every pair of words.

These geometric relationships between embeddings are the starting point for how LLMs "understand" language; later layers refine each vector using its context.

Similar meaning → similar vectors
Relationships captured as directions
Foundation of semantic understanding
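The geometry can be checked with cosine similarity. The vectors below are hand-written 3-dimensional toys, not values from any trained model, so the clean result is built in by construction. With real embeddings the analogy only holds approximately, and it is common practice to exclude the input words from the search, as this sketch does.

```python
import math

# Hand-written toy vectors (hypothetical, not from a trained model).
# Axes are loosely: royalty, maleness, animal-ness.
VECTORS = {
    "king": [0.9, 0.8, 0.0],
    "queen": [0.9, 0.1, 0.0],
    "man": [0.1, 0.8, 0.0],
    "woman": [0.1, 0.1, 0.0],
    "cat": [0.0, 0.4, 0.9],
    "dog": [0.0, 0.5, 0.9],
}


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(target, exclude=()):
    """Return the word whose vector points most nearly the same way as target."""
    candidates = [word for word in VECTORS if word not in exclude]
    return max(candidates, key=lambda word: cosine(VECTORS[word], target))


# king - man + woman, computed component by component.
analogy = [
    k - m + w
    for k, m, w in zip(VECTORS["king"], VECTORS["man"], VECTORS["woman"])
]

print(nearest(analogy, exclude=("king", "man", "woman")))
print(cosine(VECTORS["cat"], VECTORS["dog"]))
print(cosine(VECTORS["cat"], VECTORS["king"]))
```

Running the same check against a real model's embeddings before relying on them for retrieval is a practical way to follow the decision rule above: human categories that look obvious may or may not show up as nearby vectors.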