Tokenization & Embeddings

How text becomes vectors — the pipeline from raw characters to dense numerical representations that LLMs understand.

Read this as
How does raw text become model input geometry?
Failure Trap
Assuming words map one-to-one to tokens or that embeddings preserve obvious human categories.
Decision Rule
Inspect tokenization before estimating cost or context, and inspect embeddings before trusting retrieval behavior.
Figure: From text to vectors, the tokenization and embedding pipeline. Five steps showing how the sentence "The cat sat" becomes numbers an LLM can use: raw text, splitting into tokens, mapping tokens to integer IDs, looking up a dense vector for each ID, and finally how those vectors place similar words near each other so that king minus man plus woman lands near queen.

Starting Point: Raw Text

Neural networks operate on numbers, not characters. To a computer, the string "The cat sat" is just a sequence of bytes, and those raw byte values say nothing about words or meaning, so the model can't use them directly.

We need to convert text to numerical representations.

  • Models need numerical input
  • Text is just character sequences
  • Conversion required before processing
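
To make "just bytes" concrete, the short sketch below prints the raw integers behind the example sentence. It uses only built-in Python; the variable names are illustrative. These values identify characters, not meaning, which is why the later steps are needed.

```python
text = "The cat sat"

# UTF-8 encoding: one integer in 0..255 per byte.
byte_values = list(text.encode("utf-8"))

# Unicode code points: one integer per character.
code_points = [ord(ch) for ch in text]

print(byte_values)
print(code_points)
```

For plain ASCII text like this the two lists are identical; they diverge as soon as the text contains characters that need more than one byte.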

Step 1: Split into Tokens

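A tokenizer splits text into tokens, which may be subwords, whole words, or single characters. Many LLMs use Byte-Pair Encoding (BPE) or a close variant, which builds its vocabulary by repeatedly merging the most frequent adjacent pair of symbols in the training text, so common substrings end up as single tokens.

Many tokenizers mark a leading space with a special symbol (Ġ in GPT-2, ▁ in SentencePiece; shown here as ·); the exact glyph varies by tokenizer. Our 3-word sentence becomes 3 tokens.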

  • Tokens are the atomic units
  • BPE: common substrings become tokens
  • Vocabulary: ~50,000 tokens in GPT-2; newer models often use 100,000 or more
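
The idea that common substrings become tokens can be sketched with a toy version of BPE training. This is a simplified character-level illustration, not the implementation used by any particular tokenizer: the function names `merge_pair` and `learn_bpe_merges` and the tiny corpus are assumptions made for the example, and real tokenizers add byte-level handling, pre-tokenization, and far larger corpora.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy sketch)."""
    # Each distinct word starts as a tuple of single characters, with a count.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair merged.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

corpus_words = ["cat"] * 3 + ["sat"] * 2 + ["mat"]
merges = learn_bpe_merges(corpus_words, num_merges=3)
print(merges)
```

With this toy corpus the first merge learned is ('a', 't'), because "at" appears in every word; the frequent word "cat" then becomes a single symbol on the next merge.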

Step 2: Tokens to IDs

Each token maps to an integer ID via the vocabulary. "The" might be token 464, "·cat" might be 3797.

If a word isn't a single entry in the vocabulary (common for rare words, names, and typos), it's split into smaller known tokens.

  • Vocabulary is fixed at training time
  • Each token has unique ID
  • Unknown words → subword fallback
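
The lookup and the subword fallback can be sketched as follows. This is a minimal illustration that uses greedy longest-match splitting, which is simpler than replaying learned BPE merges and is named here plainly as a stand-in. `TOY_VOCAB` and `encode` are hypothetical: the IDs for "The" and "·cat" are borrowed from the example in the text, and every other entry is made up.

```python
# Hypothetical toy vocabulary. "·" marks a leading space, as in the text.
TOY_VOCAB = {
    "The": 464,
    "·cat": 3797,
    # Made-up IDs for small fallback pieces:
    "·": 0, "c": 1, "a": 2, "t": 3, "s": 4, "at": 5,
}

def encode(pieces, vocab):
    """Map each piece to IDs, splitting pieces that are not in the vocabulary."""
    ids = []
    for piece in pieces:
        start = 0
        while start < len(piece):
            # Try the longest remaining substring first, then shrink.
            for end in range(len(piece), start, -1):
                candidate = piece[start:end]
                if candidate in vocab:
                    ids.append(vocab[candidate])
                    start = end
                    break
            else:
                raise KeyError(f"no token covers {piece[start]!r}")
    return ids

print(encode(["The", "·cat"], TOY_VOCAB))   # both pieces are in the vocabulary
print(encode(["·cats"], TOY_VOCAB))         # falls back to "·cat" + "s"
print(encode(["·sat"], TOY_VOCAB))          # falls back to "·" + "s" + "at"
```

The fallback is why a fixed vocabulary can still encode unfamiliar words, and also why word count is a poor proxy for token count.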

Step 3: IDs to Vectors

An embedding table maps each token ID to a dense vector. These vectors are learned during training — the model discovers what each token "means."

Dimensions: the smallest GPT-2 model uses 768; the largest GPT-3 model uses 12,288.

  • Embedding = learned vector per token
  • High-dimensional (hundreds to thousands)
  • Captures semantic meaning
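
An embedding table is, mechanically, a list of rows indexed by token ID. The sketch below uses tiny made-up sizes and random values in place of learned ones; `VOCAB_SIZE`, `EMBED_DIM`, and `embed` are illustrative names, not part of any library.

```python
import random

VOCAB_SIZE = 8   # toy value; GPT-2's vocabulary is roughly 50,000
EMBED_DIM = 4    # toy value; the text cites 768 for GPT-2

rng = random.Random(0)

# One row per token ID. In a real model these numbers are learned in
# training; here they are random placeholders.
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]

def embed(token_ids, table):
    """Look up one vector per token ID."""
    return [table[token_id] for token_id in token_ids]

vectors = embed([2, 5, 2], embedding_table)
print(len(vectors), "vectors of length", len(vectors[0]))
```

The same ID always returns the same row, so at this stage a repeated token gets an identical vector wherever it appears in the input.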

Semantic Meaning in Vectors

In embedding space, semantically similar words tend to lie close together (cat and dog sit near each other). The classic example, from word-embedding models such as word2vec, is king - man + woman ≈ queen. The arrow from man to king has roughly the same length and direction as the arrow from woman to queen, and that shared offset is the "royalty" relationship. The match is approximate, and not every human category shows up this cleanly.

This geometry is the foundation of how LLMs "understand" language: meaning is represented through relationships between vectors.

  • Similar meaning → similar vectors
  • Relationships captured as directions
  • Foundation of semantic understanding
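
Both ideas, closeness and direction, can be checked with a little arithmetic. The vectors below are hand-picked for illustration (three made-up axes: royalty, femaleness, animal-ness) and are not taken from any trained model; `cosine` and `analogy` are hypothetical helper names. Excluding the three input words from the candidates is a common convention in analogy tests and is assumed here.

```python
import math

# Hand-picked toy vectors: [royalty, femaleness, animal-ness].
TOY_EMBEDDINGS = {
    "king":  [0.9, 0.1, 0.0],
    "queen": [0.9, 0.9, 0.0],
    "man":   [0.1, 0.1, 0.0],
    "woman": [0.1, 0.9, 0.0],
    "cat":   [0.0, 0.5, 0.9],
    "dog":   [0.0, 0.4, 0.9],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, embeddings):
    """Return the word nearest to a - b + c, excluding a, b, and c."""
    target = [x - y + z for x, y, z in
              zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, embeddings[w]))

cat_dog = cosine(TOY_EMBEDDINGS["cat"], TOY_EMBEDDINGS["dog"])
cat_king = cosine(TOY_EMBEDDINGS["cat"], TOY_EMBEDDINGS["king"])
print(cat_dog > cat_king)
print(analogy("king", "man", "woman", TOY_EMBEDDINGS))
```

Because these vectors were built so the offsets line up exactly, the analogy resolves to "queen"; with embeddings from a trained model the result is only approximate and is worth inspecting before relying on it.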