Tokenization & Embeddings
How text becomes vectors — the pipeline from raw characters to dense numerical representations that LLMs understand.
Starting Point: Raw Text
Neural networks work with numbers, not characters. The string "The cat sat on the mat" is just bytes — the model can't process it directly.
We need to convert text to numerical representations.
- Models need numerical input
- Text is just character sequences
- Conversion required before processing
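To see the starting point concretely, here is what the raw bytes of the example string look like — note that these byte values carry no meaning the model can use directly:

```python
text = "The cat sat on the mat"

# UTF-8 encoding yields one integer per byte — just character codes,
# not meaningful units for a model.
raw_bytes = list(text.encode("utf-8"))
print(raw_bytes[:6])   # → [84, 104, 101, 32, 99, 97]  ("T", "h", "e", " ", "c", "a")
```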
Step 1: Split into Tokens
A tokenizer splits text into tokens — subwords, words, or characters. Most LLMs use Byte-Pair Encoding (BPE), which learns common subword patterns.
Tokenizers often mark leading spaces with a special symbol (Ġ in GPT-2, ▁ in SentencePiece) — representation varies by tokenizer.
- Tokens are the atomic units
- BPE: common substrings become tokens
- Vocabulary: ~50,000 tokens typical
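The core of BPE can be sketched in a few lines: start from individual characters and repeatedly merge the most frequent adjacent pair into a new token. This is a minimal illustration of the merge step, not a production tokenizer (real BPE trains on a large corpus and records the merge rules for reuse):

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count adjacent symbol pairs across the token sequence.
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    # Fuse every occurrence of `pair` into a single token.
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from characters; each round merges the most frequent pair.
tokens = list("the cat sat on the mat")
for _ in range(3):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))

print(tokens)  # after 3 merges, "at" and "the" have become single tokens
```

Run on a real corpus for tens of thousands of merges, this process is what produces the ~50,000-token vocabularies mentioned above.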
Step 2: Tokens to IDs
Each token maps to an integer ID via the vocabulary. In GPT-2's vocabulary, for example, "The" might be token 464 and "Ġcat" (cat with a leading space) might be 3797.
If a word isn't in the vocabulary — a rare word, a typo, or a new coinage — it's split into smaller tokens that are.
- Vocabulary is fixed at training time
- Each token has unique ID
- Unknown words → subword fallback
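The ID-mapping and subword-fallback behavior can be sketched with a greedy longest-match encoder. The vocabulary and IDs here are hypothetical toy values, not from any real tokenizer:

```python
# Toy vocabulary (hypothetical IDs; real vocabularies hold ~50,000 entries).
vocab = {"The": 0, " cat": 1, " sat": 2, " un": 3, "happi": 4, "ness": 5}

def encode(text, vocab):
    """Greedy longest-match: at each position, take the longest vocab entry."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try longest substring first
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

print(encode("The cat sat", vocab))    # → [0, 1, 2]
print(encode(" unhappiness", vocab))   # → [3, 4, 5]  (unknown word → subword fallback)
```

Note how " unhappiness", absent from the vocabulary as a whole word, still encodes cleanly as three known subword pieces.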
Step 3: IDs to Vectors
An embedding table maps each token ID to a dense vector. These vectors are learned during training — the model discovers what each token "means."
Dimensions: GPT-2 uses 768, GPT-3 uses up to 12,288.
- Embedding = learned vector per token
- High-dimensional (hundreds to thousands)
- Captures semantic meaning
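Mechanically, the embedding step is just a table lookup: row *i* of a learned matrix is token *i*'s vector. A minimal sketch with toy sizes (randomly initialized here; training is what gives the rows meaning):

```python
import random

vocab_size, d_model = 1000, 8   # toy sizes; GPT-2 uses 50,257 x 768
random.seed(0)

# The embedding table: one vector per token ID. Real models learn these
# values during training; here they are random placeholders.
embedding_table = [[random.gauss(0, 0.02) for _ in range(d_model)]
                   for _ in range(vocab_size)]

def embed(token_ids):
    # A lookup, not a computation: row i is token i's vector.
    return [embedding_table[i] for i in token_ids]

vectors = embed([464, 379])        # arbitrary toy token IDs
print(len(vectors), len(vectors[0]))   # → 2 8
```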
Semantic Meaning in Vectors
In embedding space, semantically similar words are close together. The classic example: king - man + woman ≈ queen. Vector arithmetic captures relationships.
This is how LLMs "understand" language — through geometric relationships between embeddings.
- Similar meaning → similar vectors
- Relationships captured as directions
- Foundation of semantic understanding
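The king/queen analogy can be demonstrated with hand-made toy vectors. These 3-dimensional embeddings are contrived so that one axis encodes a consistent "gender" direction; real learned embeddings have hundreds of dimensions and the relationship is approximate:

```python
import math

# Contrived 3-d toy embeddings; real embeddings are learned, not hand-set.
emb = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.8, 0.9],
    "man":   [0.1, 0.2, 0.1],
    "woman": [0.1, 0.2, 0.9],
}

def add(a, b): return [x + y for x, y in zip(a, b)]
def sub(a, b): return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    # Cosine similarity: 1.0 means the vectors point the same way.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# king - man + woman should land nearest to queen.
target = add(sub(emb["king"], emb["man"]), emb["woman"])
nearest = max(emb, key=lambda w: cosine(emb[w], target))
print(nearest)   # → queen
```

The arithmetic works because the man→woman and king→queen offsets are the same direction — relationships are encoded as directions in the space, which is the point of the bullets above.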