AI Engineering Series

Text to Tokens - The Foundation

Deep dive into tokenization: why models can't read text directly, subword algorithms like BPE, practical patterns, and the pitfalls that cause production failures

Why Tokenization Matters

Tokenization seems like preprocessing trivia. It’s not.

Every LLM interaction starts here. Tokenization determines what the model can see, how much it costs, and what it can generate. Bugs at this layer cause failures that are extremely difficult to diagnose because they look like model problems.

If you skip understanding tokenization:

  • You’ll wonder why non-English users complain about costs (hint: they’re paying 3-5x more per concept)
  • You’ll struggle with structured output when tokens don’t align with your expected boundaries
  • You’ll miscount context window usage and either truncate critical information or pay for waste

What Goes Wrong Without This:

Symptom: Your LLM can't do basic arithmetic reliably.
Cause: Numbers tokenize inconsistently. "1234" might be one token,
       two tokens, or four tokens depending on the tokenizer.

Symptom: Japanese/Arabic/Hindi users report higher costs.
Cause: Tokenizer trained primarily on English. Other languages
       require 3-5x more tokens for the same semantic content.

Symptom: Model outputs "Hel" instead of "Hello"
Cause: Token boundaries don't align with word boundaries.
       "Hello" and " Hello" are different tokens.

Why Models Can’t Read Text

Neural networks are math machines. They do matrix multiplications. They need numbers.

"Hello world"  ???  [0.23, -0.41, 0.89, ...]

The ??? is tokenization.

Text is a sequence of characters. Characters have Unicode values, but using raw Unicode doesn’t work:

"cat"  [99, 97, 116]        # ASCII/Unicode
"Cat"  [67, 97, 116]        # Different numbers!
"CAT"  [67, 65, 84]         # Totally different

Models would have to learn that these mean the same thing
from raw numbers alone. Possible, but wasteful.

Why Not Just Split on Words?

Problem 1: Vocabulary explosion
  English has ~170,000 words in common use.
  Add proper nouns, technical terms, typos...
  "COVID-19" — in your vocabulary?
  "ketankhairnar" — definitely not.

Problem 2: Out-of-vocabulary (OOV)
  Unknown words  →  <UNK> token
  Model has no idea what <UNK> means.

Problem 3: No subword sharing
  "run", "running", "runner" are clearly related.
  Word-level treats them as completely separate.

Subword Tokenization

The solution: break words into pieces. Common pieces become tokens.

"unhappiness"  ["un", "happi", "ness"]

Benefits:
- Finite vocabulary (typically 32K-100K tokens)
- Rare words decompose into known pieces
- Morphology captured ("un-" prefix, "-ness" suffix)

Three dominant algorithms:

+--------------------------+--------------------+--------------------------------------+
| Algorithm                | Used By            | Key Idea                             |
+--------------------------+--------------------+--------------------------------------+
| BPE (Byte Pair Encoding) | GPT, LLaMA, Claude | Merge frequent byte pairs            |
| WordPiece                | BERT               | Merge to maximize likelihood         |
| SentencePiece            | T5, multilingual   | Language-agnostic, works on raw text |
+--------------------------+--------------------+--------------------------------------+

BPE: How It Works

Start with character-level vocabulary. Repeatedly merge the most frequent adjacent pair.

Training corpus: "low lower lowest"

Step 0 - Character vocabulary:
  ['l', 'o', 'w', 'e', 'r', 's', 't', ' ']

Step 1 - Count adjacent pairs:
  'lo' appears 3 times (in low, lower, lowest)
  'ow' appears 3 times
  ...

Step 2 - Merge most frequent: 'l' + 'o'  →  'lo'
Step 3 - Repeat until vocabulary size reached (e.g., 50,000 tokens)

Result: Common words become single tokens. Rare words decompose.

"the"             ["the"]                    # 1 token
"understanding"   ["under", "standing"]      # 2 tokens
"defenestration"  ["def", "en", "est", "ration"] # 4 tokens

Practical Tokenization Patterns

Whitespace handling:

Most tokenizers include leading whitespace in tokens:

"Hello world"  ["Hello", " world"]
                         ^ space is part of token

This is why " world" and "world" are different tokens.
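You can see this with tiktoken (the same library used in the code example at the end of this article); the specific IDs depend on the encoding, but "world" and " world" will not match:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

for text in ["world", " world", "Hello world"]:
    ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in ids]
    print(f"{text!r:15} -> {ids} {pieces}")

# 'world' and ' world' get different IDs; in "Hello world" the space
# travels with the second token, not on its own.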

Case sensitivity:

Usually case-sensitive:
"Hello"  [token_123]
"hello"  [token_456]
"HELLO"  [token_789] or decomposed: ["HE", "LLO"]

Numbers:

Numbers often tokenize digit-by-digit or in chunks:

"123"      ["123"]          # if common
"12345"    ["123", "45"]    # chunked
"3.14159"  ["3", ".", "14", "159"]

Arithmetic is hard because digits aren't reliably grouped.
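A quick way to inspect the grouping (the actual splits depend on the encoding, so don't rely on any particular pattern):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

for number in ["123", "1234", "12345", "3.14159"]:
    pieces = [enc.decode([t]) for t in enc.encode(number)]
    print(f"{number:10} -> {pieces}")

# Digit grouping changes with length and position, so the model never
# gets a consistent ones/tens/hundreds structure to do arithmetic over.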

Code:

Code tokenization varies wildly:

Python: "def foo():"  ["def", " foo", "():", ...]
JSON:   "{"key":"    ["{", '"', "key", '"', ":"]

Common patterns (def, function, return)  single tokens
Rare identifiers  decomposed to pieces
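
Running the two snippets above through tiktoken shows the splits (the pieces will differ for other tokenizers):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

for snippet in ['def foo():', '{"key":']:
    pieces = [enc.decode([t]) for t in enc.encode(snippet)]
    print(f"{snippet!r:14} -> {pieces}")

# Keywords like 'def' tend to be single tokens, while punctuation-heavy
# JSON splits into many small pieces.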

Tokenizer Differences Matter:

OpenAI (GPT-4):     "Hello world"  →  [9906, 1917]
Anthropic (Claude): "Hello world"  →  [different IDs]
Meta (LLaMA):       "Hello world"  →  [different IDs]

The token IDs mean completely different things.
You cannot mix tokenizers and models.
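Claude's and LLaMA's tokenizers aren't available through tiktoken, but even comparing OpenAI's own GPT-2 and GPT-4 encodings makes the point: the same string gets different IDs (and often a different count):

import tiktoken

text = "Hello world"
for name in ["gpt2", "cl100k_base"]:        # GPT-2 vs GPT-4 encodings
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(text)
    print(f"{name:12} -> {ids} ({len(ids)} tokens)")

# Token IDs are only meaningful relative to their own vocabulary;
# decoding GPT-4 IDs with the GPT-2 tokenizer produces garbage.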

The Context Window Problem

Every model has a maximum context length. Measured in tokens, not characters.

+---------------+----------------------------+
| Model         | Context Length (tokens)    |
+---------------+----------------------------+
| GPT-4         | 8K / 32K / 128K            |
| Claude 3      | 200K                       |
| LLaMA 3       | 8K (extendable)            |
| Gemini 1.5    | 1M (preview)               |
+---------------+----------------------------+

Tokens ≠ Characters ≠ Words:

Rule of thumb for English:
  1 token ≈ 4 characters ≈ 0.75 words

"The quick brown fox jumps over the lazy dog"
  Characters: 43
  Words: 9
  Tokens: ~11

But this varies by language and content type.
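For budgeting a context window, count tokens with the target model's tokenizer instead of estimating from characters. A minimal helper using tiktoken (which covers OpenAI models; other providers ship their own counting utilities):

import tiktoken

def count_tokens(text: str, model: str = "gpt-4") -> int:
    """Count tokens the way the target model will see them."""
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

sentence = "The quick brown fox jumps over the lazy dog"
print(f"{len(sentence)} characters, {len(sentence.split())} words, "
      f"{count_tokens(sentence)} tokens")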

Tokenization Pitfalls

Non-English Languages

Most tokenizers are trained primarily on English text,
so other languages get worse token efficiency.

English:  "hello"      1 token
Japanese: "こんにちは"  3-5 tokens (same meaning!)
Arabic:   "مرحبا"      4-6 tokens

Same semantic content, 3-5x more tokens.
This means:
  - Higher costs
  - Smaller effective context window
  - Sometimes worse model performance
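You can measure the gap yourself; exact counts depend on the encoding, and longer passages show the ratio more reliably than single words:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

greetings = {
    "English":  "hello",
    "Japanese": "こんにちは",
    "Arabic":   "مرحبا",
}
for language, text in greetings.items():
    print(f"{language}: {text!r} -> {len(enc.encode(text))} tokens")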

Rare Words and Neologisms

"COVID-19" (pre-2020 tokenizer): ["CO", "VID", "-", "19"]
"ChatGPT" (early tokenizer):     ["Chat", "G", "PT"]

Model must infer meaning from pieces.
Usually works, but costs more tokens.

Adversarial Inputs

Unicode tricks can break tokenizers:

"Hello" vs "Ηello"  # second H is Greek Eta
  - Look identical
  - Different tokens
  - Model might behave differently

Whitespace characters:
  Regular space vs non-breaking space vs zero-width space
  - Visually same
  - Different tokens
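Both tricks are easy to reproduce. Unicode normalization (NFKC) folds some look-alikes such as the non-breaking space, but it does not catch cross-script homoglyphs like the Greek Eta, so it is only a partial defense:

import unicodedata
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

latin = "Hello"                 # Latin capital H
greek = "\u0397ello"            # Greek capital Eta, looks identical
print(latin == greek)                           # False
print(enc.encode(latin) == enc.encode(greek))   # False: different tokens

regular = "Hello world"
nbsp    = "Hello\u00a0world"    # non-breaking space instead of a space
print(enc.encode(regular) == enc.encode(nbsp))  # False: different tokens

# NFKC maps the non-breaking space back to a regular space...
print(unicodedata.normalize("NFKC", nbsp) == regular)   # True
# ...but leaves the Greek Eta alone
print(unicodedata.normalize("NFKC", greek) == latin)    # False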

Token Boundaries Affect Generation

If you want the model to output exactly "Hello":

  Prompt: "Say Hello"
  Output: "Hello"

But if the tokenizer splits "Hello"  →  ["Hel", "lo"]:
  The model generates "Hel" and "lo" as separate steps. Anything that
  operates at the token level (stop sequences, max_tokens, logit bias)
  can cut generation off after "Hel".

Special Tokens

Every tokenizer has reserved tokens for structure:

+----------------+--------------------------------+
| Token          | Purpose                        |
+----------------+--------------------------------+
| <BOS> / <s>    | Beginning of sequence          |
| <EOS> / </s>   | End of sequence                |
| <PAD>          | Padding for batch processing   |
| <UNK>          | Unknown token (rare in BPE)    |
| <|im_start|>   | Message boundary (chat models) |
| [INST]         | Instruction marker             |
+----------------+--------------------------------+

These are NOT in your text. They're added by formatting.
Chat templates use them to structure conversations.
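tiktoken treats these as protected: by default it refuses to encode them from ordinary text, so user input cannot inject control tokens. A small check (the chat markers like <|im_start|> and [INST] belong to specific model families and their chat templates; they are not part of the base encoding):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

try:
    enc.encode("<|endoftext|>")             # rejected by default
except ValueError as err:
    print("Rejected:", err)

# Allowed only when you opt in, e.g. when preparing training data
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))
# [100257] in cl100k_base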

Code Example

Minimal tokenization exploration with tiktoken (OpenAI’s tokenizer):

import tiktoken

# Load GPT-4 tokenizer
enc = tiktoken.encoding_for_model("gpt-4")

def explore_tokenization(text: str) -> None:
    """Show how text becomes tokens."""
    tokens = enc.encode(text)
    print(f"Text: {text!r}")
    print(f"Tokens: {tokens}")
    print(f"Count: {len(tokens)}")

    # Decode each token to see the pieces
    pieces = [enc.decode([t]) for t in tokens]
    print(f"Pieces: {pieces}")
    print()

# Common word - single token
explore_tokenization("hello")

# Compound word - multiple tokens
explore_tokenization("unbelievable")

# Code - mixed tokenization
explore_tokenization("def calculate_total():")

# Numbers - often split
explore_tokenization("123456789")

# Non-English - more tokens per concept
explore_tokenization("Hello")      # English
explore_tokenization("Bonjour")    # French
explore_tokenization("こんにちは")  # Japanese

Key Takeaways

1. Tokenization is the foundation of every LLM interaction
   - Determines cost, context limits, and model behavior

2. Subword tokenization (BPE) balances vocabulary size with coverage
   - Common words  →  single tokens
   - Rare words  →  decomposed to pieces

3. Tokenizers are model-specific
   - Never mix tokenizers and models
   - Same text  →  different token counts across providers

4. Non-English text is more expensive
   - 3-5x more tokens for same semantic content
   - Important for international applications

5. Token boundaries affect generation
   - "Hello" and " Hello" are different tokens
   - Matters for structured output

Verify Your Understanding

Before proceeding, you should be able to:

Explain why “Hello” and “ Hello” are different tokens — What does whitespace handling in tokenizers tell you about how prompts are represented?

Predict relative token counts for English vs. Japanese — Which will have more tokens? By roughly how much? Why does this matter for cost and performance?

Given a tokenizer, explain how BPE builds the vocabulary — Walk through the merge process.

Identify the tokenization trap: You want the model to output JSON with a specific field name “customerID”. The model sometimes outputs “customer_ID” or “customerId”. What might be happening at the token level?


What’s Next

After this, you can:

  • Continue → Tokens → Embeddings — how tokens become vectors
  • Apply → Build semantic search once you understand embeddings