
AI Engineering Series

Tokens to Embeddings - Vectors That Capture Meaning

Deep dive into embeddings: why one-hot encoding fails, how meaning emerges from training, measuring similarity, and the difference between token and sentence embeddings

Building On Previous Knowledge

In the previous progression, you learned how text becomes a sequence of token IDs. This created a problem: token ID 3797 is just an arbitrary number. It has no inherent relationship to token ID 3798, even if one means “cat” and the other means “kitten.”

Neural networks need representations where similar meanings are mathematically close. Token IDs don’t provide this.

This progression solves that problem by introducing embeddings: learned vectors where semantic similarity becomes geometric proximity.

What Goes Wrong Without This:

Symptom: Semantic search returns irrelevant results despite high scores.
Cause: Embedding model trained on general web text. Your domain
       uses specialized vocabulary the model doesn't understand.

Symptom: Multilingual search works poorly.
Cause: Embedding model trained primarily on English.
       Other languages are poorly aligned in the embedding space.

Symptom: Embedding-based system degrades without code changes.
Cause: You updated the embedding model or the provider did.
       Old embeddings no longer align with new query embeddings.
       This is called "embedding drift."

Token IDs Are Meaningless

After tokenization, you have a sequence of integers:

"The cat sat on the mat"
   [464, 3797, 3332, 319, 262, 2603]

These numbers are arbitrary vocabulary indices.
  Token 3797 = "cat"
  Token 3798 = "catch"  # no meaningful relationship to 3797

The model needs a representation where "cat" and "kitten"
are mathematically close, while "cat" and "democracy" are far apart.
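You can check the arbitrariness yourself with the tiktoken library (a minimal sketch; the IDs in this article illustrate GPT-2's vocabulary, and what neighboring IDs decode to will vary by tokenizer):

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2's BPE vocabulary, 50,257 tokens

ids = enc.encode("The cat sat on the mat")
print(ids)  # arbitrary integers, nothing semantic about them

# Neighboring IDs decode to unrelated strings: numeric closeness means nothing
for token_id in [3797, 3798]:
    print(token_id, repr(enc.decode([token_id])))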

Why One-Hot Encoding Fails

The naive approach: represent each token as a vector with one “1” and all other positions “0.”

Vocabulary size: 50,257 tokens

Token "cat" (ID 3797):
  [0, 0, 0, ..., 1, ..., 0, 0]
                 ^ position 3797

  50,257 dimensions. 50,256 zeros. One "1".

Token "kitten" (ID 4521):
  [0, 0, 0, ..., 1, ..., 0, 0]
                 ^ position 4521

Three fatal problems:

1. No semantic information

Similarity between any two different tokens = 0

dot_product(cat, kitten) = 0     # orthogonal!
dot_product(cat, democracy) = 0  # also orthogonal!

Every pair of words is equally "unrelated."
The representation contains no meaning.
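To see the orthogonality of problem 1 concretely, here is a minimal numpy sketch with a toy 10-token vocabulary standing in for the real 50,257:

import numpy as np

VOCAB_SIZE = 10  # toy vocabulary; real models use ~50,000

def one_hot(token_id: int) -> np.ndarray:
    """A vector of zeros with a single 1 at the token's index."""
    v = np.zeros(VOCAB_SIZE)
    v[token_id] = 1.0
    return v

cat, kitten, democracy = one_hot(3), one_hot(4), one_hot(7)

print(np.dot(cat, kitten))     # 0.0 -- orthogonal
print(np.dot(cat, democracy))  # 0.0 -- every pair equally "unrelated"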

2. Curse of dimensionality

Vocabulary = 50,000 → 50,000-dimensional vectors
Each vector is 99.998% zeros

Memory: 50,000 x 50,000 x 4 bytes = 10 GB just for embeddings
Compute: mostly multiplying by zero

3. No transfer learning

Learning about "cat" teaches nothing about "kitten"
They're orthogonal—completely independent
Must see every word many times to learn anything about it

Embeddings solve all three:

  • Dense (no wasted dimensions)
  • Semantic (similar words → similar vectors)
  • Transfer (related words share structure)

Embeddings: Vectors That Capture Meaning

An embedding is a dense vector of floating-point numbers representing a concept.

Token ID 3797 ("cat") → [0.23, -0.41, 0.89, 0.12, ..., -0.33]
                        +---------- 768 dimensions ---------+

This vector encodes everything the model learned about "cat":
  - It's an animal
  - It's furry
  - It's a pet
  - It appears in similar contexts as "dog", "kitten", "pet"

All of this compressed into ~768 numbers.

The Embedding Matrix

Models store embeddings in a lookup table:

+------------------------------------------------------------------+
|                        Embedding Matrix                          |
|                  (vocab_size x embedding_dim)                    |
|                                                                  |
|  Token ID       Embedding Vector                                 |
|  --------       ----------------                                 |
|     0           [0.12, -0.34, 0.56, ..., 0.78]    token "the"    |
|     1           [0.23, 0.45, -0.67, ..., 0.89]    token "a"      |
|     2           [-0.11, 0.22, 0.33, ..., -0.44]   token "is"     |
|    ...          ...                                              |
|  50256          [0.91, -0.82, 0.73, ..., 0.64]    last token     |
|                                                                  |
|  GPT-2: 50,257 tokens x 768 dimensions = 38.6M parameters        |
+------------------------------------------------------------------+

Lookup is O(1): given token ID, grab that row from the matrix.
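In code, that lookup is plain row indexing. A numpy sketch with random values standing in for learned weights:

import numpy as np

vocab_size, embedding_dim = 50_257, 768
rng = np.random.default_rng(0)

# Random matrix standing in for the learned embedding weights
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim)).astype(np.float32)

token_ids = [464, 3797, 3332, 319, 262, 2603]  # "The cat sat on the mat"
embedded = embedding_matrix[token_ids]         # one O(1) row lookup per token

print(embedded.shape)  # (6, 768): one 768-dim vector per token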

How Meaning Emerges

Embeddings aren’t designed. They’re learned from data.

Training objective: predict next token (or masked token)

"The cat sat on the ___"

Model sees millions of examples where:
  - "cat" appears near "dog", "pet", "furry", "meow"
  - "cat" appears after "the", "a", "my"
  - "cat" appears before "sat", "slept", "ran"

Gradient descent adjusts embeddings so:
  - Similar context → similar embeddings
  - "cat" and "kitten" vectors become close
  - "cat" and "democracy" vectors stay far

The famous Word2Vec result:

vector("king") - vector("man") + vector("woman") ≈ vector("queen")

Embeddings capture relationships:
  king:queen :: man:woman
  paris:france :: tokyo:japan

This emerges from context, not explicit programming.
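You can reproduce the analogy with pretrained Word2Vec vectors (a sketch using gensim's downloader API; the model is a large download on first use, and exact neighbors vary between pretrained models):

import gensim.downloader as api  # pip install gensim

# Pretrained Word2Vec vectors trained on Google News
vectors = api.load("word2vec-google-news-300")

# king - man + woman ≈ ?
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
print(result)  # "queen" typically ranks at or near the top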

Measuring Similarity

Two vectors are similar if they point in similar directions.

Cosine Similarity

Cosine similarity: measure angle between vectors
  1.0 = identical direction (very similar)
  0.0 = orthogonal (unrelated)
 -1.0 = opposite direction (antonyms, sometimes)

sim("cat", "kitten")     ≈ 0.85   # very similar
sim("cat", "dog")        ≈ 0.75   # related but different
sim("cat", "democracy")  ≈ 0.12   # unrelated

Dot Product

Most embedding models normalize vectors to unit length.
When ||v|| = 1 for all vectors:
  cosine_similarity(a, b) = dot_product(a, b)

This makes similarity computation fast: just matrix multiply.

Euclidean Distance

Euclidean distance: straight-line distance in vector space

distance("cat", "kitten") ≈ 0.3   # close together
distance("cat", "democracy") ≈ 1.8   # far apart

Lower distance = more similar (note the scale runs opposite to cosine similarity)

When to use which:

+---------------------+----------------------+----------------------------------------+
|  Metric             |  Best For            |  Note                                  |
+---------------------+----------------------+----------------------------------------+
|  Cosine similarity  |  Semantic similarity |  Direction matters, magnitude doesn't  |
|  Dot product        |  Ranking, attention  |  Faster; magnitude affects result      |
|  Euclidean distance |  Clustering, k-NN    |  Position in space, not just direction |
+---------------------+----------------------+----------------------------------------+

Rule of thumb: Use cosine similarity for text embeddings. It’s the standard.
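A minimal numpy sketch comparing the three metrics on one pair of made-up vectors, and confirming that cosine similarity equals the dot product once vectors are normalized:

import numpy as np

a = np.array([0.23, -0.41, 0.89, 0.12])
b = np.array([0.20, -0.35, 0.95, 0.08])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = np.dot(a, b)
euclidean = np.linalg.norm(a - b)

print(f"cosine:    {cosine:.3f}")    # direction only
print(f"dot:       {dot:.3f}")       # direction plus magnitude
print(f"euclidean: {euclidean:.3f}") # straight-line distance

# Normalize to unit length: now cosine similarity and dot product coincide
a_unit, b_unit = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(np.isclose(np.dot(a_unit, b_unit), cosine))  # True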


Token vs Sentence Embeddings

Two different things, often confused:

+--------------------------------------------------------------------+
|  TOKEN EMBEDDINGS                                                  |
|  ----------------                                                  |
|  One vector per token                                              |
|  "The cat sat" → 3 vectors, one for each token                     |
|                                                                    |
|  These are INSIDE the model, between layers.                       |
|  Not directly useful for semantic search.                          |
+--------------------------------------------------------------------+
|  SENTENCE/TEXT EMBEDDINGS                                          |
|  ------------------------                                          |
|  One vector per text chunk                                         |
|  "The cat sat" → 1 vector representing whole meaning               |
|                                                                    |
|  These are OUTPUT of specialized embedding models.                 |
|  Used for semantic search, clustering, classification.             |
|                                                                    |
|  Examples: OpenAI text-embedding-3, Cohere embed, sentence-BERT    |
+--------------------------------------------------------------------+

How sentence embeddings are created:

Method 1: Mean pooling (average all token vectors)
  [v1, v2, v3, v4] → (v1 + v2 + v3 + v4) / 4

Method 2: CLS token (use special token's output)
  [CLS] The cat sat [SEP] → use embedding of [CLS]

Method 3: Trained pooler (learned combination)
  Model learns optimal way to combine token embeddings

Modern embedding models use Method 3 with contrastive training.
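Mean pooling (Method 1) is simple enough to sketch directly (numpy; the token vectors are made up, and tiny, for illustration):

import numpy as np

# Four made-up 4-dim token vectors standing in for real 384/768-dim ones
token_vectors = np.array([
    [0.2, -0.1, 0.5, 0.3],   # v1: "The"
    [0.9, 0.4, -0.2, 0.1],   # v2: "cat"
    [0.1, 0.8, 0.3, -0.5],   # v3: "sat"
    [0.0, -0.3, 0.6, 0.2],   # v4: "."
])

# Mean pooling: the sentence vector is the average of all token vectors
sentence_embedding = token_vectors.mean(axis=0)
print(sentence_embedding.shape)  # (4,): one vector for the whole text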

Embedding Dimensions

Common dimension sizes and their tradeoffs:

+---------------+----------------+-------------+--------------------------+
|  Dimensions   |  Memory/Speed  |  Quality    |  Example Models          |
+---------------+----------------+-------------+--------------------------+
|  384          |  Fast, small   |  Good       |  all-MiniLM-L6-v2        |
|  768          |  Medium        |  Better     |  BERT, e5-base           |
|  1024         |  Slower        |  Very good  |  e5-large                |
|  1536         |  Slow          |  Excellent  |  text-embedding-3-small  |
|  3072         |  Very slow     |  Best       |  text-embedding-3-large  |
+---------------+----------------+-------------+--------------------------+

Higher dimensions = more information capacity
But: diminishing returns, and the model's projection layers
scale quadratically with dimension

Practical guidance:

Prototype / cost-sensitive: 384 dims (all-MiniLM)
Production / quality-matters: 768-1024 dims (e5-base/large)
Best quality, cost no object: 1536+ dims (OpenAI large)

Most applications: 768 is the sweet spot.

Contextual vs Static Embeddings

Static embeddings (Word2Vec, GloVe): one vector per word, always.

"I went to the bank to deposit money"
"I sat on the river bank"

Static: "bank"  same vector in both sentences
Problem: completely different meanings!

Contextual embeddings (BERT, GPT, modern): vector depends on context.

"I went to the bank to deposit money"
  "bank"  [financial institution vector]

"I sat on the river bank"
  "bank"  [riverside vector]

Same word, different vectors based on surrounding words.

All modern models use contextual embeddings. Each token’s embedding changes based on what’s around it. This happens through attention (next progression).
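You can observe this directly with a BERT-family model (a sketch using the transformers library; it locates the "bank" token by inspecting the tokenized output rather than hard-coding an index):

import torch
from transformers import AutoModel, AutoTokenizer  # pip install transformers torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Return the contextual embedding of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

a = bank_vector("I went to the bank to deposit money")
b = bank_vector("I sat on the river bank")

# Same word, different vectors: cosine similarity is well below 1.0
print(torch.cosine_similarity(a, b, dim=0).item())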

Positional Information

Embeddings alone don’t encode position:

"dog bites man"  vs  "man bites dog"

Same tokens, same embeddings, completely different meaning!

Models add positional encoding:

final_embedding = token_embedding + position_embedding

Position embeddings: learned vectors for each position
  Position 0: [0.1, -0.2, ...]
  Position 1: [0.3, 0.4, ...]
  Position 2: [-0.1, 0.5, ...]

Or: sinusoidal functions (original Transformer)
Or: RoPE (rotary position embeddings, modern LLMs)

This lets the model distinguish word order.
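Here is a minimal numpy sketch of the sinusoidal variant from the original Transformer paper (each position gets a fixed pattern of sines and cosines at different frequencies):

import numpy as np

def sinusoidal_positions(seq_len: int, dim: int) -> np.ndarray:
    """Sinusoidal positional encodings ('Attention Is All You Need')."""
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    freqs = 1.0 / 10000 ** (np.arange(0, dim, 2) / dim)  # (dim/2,)
    angles = positions * freqs                           # (seq_len, dim/2)
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(angles)  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions get cosine
    return pe

pe = sinusoidal_positions(seq_len=6, dim=768)
token_embeddings = np.zeros((6, 768))     # stand-in for real token embeddings
final_embeddings = token_embeddings + pe  # position added to every token
print(pe.shape)  # (6, 768)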

Code Example

Using a real embedding model to see embeddings in action:

from sentence_transformers import SentenceTransformer
import numpy as np

# Load a popular embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Compute cosine similarity between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Get embeddings
texts = [
    "The cat sat on the mat",
    "A kitten was resting on the rug",
    "Python is a programming language",
    "I love machine learning",
]

embeddings = model.encode(texts)

print(f"Embedding shape: {embeddings.shape}")  # (4, 384)
print(f"First embedding (first 10 dims): {embeddings[0][:10]}")

# Compare similarities
print("\nSimilarity matrix:")
for i, text_i in enumerate(texts):
    for j, text_j in enumerate(texts):
        sim = cosine_similarity(embeddings[i], embeddings[j])
        print(f"  [{i}][{j}] {sim:.3f}", end="")
    print(f"  ← {text_i[:30]}...")

# Expected output:
# Cat sentences similar to each other, different from Python/ML

Key Takeaways

1. Token IDs are arbitrary integers with no inherent meaning

2. Embeddings are dense vectors (384-3072 dims) encoding semantics

3. Meaning emerges from training on context, not explicit rules

4. Similar meaning → similar vectors (measurable with cosine similarity)

5. Modern embeddings are contextual: same word, different vector based on context

6. Position is added separately (positional encoding)

7. Token embeddings ≠ sentence embeddings
   - Token: one vector per word, inside model
   - Sentence: one vector per text, output of embedding model

Verify Your Understanding

Before proceeding, you should be able to:

Explain “king - man + woman = queen” without using the words “vector” or “embedding” — A genuine explanation might involve: “Words that appear in similar contexts develop similar internal representations…”

What problem do contextual embeddings solve that static embeddings can't? — Give a specific example sentence where static embeddings fail.

Given these two sentences:

  • “The bank was closed for the holiday”
  • “The river bank was eroded by flooding”

Will a static embedding model give “bank” the same vector in both? Will a contextual model? Why does this matter?

Your embedding model gives similarity = 0.92 for two texts. What does this tell you? List at least two things it does NOT tell you.


What’s Next

After this, you can:

  • Continue → Embeddings → Attention — how tokens “look at” each other
  • Build → Semantic search with what you’ve learned