Geometric intuitions for vectors, cosine similarity, dot products, and matrix multiplication in AI
TL;DR
Embeddings are vectors in high-dimensional space where similar meanings cluster together. Understanding dot products, cosine similarity, and matrix multiplication is essential for working with embeddings and attention mechanisms.
Visual Overview
EMBEDDING SPACE (visualized in 2D, real embeddings are 384-4096 dims)
+-----------------------------------------------------------+
|                                                           |
|           |                                               |
|   cat *   |   * dog                                       |
|           |                                               |
|           |      * puppy                                  |
|           |                                               |
|   --------+--------------------------                     |
|           |                                               |
|           |   * car      * truck                          |
|           |                                               |
|           |        * vehicle                              |
|                                                           |
|   Semantic similarity = geometric proximity               |
|   "cat" is closer to "dog" than to "car"                  |
|                                                           |
+-----------------------------------------------------------+
Key insight: When a model converts text to embeddings, it’s placing words/sentences at coordinates in a space where distance = meaning difference.
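To make the picture concrete, here is a minimal numpy sketch with made-up 2D coordinates (real embeddings come from a trained model and have far more dimensions):

```python
import numpy as np

# Toy 2D "embeddings" -- invented coordinates for illustration only;
# real embeddings come from a trained model and have 384-4096 dims.
vectors = {
    "cat": np.array([0.90, 0.80]),
    "dog": np.array([0.85, 0.75]),
    "car": np.array([-0.70, -0.60]),
}

# Distance in the space reflects difference in meaning.
print(np.linalg.norm(vectors["cat"] - vectors["dog"]))  # ~0.07 -> similar
print(np.linalg.norm(vectors["cat"] - vectors["car"]))  # ~2.13 -> dissimilar
```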
What Dimensions Represent
DIMENSIONS
+-----------------------------------------------------------+
| |
| Each dimension captures some learned feature. |
| |
| Hypothetical (models don't label dimensions): |
| Dimension 1: animate vs inanimate |
| Dimension 2: size |
| Dimension 3: domesticated vs wild |
| ... |
| Dimension 768: ??? |
| |
| In practice: Dimensions aren't interpretable |
| individually. The geometry of relationships is what |
| matters. |
| |
+-----------------------------------------------------------+
Dot Product
The dot product is the fundamental operation in neural networks. Attention, similarity, and layer computations all use it.
DOT PRODUCT FORMULA
+-----------------------------------------------------------+
| |
| a . b = SUM(a_i x b_i) |
| |
| Example (3D vectors): |
| a = [3, 4, 0] |
| b = [2, 1, 2] |
| |
| a . b = (3x2) + (4x1) + (0x2) = 6 + 4 + 0 = 10 |
| |
+-----------------------------------------------------------+
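The same worked example, checked with numpy:

```python
import numpy as np

a = np.array([3, 4, 0])
b = np.array([2, 1, 2])

# Elementwise multiply, then sum: (3*2) + (4*1) + (0*2) = 10
print(np.dot(a, b))   # 10
print((a * b).sum())  # 10, written out explicitly
```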
GEOMETRIC MEANING
+-----------------------------------------------------------+
| |
| a . b = |a| x |b| x cos(theta) |
| |
| Where: |
| |a| = length of vector a |
| |b| = length of vector b |
| theta = angle between vectors |
| |
+-----------------------------------------------------------+
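A quick numeric check of the identity, reusing the vectors above (|a| = 5, |b| = 3):

```python
import numpy as np

a = np.array([3.0, 4.0, 0.0])
b = np.array([2.0, 1.0, 2.0])

dot = np.dot(a, b)                                      # 10.0
norm_a, norm_b = np.linalg.norm(a), np.linalg.norm(b)   # |a| = 5, |b| = 3
cos_theta = dot / (norm_a * norm_b)                     # ~0.667

theta = np.degrees(np.arccos(cos_theta))                # angle between a and b
print(dot, norm_a * norm_b * cos_theta)                 # both ~10.0
print(theta)                                            # ~48.2 degrees
```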
DOT PRODUCT SIGN
+-----------------------------------------------------------+
|                                                           |
|         b                                                 |
|         ^                                                 |
|         |     a.b > 0 (similar direction)                 |
|    -----*-----> a                                         |
|         |                                                 |
|         |     a.b < 0 (opposite direction)                |
|         v                                                 |
|                                                           |
|   Same direction (0 deg):   cos(0)   =  1 -> positive     |
|   Perpendicular (90 deg):   cos(90)  =  0 -> zero         |
|   Opposite (180 deg):       cos(180) = -1 -> negative     |
|                                                           |
+-----------------------------------------------------------+
In attention: Query . Key computes relevance. High dot product = this key is relevant to this query.
Cosine Similarity
Cosine similarity is a normalized dot product. It measures direction alignment, ignoring magnitude.
COSINE SIMILARITY
+-----------------------------------------------------------+
| |
| cos_sim(a, b) = (a . b) / (|a| x |b|) |
| |
| Range: [-1, 1] |
| 1.0 = identical direction (parallel) |
| 0.0 = orthogonal (unrelated) |
| -1.0 = opposite direction (antonyms, in some spaces) |
| |
+-----------------------------------------------------------+
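A direct translation of the formula into numpy (assumes neither vector is all zeros):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos_sim(a, b) = (a . b) / (|a| * |b|), in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([3.0, 4.0, 0.0])
b = np.array([2.0, 1.0, 2.0])
print(cosine_similarity(a, b))   # ~0.667
print(cosine_similarity(a, a))   # 1.0  (identical direction)
print(cosine_similarity(a, -a))  # -1.0 (opposite direction)
```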
WHY NORMALIZE?
+-----------------------------------------------------------+
| |
| WITHOUT NORMALIZATION: |
| |
| Vector lengths vary: |
| "king" might have ||v|| = 10 |
| "queen" might have ||v|| = 8 |
| |
| Raw dot product: |
| king . queen = 75 |
| king . dog = 80 <-- Higher! But "dog" isn't |
| more similar |
| |
| Problem: Length dominates, not direction. |
| |
| WITH NORMALIZATION: |
| |
| Cosine similarity: |
| cos(king, queen) = 0.95 |
| cos(king, dog) = 0.30 |
| |
| Now direction dominates. "queen" is more similar. |
| |
+-----------------------------------------------------------+
In practice: Most embedding models output normalized vectors (length = 1). When vectors are normalized, dot product = cosine similarity.
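A quick check of that equivalence on random 384-dimensional vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=384), rng.normal(size=384)

# L2-normalize so each vector has length 1.
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)

cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.isclose(np.dot(a_hat, b_hat), cos_sim))  # True: dot == cosine once normalized
```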
Distance Metrics
Euclidean Distance (L2)
EUCLIDEAN DISTANCE
+-----------------------------------------------------------+
|                                                           |
|   d(a, b) = sqrt(SUM((a_i - b_i)^2))                      |
|                                                           |
|   "Straight line" distance in space.                      |
|                                                           |
|        |                                                  |
|      a *                                                  |
|        |\                                                 |
|        | \   d = 5                                        |
|        |  \                                               |
|        |   * b                                            |
|   -----+----------                                        |
|                                                           |
+-----------------------------------------------------------+
Cosine Distance
COSINE DISTANCE
+-----------------------------------------------------------+
| |
| cos_dist(a, b) = 1 - cos_sim(a, b) |
| |
| Range: [0, 2] |
| 0 = identical direction |
| 1 = orthogonal |
| 2 = opposite direction |
| |
+-----------------------------------------------------------+
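A small sketch contrasting the two metrics: the vectors below point the same way but differ in length, so Euclidean distance is large while cosine distance is zero.

```python
import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(a - b))  # sqrt(sum((a_i - b_i)^2))

def cosine_distance(a, b):
    cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(1.0 - cos_sim)          # in [0, 2]

a = np.array([1.0, 0.0])
b = np.array([10.0, 0.0])  # same direction, much larger magnitude

print(euclidean(a, b))        # 9.0 -- penalizes the length difference
print(cosine_distance(a, b))  # 0.0 -- identical direction
```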
When to Use What
| Metric | When | Why |
|---|---|---|
| Cosine similarity | Text embeddings | Direction = semantic meaning |
| Cosine distance | Retrieval ranking | Lower = more similar |
| Euclidean (L2) | Some image embeddings | Magnitude can carry info |
| Dot product | Normalized vectors | Fast, equals cosine sim |
Default choice: Cosine similarity for text. It's what most text embedding models are trained to optimize.
Matrix Multiplication
Neural networks are stacks of matrix multiplications. Understanding this operation clarifies how models transform representations.
MATRIX x VECTOR = NEW VECTOR
+-----------------------------------------------------------+
|                                                           |
|   [ 2  0 ]     [ 3 ]     [ 6 ]                            |
|   [      ]  x  [   ]  =  [   ]                            |
|   [ 0  3 ]     [ 2 ]     [ 6 ]                            |
|                                                           |
|   This matrix scales x by 2, y by 3.                      |
|                                                           |
+-----------------------------------------------------------+
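The same example in numpy:

```python
import numpy as np

M = np.array([[2, 0],
              [0, 3]])   # scales x by 2, y by 3
v = np.array([3, 2])

print(M @ v)             # [6 6]
```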
TRANSFORMATION VIEW
+-----------------------------------------------------------+
| |
| A matrix defines a transformation of space. |
| Multiplying transforms points. |
| |
| Rotation: Points rotate around origin |
| Scaling: Points stretch/compress |
| Projection: Higher-dim -> lower-dim |
| Combination: All of the above |
| |
| Neural network layer = matrix multiply + activation |
| Each layer transforms the representation into a new |
| space. |
| |
+-----------------------------------------------------------+
DIMENSION CHANGES
+-----------------------------------------------------------+
| |
| Matrix shape: (output_dim, input_dim) |
| Vector shape: (input_dim,) |
| Result shape: (output_dim,) |
| |
| Example: |
| Input embedding: 768 dimensions |
| Weight matrix: (3072, 768) |
| Output: 3072 dimensions <-- expanded |
| |
| Transformer FFN: 768 -> 3072 -> 768 (expand then |
| compress) |
| |
+-----------------------------------------------------------+
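A shape-level sketch of that expand-then-compress FFN, using random matrices as stand-ins for learned weights (real FFNs also add biases and typically use GELU rather than ReLU):

```python
import numpy as np

d_model, d_ff = 768, 3072

x = np.random.randn(d_model)              # one token's 768-dim representation
W_up = np.random.randn(d_ff, d_model)     # (output_dim, input_dim) = (3072, 768)
W_down = np.random.randn(d_model, d_ff)   # (768, 3072)

h = np.maximum(0, W_up @ x)               # expand to 3072 dims + ReLU activation
y = W_down @ h                            # compress back to 768 dims
print(x.shape, h.shape, y.shape)          # (768,) (3072,) (768,)
```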
In Attention
Attention is built from these primitives:
ATTENTION COMPUTATION
+-----------------------------------------------------------+
| |
| 1. Project inputs to Q, K, V spaces: |
| Q = X @ W_Q (768 -> 64 per head) |
| K = X @ W_K |
| V = X @ W_V |
| |
| 2. Compute attention scores: |
| scores = Q @ K.T <-- Dot products between all |
| Q-K pairs |
| |
| 3. Scale and softmax: |
| weights = softmax(scores / sqrt(d_k)) |
| |
| 4. Weighted sum of values: |
| output = weights @ V |
| |
| Each operation is dot products or matrix multiplies. |
| |
+-----------------------------------------------------------+
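A minimal single-head version of steps 1-4 in numpy, with random matrices standing in for the learned W_Q, W_K, W_V and no masking or multi-head logic:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, d_k = 4, 768, 64
rng = np.random.default_rng(0)

X = rng.normal(size=(seq_len, d_model))    # token representations
W_Q = rng.normal(size=(d_model, d_k))      # random stand-ins for learned weights
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # 1. project to Q, K, V (768 -> 64)
scores = Q @ K.T                           # 2. all query-key dot products
weights = softmax(scores / np.sqrt(d_k))   # 3. scale and softmax
output = weights @ V                       # 4. weighted sum of values
print(output.shape)                        # (4, 64)
```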
Why projections? Different W_Q, W_K, W_V let the model learn different “views” of the input. Query projection emphasizes “what am I looking for?” Key projection emphasizes “what do I contain?” Value projection emphasizes “what information should I contribute?”
Dimensionality and Capacity
More dimensions = more capacity to represent distinctions.
DIMENSIONALITY TRADEOFF
+-----------------------------------------------------------+
| |
| 384D: Good separation for many tasks |
| - Fast inference |
| - Small storage |
| - May lose fine distinctions |
| |
| 768D: Rich separation (BERT-sized) |
| - "bank" (financial) far from "bank" (river) |
| - Nuanced relationships preserved |
| |
| 4096D: Maximum expressiveness |
| - Captures subtle distinctions |
| - Expensive to compute and store |
| |
+-----------------------------------------------------------+
Common dimensions:
- Small/fast: 384 (e5-small, all-MiniLM)
- Standard: 768 (BERT, many embedding models)
- Large: 1024-4096 (GPT-scale, high-quality embeddings)
When This Matters
| Situation | Concept to apply |
|---|---|
| Choosing an embedding model | Dimensionality tradeoff |
| Understanding retrieval | Cosine similarity for ranking |
| Understanding attention | Q.K dot products, softmax, V weighting |
| Debugging “wrong results returned” | Check distance metric matches model |
| Understanding layer transformations | Matrix multiply as space transformation |
| Optimizing inference | Dot products are the computational bottleneck |