Attention Mechanism
How self-attention enables transformers to understand context by letting each token attend to all others.
A Sentence as Tokens
The sentence "The cat sat on the mat" is split into tokens, each converted to a vector (embedding). Initially, each token's embedding knows nothing about its neighbors.
"Sat" doesn't know it relates to "cat" or "mat".
- Tokenization: text → tokens
- Embedding: token → vector
- Initially context-independent
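A minimal sketch of this step, assuming a toy whitespace tokenizer and a randomly initialized embedding table (both stand-ins for a real subword tokenizer and learned embeddings):

```python
import numpy as np

sentence = "The cat sat on the mat"
tokens = sentence.lower().split()          # toy tokenizer: ["the", "cat", "sat", "on", "the", "mat"]

vocab = {word: i for i, word in enumerate(sorted(set(tokens)))}
d_model = 8                                # toy embedding dimension

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))   # stand-in for learned embeddings

# Each token becomes a vector; at this point no vector knows about its neighbors.
embeddings = np.stack([embedding_table[vocab[t]] for t in tokens])
print(embeddings.shape)                    # (6, 8): six tokens, eight dimensions each
```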
Query, Key, Value: Three Roles
Each token is projected into three vectors:
- Query (Q): "What information am I seeking?"
- Key (K): "What information do I have to offer?"
- Value (V): "If selected, here's my content."
These are learned linear transformations (W_Q, W_K, W_V).
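A sketch of the three projections, assuming token embeddings X of shape (seq_len, d_model) and random matrices standing in for the learned W_Q, W_K, W_V:

```python
import numpy as np

seq_len, d_model, d_k = 6, 8, 8            # toy sizes
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))    # token embeddings (stand-in)

W_Q = rng.normal(size=(d_model, d_k))      # stand-ins for the learned projection weights
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # what each token is looking for
K = X @ W_K   # what each token has to offer
V = X @ W_V   # the content each token contributes if attended to
```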
Which Tokens Should I Attend To?
To compute attention for "sat", we take its Query and compare to every Key via dot product. High score = similar directions = relevant.
The scores are divided by √d_k, the Key dimension; without this, large dot products push the softmax into regions where gradients vanish.
- Score = Q · Kᵀ
- Higher score = more relevant
- Divide by √d_k for stability
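A sketch of the scoring step for "sat", using random Q and K as stand-ins for the projections above:

```python
import numpy as np

seq_len, d_k = 6, 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(seq_len, d_k))        # stand-in Query vectors
K = rng.normal(size=(seq_len, d_k))        # stand-in Key vectors

q_sat = Q[2]                               # Query for the third token, "sat"
scores = K @ q_sat / np.sqrt(d_k)          # dot product with every Key, divided by sqrt(d_k)
print(scores.shape)                        # (6,): one score per token in the sentence
```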
Softmax: Normalize to Probabilities
Softmax converts scores to probabilities that sum to 1. "Sat" might attend to "cat" with weight 0.35, "mat" with 0.28, and spread the rest.
This is the attention pattern — what the model "looks at".
- Softmax normalizes to [0, 1]
- Weights sum to 1.0
- Differentiable and learnable
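A sketch of the normalization step, with made-up scores for illustration and a numerically stable softmax:

```python
import numpy as np

scores = np.array([0.9, 1.4, 0.3, -0.2, 0.1, 1.1])   # hypothetical scores for "sat"

def softmax(x):
    x = x - x.max()                        # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

weights = softmax(scores)
print(weights, weights.sum())              # non-negative weights that sum to 1.0
```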
Combine Values by Attention
The output for "sat" is a weighted sum of all Value vectors. Highly-attended tokens contribute more to the result.
Now "sat" has a representation that incorporates context from "cat" and "mat".
- Output = Σ (weightᵢ × Vᵢ)
- Context aggregated from relevant tokens
- Attention itself is order-agnostic; positional encodings supply word order
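A sketch of the aggregation step, assuming attention weights for "sat" and a Value matrix V (both stand-ins):

```python
import numpy as np

seq_len, d_k = 6, 8
rng = np.random.default_rng(0)
V = rng.normal(size=(seq_len, d_k))                       # stand-in Value vectors
weights = np.array([0.1, 0.35, 0.15, 0.05, 0.07, 0.28])   # hypothetical attention weights (sum to 1.0)

output_sat = weights @ V                   # weighted sum of all Value vectors
print(output_sat.shape)                    # (8,): a context-aware vector for "sat"
```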
Self-Attention in Parallel
This happens for all tokens simultaneously — a matrix operation. Multi-head attention runs multiple attention patterns in parallel, letting the model learn different relationships.
This is the heart of transformers.
- Fully parallel computation
- Multi-head: different relationship types
- Stacked layers = deeper understanding
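A minimal end-to-end sketch: scaled dot-product attention for every token at once, with a simple two-head split. Sizes and weights are toy stand-ins, and a real transformer would also apply a final output projection after concatenating the heads:

```python
import numpy as np

seq_len, d_model, n_heads = 6, 8, 2
d_head = d_model // n_heads
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))              # token embeddings (stand-in)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # all pairwise scores at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of Values per token

heads = []
for _ in range(n_heads):
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))

output = np.concatenate(heads, axis=-1)              # (seq_len, d_model)
print(output.shape)
```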