Attention Mechanism

How self-attention enables transformers to understand context by letting each token attend to all others.

Based on: Activation Functions, Normalization

A Sentence as Tokens

The sentence "The cat sat on the mat" is split into tokens, each converted to a vector (embedding). Initially, each token's embedding knows nothing about its neighbors.

"Sat" doesn't know it relates to "cat" or "mat".

Tokenization: text → tokens
Embedding: token → vector
Initially context-independent
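
The two steps above can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: the whitespace tokenizer, the embedding size `d_model = 4`, and the randomly filled `embedding_table` are all hypothetical stand-ins for a real subword tokenizer and a learned embedding matrix.

```python
import random

# Hypothetical whitespace tokenizer; real models use subword tokenizers.
sentence = "The cat sat on the mat"
tokens = sentence.lower().split()

# Assumed toy embedding size; real models use far larger vectors.
d_model = 4
rng = random.Random(0)

# One random vector per distinct token, standing in for a learned table.
vocab = sorted(set(tokens))
embedding_table = {
    tok: [rng.uniform(-1, 1) for _ in range(d_model)] for tok in vocab
}

# Lookup: each token maps to its vector regardless of its neighbours.
embeddings = [embedding_table[tok] for tok in tokens]

print(tokens)
print(len(embeddings), "vectors of size", d_model)

# Both occurrences of "the" receive the identical vector: no context yet.
print(embeddings[0] == embeddings[4])
```

The last line prints `True`, which is the point of this slide: before attention, a token's vector is the same wherever it appears.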

Query, Key, Value: Three Roles

Each token is projected into three vectors:

Query (Q): "What information am I seeking?"
Key (K): "What information do I have to offer?"
Value (V): "If selected, here's my content."

These are learned linear transformations (W_Q, W_K, W_V).
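
As a rough sketch of what "learned linear transformation" means here, the snippet below multiplies one token embedding by three matrices. The sizes `d_model` and `d_k`, the helper names `random_matrix` and `project`, and the embedding values for "sat" are assumptions made up for illustration; in a trained model the three matrices come from training rather than a random generator.

```python
import random

rng = random.Random(0)
d_model, d_k = 4, 3  # assumed toy sizes

def random_matrix(rows, cols):
    """Stand-in for a learned weight matrix (random values here)."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def project(x, W):
    """Multiply row vector x (length rows) by matrix W (rows x cols)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

W_Q = random_matrix(d_model, d_k)
W_K = random_matrix(d_model, d_k)
W_V = random_matrix(d_model, d_k)

# Made-up embedding for the token "sat".
x_sat = [0.2, -0.5, 0.9, 0.1]

q_sat = project(x_sat, W_Q)  # what "sat" is looking for
k_sat = project(x_sat, W_K)  # what "sat" offers to others
v_sat = project(x_sat, W_V)  # the content "sat" passes on if attended to

print(len(q_sat), len(k_sat), len(v_sat))
```

The same three matrices are applied to every token, so each token ends up with its own Query, Key, and Value.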

Which Tokens Should I Attend To?

To compute attention for "sat", we take its Query and compare it to every Key via a dot product. A high score means the Query and Key point in similar directions, which the model treats as relevance.

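The score is divided by √d, where d is the dimension of the Query and Key vectors. Without this scaling, large dot products push softmax into saturated regions where gradients become very small.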

Score = Q · K^T
Higher score = more relevant
Divide by √d for stability
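
A small worked example may make the scoring step concrete. The 2-dimensional Query and Key vectors below are invented for illustration, and `d_k` is simply the name used here for the Query/Key dimension.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Made-up Query for "sat" and Keys for three other tokens.
q_sat = [1.0, 2.0]
keys = {"cat": [1.0, 1.0], "on": [-1.0, 0.5], "mat": [0.5, 1.0]}
d_k = len(q_sat)

# Raw score: Query of "sat" dotted with each Key.
raw_scores = {tok: dot(q_sat, k) for tok, k in keys.items()}

# Scaled score: divide by sqrt(d_k) so magnitudes stay moderate.
scaled_scores = {tok: s / math.sqrt(d_k) for tok, s in raw_scores.items()}

for tok in keys:
    print(tok, raw_scores[tok], round(scaled_scores[tok], 4))
```

With these numbers, "cat" scores 3.0, "mat" 2.5, and "on" 0.0 before scaling, so "cat" and "mat" would be ranked as most relevant to "sat".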

Softmax: Normalize to Probabilities

Softmax converts scores to probabilities that sum to 1. "Sat" might attend to "cat" with weight 0.35, "mat" with 0.28, and spread the rest.

This is the attention pattern — what the model "looks at".

Softmax maps each score to a weight between 0 and 1
Weights sum to 1.0
Differentiable, so gradients flow back to the learned projections
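
Softmax itself is short enough to write out. The sketch below uses made-up scores; subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import math

def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    m = max(scores)  # shift by the max to avoid overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scaled scores for three tokens.
scores = [2.0, 1.0, 0.1]
weights = softmax(scores)

print([round(w, 3) for w in weights])  # [0.659, 0.242, 0.099]
print(round(sum(weights), 6))          # 1.0
```

Note how the largest score takes most of the weight: softmax exaggerates differences between scores rather than rescaling them linearly.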

Combine Values by Attention

The output for "sat" is a weighted sum of all Value vectors. Highly-attended tokens contribute more to the result.

Now "sat" has a representation that incorporates context from "cat" and "mat".

Output = Σ (weight × V)
Context aggregated from relevant tokens
Order-agnostic on its own (positional encoding supplies word order)
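
The weighted sum can be sketched directly. The attention weights and 2-dimensional Value vectors below are invented for illustration, and `weighted_sum` is a hypothetical helper name.

```python
# Made-up attention weights for "sat" over three tokens (they sum to 1).
weights = [0.5, 0.3, 0.2]

# Made-up Value vectors for the same three tokens.
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

def weighted_sum(weights, values):
    """Blend the Value vectors, one output entry per dimension."""
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

output = weighted_sum(weights, values)
print(output)

# Reordering the tokens (weights and Values together) leaves the result
# unchanged up to floating-point rounding: the sum has no notion of position.
reordered = weighted_sum(weights[::-1], values[::-1])
print(reordered)
```

The second call illustrates why attention alone cannot tell word order: shuffling the tokens gives the same blend, which is the gap positional encoding is meant to fill.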

Self-Attention in Parallel

This happens for all tokens simultaneously as a single matrix operation. Multi-head attention runs several attention computations in parallel, each with its own learned projections, letting the model learn different relationships.

This is the heart of transformers.

Fully parallel computation
Multi-head: different relationship types
Stacked layers = deeper understanding
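
Putting the steps together, the matrix form can be sketched as below. Everything here is an assumption for illustration: the sizes, the random matrices standing in for learned weights, and the helper names. Real implementations use a tensor library and usually apply a further output projection after concatenating the heads, which this sketch omits.

```python
import math
import random

def matmul(A, B):
    """Plain-Python matrix product of A (n x m) and B (m x p)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def softmax_rows(S):
    """Apply softmax independently to each row of S."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention_head(X, W_Q, W_K, W_V):
    """One head: softmax(Q K^T / sqrt(d_k)) V for all tokens at once."""
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(K[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, transpose(K))]
    weights = softmax_rows(scores)
    return matmul(weights, V), weights

rng = random.Random(0)

def random_matrix(rows, cols):
    """Stand-in for learned weights or embeddings (random values here)."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

n_tokens, d_model, n_heads = 6, 4, 2  # assumed toy sizes
d_k = d_model // n_heads
X = random_matrix(n_tokens, d_model)  # one row per token

head_outputs = []
for _ in range(n_heads):
    # Each head gets its own projection matrices.
    out, weights = attention_head(
        X,
        random_matrix(d_model, d_k),
        random_matrix(d_model, d_k),
        random_matrix(d_model, d_k),
    )
    head_outputs.append(out)

# Concatenate the heads' outputs per token (output projection omitted).
multi_head = [sum((head[i] for head in head_outputs), []) for i in range(n_tokens)]

print(len(multi_head), "tokens,", len(multi_head[0]), "features each")
```

Each row of `weights` is one token's attention pattern over all six tokens, and every row is computed by the same handful of matrix products, which is what makes the computation parallel.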

What's Next?

Self-attention is the foundation of the transformers that power GPT, BERT, and most modern LLMs. Next, explore positional encoding (how transformers know word order), multi-head attention (parallel attention patterns), and transformer architecture (putting it all together).