Attention Mechanism

How self-attention enables transformers to understand context by letting each token attend to all others.

Read this as: "Which tokens should influence this token's representation?"
Failure Trap
Reading attention weights as a faithful human explanation of model reasoning.
Decision Rule
Use attention as learned routing of context; inspect it for behavior, not as standalone proof of intent.
[Figure: Self-attention in a transformer. Step by step: a sentence becomes tokens, each token is projected into Query, Key and Value vectors, attention scores are computed and normalized with softmax, then values are combined into a context-aware output for every token in parallel. Panels: A sentence as tokens; Query, Key, Value; Score "sat" vs all keys; Softmax → weights sum to 1; Combine values by weight; Every token attends in parallel.]

A Sentence as Tokens

The sentence "The cat sat on the mat" is split into tokens, each converted to a vector (embedding). Initially, each token's embedding knows nothing about its neighbors.

At this point, "sat" doesn't know it relates to "cat" or "mat".

  • Tokenization: text → tokens
  • Embedding: token → vector
  • Initially context-independent
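The two steps above can be sketched in a few lines. This is a toy illustration only: the whitespace tokenizer, the `vocab` mapping, and the numbers in `embedding_table` are all made up for the example (real models use subword tokenizers and learned embeddings), and plain Python lists are used to keep it dependency-free.

```python
# Toy tokenizer and embedding lookup (illustrative values, not trained).
sentence = "The cat sat on the mat"

# Assumption: lowercase whitespace tokenization; real models use subword tokenizers.
tokens = sentence.lower().split()

# Hypothetical vocabulary: one integer id per distinct token, in order of first use.
vocab = {}
for tok in tokens:
    vocab.setdefault(tok, len(vocab))

# Hypothetical embedding table: one 4-dimensional vector per vocabulary entry.
embedding_table = [
    [0.1, 0.3, -0.2, 0.5],   # the
    [0.7, -0.1, 0.4, 0.2],   # cat
    [-0.3, 0.8, 0.1, -0.4],  # sat
    [0.0, 0.2, -0.5, 0.1],   # on
    [0.6, -0.2, 0.3, 0.4],   # mat
]

token_ids = [vocab[tok] for tok in tokens]
embeddings = [embedding_table[i] for i in token_ids]

# Both occurrences of "the" get the identical vector: no context yet.
print(embeddings[0] == embeddings[4])
```

Because the lookup depends only on the token id, the two occurrences of "the" are indistinguishable at this stage, which is the context-independence the bullets describe.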

Query, Key, Value: Three Roles

Each token is projected into three vectors:

  • Query (Q): "What information am I seeking?"
  • Key (K): "What information do I have to offer?"
  • Value (V): "If selected, here's my content."

These are learned linear transformations (W_Q, W_K, W_V).
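As a rough sketch of what "projected" means here: each of the three vectors is the token's embedding multiplied by a matrix. The helper name `project`, the 2-dimensional embedding, and the matrix values below are assumptions chosen so the arithmetic is easy to follow; in a trained model W_Q, W_K and W_V are learned.

```python
def project(x, W):
    """Return the row vector x @ W, where W is given as a list of rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

x_sat = [1.0, 2.0]  # hypothetical 2-dimensional embedding of "sat"

# Hypothetical projection matrices (2 x 2); learned in a real model.
W_Q = [[1.0, 0.0], [0.0, 1.0]]  # identity: query equals the embedding
W_K = [[0.5, 0.0], [0.0, 0.5]]  # halves each component
W_V = [[0.0, 1.0], [1.0, 0.0]]  # swaps the two components

q_sat = project(x_sat, W_Q)  # [1.0, 2.0]
k_sat = project(x_sat, W_K)  # [0.5, 1.0]
v_sat = project(x_sat, W_V)  # [2.0, 1.0]
```

One input vector yields three different vectors, each used for a different role in the steps that follow.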

Which Tokens Should I Attend To?

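To compute attention for "sat", we take its Query and compare it with every Key via a dot product. A high score means the Query and Key point in similar directions, which the model treats as relevance.

The scores are divided by √d, where d is the dimension of the Key vectors. Without this scaling, dot products grow with d and push softmax into saturated regions where gradients become very small.

  • Score = Q · Kᵀ / √d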
  • Higher score = more relevant
  • Scale factor √d for stability
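A minimal sketch of the scoring step, assuming 4-dimensional vectors with hand-picked values (the function name `scaled_scores` and all numbers are illustrative, not taken from a real model):

```python
import math

def scaled_scores(query, keys):
    """Dot product of one query with each key, divided by sqrt(d)."""
    d = len(query)
    return [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]

q_sat = [1.0, 0.0, 1.0, 0.0]  # hypothetical Query for "sat"

# Hypothetical Key vectors for three other tokens.
keys = {
    "cat": [2.0, 0.0, 2.0, 0.0],  # same direction as the query
    "on":  [0.0, 1.0, 0.0, 1.0],  # orthogonal to the query
    "mat": [1.0, 1.0, 1.0, 0.0],  # partially aligned
}

scores = dict(zip(keys, scaled_scores(q_sat, list(keys.values()))))
print(scores)  # {'cat': 2.0, 'on': 0.0, 'mat': 1.0}
```

The key aligned with the query scores highest and the orthogonal key scores zero, matching the "similar directions" reading above.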

Softmax: Normalize to Probabilities

Softmax converts the scores into positive weights that sum to 1. For example, "sat" might attend to "cat" with weight 0.35, to "mat" with weight 0.28, and spread the remaining 0.37 across the other tokens.

This is the attention pattern: what the model "looks at".

  • Each weight lies between 0 and 1
  • Weights sum to 1.0
  • Differentiable, so the projections that produce the scores can be trained
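Softmax itself is short enough to write out. The sketch below subtracts the maximum score before exponentiating, a common trick to avoid overflow that does not change the result; the input scores are made-up values.

```python
import math

def softmax(scores):
    """Turn arbitrary real scores into positive weights that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax([2.0, 0.0, 1.0])  # hypothetical scores
print([round(w, 3) for w in weights])  # [0.665, 0.09, 0.245]
print(sum(weights))                    # 1.0 up to floating-point rounding
```

Note that softmax exaggerates gaps: a score difference of 2 becomes a weight ratio of roughly 7 to 1.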

Combine Values by Attention

The output for "sat" is a weighted sum of all Value vectors. Highly attended tokens contribute more to the result.

Now "sat" has a representation that incorporates context from "cat" and "mat".

  • Output = Σ (weight × V)
  • Context aggregated from relevant tokens
  • Position-agnostic: the weighted sum ignores token order, so positional encodings are added to supply it
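The weighted sum can be sketched as follows. The weights and 2-dimensional Value vectors are invented so the result is easy to check by hand; `weighted_sum` is a hypothetical helper name.

```python
def weighted_sum(weights, values):
    """Combine value vectors, each scaled by its attention weight."""
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

weights = [0.5, 0.25, 0.25]                    # hypothetical attention weights
values = [[4.0, 0.0], [0.0, 8.0], [4.0, 4.0]]  # hypothetical Value vectors

output = weighted_sum(weights, values)
print(output)  # [3.0, 3.0]

# Reordering the (weight, value) pairs together gives the same output:
# the sum carries no information about token order.
shuffled = weighted_sum(weights[::-1], values[::-1])
print(shuffled == output)  # True
```

The second call illustrates why the operation is position-agnostic: only the pairing of weight and value matters, not where the token sits in the sequence.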

Self-Attention in Parallel

This happens for all tokens simultaneously as a single matrix operation. Multi-head attention runs several attention patterns in parallel, each with its own projections, letting the model learn different relationships.

This is the heart of transformers.

  • Fully parallel computation
  • Multi-head: different relationship types
  • Stacked layers: each layer refines the previous layer's context-aware representations