Read this as: Which relationships can different heads attend to in parallel?
- Failure Trap: Treating multi-head attention as duplicated attention rather than separate learned projections.
- Decision Rule: Split Q/K/V into heads, let each head mix values, then concatenate and project through W_O.
One head learns one attention pattern
A single attention head scores how much each token should read from every other token.
- Queries compare against keys.
- Scores are scaled by the square root of the key dimension and passed through softmax to become weights.
- Weights mix value vectors.
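The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names, the shapes, and the random inputs are assumptions chosen for the example, and masking is left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def single_head_attention(q, k, v):
    """q, k: (seq_len, d_k); v: (seq_len, d_v). Returns (output, weights)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)     # queries compare against keys
    weights = softmax(scores, axis=-1)  # each row becomes weights that sum to 1
    output = weights @ v                # weights mix value vectors
    return output, weights

# Random stand-ins for one head's queries, keys, and values (4 tokens, width 8).
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(4, 8))
v = rng.normal(size=(4, 8))
output, weights = single_head_attention(q, k, v)
```

Row i of `weights` is the attention pattern for token i: how much it reads from each position in the sequence.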
Multi-head splits the representation
The model projects each token's representation into queries, keys, and values in several smaller head spaces instead of one large attention space.
- Heads have separate learned projections.
- They run in parallel.
- Each sees the same sequence through a different lens.
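As a rough sketch of what "separate learned projections" means, the snippet below gives each head its own query, key, and value matrix and applies all of them to the same input. The sizes and the names `w_q`, `w_k`, `w_v` are assumptions for illustration, and the weights are random stand-ins for learned parameters.

```python
import numpy as np

seq_len, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads  # each head works in a smaller space

rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))  # one row per token

# Random stand-ins for learned weights: one projection matrix per head.
w_q = rng.normal(size=(n_heads, d_model, d_head))
w_k = rng.normal(size=(n_heads, d_model, d_head))
w_v = rng.normal(size=(n_heads, d_model, d_head))

# Every head sees the same x, but through its own projection.
q = np.einsum("sd,hde->hse", x, w_q)  # (n_heads, seq_len, d_head)
k = np.einsum("sd,hde->hse", x, w_k)
v = np.einsum("sd,hde->hse", x, w_v)
```

Many implementations store the per-head matrices as one wide matrix and reshape the result, which computes the same thing; the per-head layout here just makes the separation visible.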
Heads can specialize
One head might track syntax, while another tracks coreference or long-range dependency patterns.
- Specialization is learned, not assigned.
- Do not over-read a single head's pattern as proof of what the model is doing.
- The benefit is multiple relation channels.
Each head produces weighted values
Every head computes its own weighted sum of value vectors for each token position.
- The outputs share one shape: one vector per token, with the same width in every head.
- Each output carries a different context mix.
- No head is the whole answer.
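One way to picture the per-head computation is as a batched operation over a leading head axis. The sketch below assumes the per-head queries, keys, and values already exist (random stand-ins here) and uses hypothetical sizes.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

n_heads, seq_len, d_head = 4, 5, 4
rng = np.random.default_rng(0)
# Stand-ins for already-projected per-head queries, keys, and values.
q = rng.normal(size=(n_heads, seq_len, d_head))
k = rng.normal(size=(n_heads, seq_len, d_head))
v = rng.normal(size=(n_heads, seq_len, d_head))

# Batched over the head axis: each head gets its own score matrix.
scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq_len, seq_len)
weights = softmax(scores, axis=-1)                   # rows sum to 1 within each head
head_outputs = weights @ v                           # (n_heads, seq_len, d_head)
```

Each slice `head_outputs[h]` has the same shape, but it was mixed with that head's own weights, so it carries a different blend of context.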
Outputs concatenate
The per-head outputs are concatenated back into one vector for each token.
- In the standard setup, where each of h heads has width d_model / h, concatenation restores the model width.
- All heads contribute side by side.
- The next projection blends them.
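Concatenation is only a reshuffle of axes. A small sketch, assuming per-head outputs stacked along a leading head axis (random stand-ins with hypothetical sizes):

```python
import numpy as np

n_heads, seq_len, d_head = 4, 5, 4
rng = np.random.default_rng(0)
# Stand-in for the per-head outputs, stacked along the first axis.
head_outputs = rng.normal(size=(n_heads, seq_len, d_head))

# Move the head axis next to the feature axis, then flatten the two together.
concat = head_outputs.transpose(1, 0, 2).reshape(seq_len, n_heads * d_head)
```

Columns 0 to 3 of `concat` hold head 0, columns 4 to 7 hold head 1, and so on, so every token ends up with one vector of width `n_heads * d_head`.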
W_O projects the combined result
The output projection mixes the concatenated heads into the residual stream update.
- The block receives one update vector per token.
- The projection learns how to blend heads.
- Then Add and Norm takes over.
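The output projection can be sketched as a single matrix multiply followed by the residual add. Everything below is an illustrative assumption: `w_o` is a random stand-in for the learned W_O, `concat` stands in for the concatenated head outputs, and the normalization step is omitted.

```python
import numpy as np

n_heads, seq_len, d_head = 4, 5, 4
d_model = n_heads * d_head
rng = np.random.default_rng(0)

x = rng.normal(size=(seq_len, d_model))       # residual stream entering the block
concat = rng.normal(size=(seq_len, d_model))  # stand-in for concatenated head outputs
w_o = rng.normal(size=(d_model, d_model))     # random stand-in for the learned W_O

update = concat @ w_o  # one update vector per token
residual = x + update  # the "Add" step; normalization would follow

# Equivalent view: each head contributes through its own row block of W_O.
per_head = sum(
    concat[:, h * d_head:(h + 1) * d_head] @ w_o[h * d_head:(h + 1) * d_head, :]
    for h in range(n_heads)
)
```

The `per_head` sum matches `update` up to floating-point error, which is one way to see that W_O decides how much each head's output feeds into every dimension of the update.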