Multi-Head Attention: Many Patterns at Once | Explainers

Read this as: Which relationships can different heads attend to in parallel?

Failure Trap: Treating multi-head attention as duplicated attention rather than separate learned projections.
Decision Rule: Split Q/K/V into heads, let each head mix values, then concatenate and project through W_O.


One head learns one attention pattern

A single attention head scores how much each token should read from every other token.

Queries compare against keys.
Scores become weights.
Weights mix value vectors.
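
The three steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the toy Q, K, and V values are made up, and dividing scores by the square root of the key width is the common "scaled dot-product" convention, assumed here rather than stated in the text.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """One attention head; Q, K, V are lists of row vectors, one per token."""
    d_k = len(K[0])
    out, weights = [], []
    for q in Q:
        # Queries compare against keys: one score per key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Scores become weights that sum to 1.
        w = softmax(scores)
        # Weights mix value vectors, column by column.
        mixed = [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        weights.append(w)
        out.append(mixed)
    return out, weights

# Toy inputs (hypothetical): 3 tokens, width 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out, weights = attention(Q, K, V)
```

Each row of `weights` is one token's attention pattern, and each row of `out` is the matching blend of value vectors.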

Multi-head splits the representation

The model projects Q, K, and V into several smaller head spaces instead of one large attention space.

Heads have separate learned projections.
They run in parallel.
Each sees the same sequence through a different lens.
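
A small sketch of "separate learned projections", assuming a model width of 4 split across 2 heads. The names `W_Q`, `d_model`, and `d_head` and the hand-picked weights are hypothetical; trained projections would be dense, and keys and values would get their own per-head matrices in the same way.

```python
def matmul(A, B):
    # Plain-Python matrix product: (n x m) @ (m x p) -> (n x p).
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

d_model, n_heads = 4, 2
d_head = d_model // n_heads

# Two tokens of width d_model (toy values).
X = [[1.0, 2.0, 3.0, 4.0],
     [0.0, 1.0, 0.0, 1.0]]

# One query projection per head, each d_model x d_head.
# For readability, head 0 reads the first two features and head 1 the last two.
W_Q = [
    [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
]

# Every head sees the same sequence X through its own projection.
Q_heads = [matmul(X, W) for W in W_Q]
# Q_heads[0] == [[1.0, 2.0], [0.0, 1.0]]
# Q_heads[1] == [[3.0, 4.0], [0.0, 1.0]]
```

Because the heads do not depend on one another, the list comprehension could be replaced by one batched matrix product in a real implementation.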

Heads can specialize

One head might track syntax, while another tracks coreference or long-range dependency patterns.

Specialization is learned, not assigned.
Do not over-read one head's pattern as proof of what the model does.
The benefit is multiple relation channels.

Each head produces weighted values

Every head computes its own weighted sum of value vectors for each token position.

Every head's output has the same shape: one head-width vector per token.
Each output carries a different context mix.
No head is the whole answer.
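
Written out in the usual scaled dot-product notation, the per-head computation looks roughly like this. The symbols are assumptions, not taken from the text: X is the block input, W_i^Q, W_i^K, and W_i^V are head i's learned projections, and d_k is the per-head key width.

```latex
\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i,
\qquad Q_i = X W_i^{Q},\quad K_i = X W_i^{K},\quad V_i = X W_i^{V}
```

Each row of the softmax term sums to 1, so every row of head_i is a weighted average of that head's value vectors.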

Outputs concatenate

The per-head outputs are concatenated back into one vector for each token.

Concatenation restores the model width.
All heads contribute side by side.
The next projection blends them.
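
Concatenation itself is simple bookkeeping, as this sketch suggests. The helper name `concat_heads` and the two head outputs are hypothetical, chosen so the result is easy to read.

```python
def concat_heads(head_outputs):
    """Join per-head outputs token by token: h x (n x d_head) -> n x (h * d_head)."""
    n_tokens = len(head_outputs[0])
    return [
        [x for head in head_outputs for x in head[t]]
        for t in range(n_tokens)
    ]

# Outputs of two heads for three tokens, d_head = 2 (toy values).
head_0 = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
head_1 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

combined = concat_heads([head_0, head_1])
# combined[0] == [0.1, 0.2, 1.0, 2.0]
```

With 2 heads of width 2, each token's combined vector is back to width 4. Nothing has been mixed yet: head 0 occupies the first two slots and head 1 the last two.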

W_O projects the combined result

The output projection mixes the concatenated heads into the residual stream update.

The block receives one update vector per token.
The projection learns how to blend heads.
Then Add and Norm takes over.
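
A last sketch shows the output projection and the residual add, under stated assumptions: the `W_O` values are hand-picked rather than learned, `x` stands in for the block's input, and the normalization half of Add and Norm is left out.

```python
def matmul(A, B):
    # Plain-Python matrix product: (n x m) @ (m x p) -> (n x p).
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Concatenated head outputs: 2 tokens, 2 heads of width 2 (toy values).
concat = [[1.0, 2.0, 3.0, 4.0],
          [0.0, 1.0, 0.0, 1.0]]

# Hypothetical W_O (d_model x d_model). The first two output features sum
# the heads (scaled by 0.5); the last two take their scaled difference.
W_O = [[0.5, 0.0,  0.5,  0.0],
       [0.0, 0.5,  0.0,  0.5],
       [0.5, 0.0, -0.5,  0.0],
       [0.0, 0.5,  0.0, -0.5]]

# One update vector per token, with information from both heads blended.
update = matmul(concat, W_O)
# update == [[2.0, 3.0, -1.0, -1.0], [0.0, 1.0, 0.0, 0.0]]

# The "Add" step: the update is added to the block input x (toy values).
x = [[1.0, 1.0, 1.0, 1.0],
     [2.0, 2.0, 2.0, 2.0]]
residual = [[xi + ui for xi, ui in zip(x_row, u_row)] for x_row, u_row in zip(x, update)]
```

Unlike concatenation, every output feature here can draw on every head, which is where the blend across heads actually happens.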