Multi-Head Attention: Many Patterns at Once

See how multiple attention heads can track different relationships before their outputs are concatenated and projected.

Read this as: which relationships can different heads attend to in parallel?
Failure Trap
Treating multi-head attention as duplicated attention rather than separate learned projections.
Decision Rule
Split Q/K/V into heads, let each head mix values, then concatenate and project through W_O.
[Diagram, six panels: One pattern (Q-to-K attention, weighted V) · Split heads (h heads, smaller dims, in parallel) · Patterns (syntax, coreference, position cues, relations) · Head outputs (head A, head B, ... head H, each a value mix) · Concat (join heads into one full-width d_model vector) · Output proj (W_O blends heads and updates the residual stream)]

One head learns one attention pattern

A single attention head scores how much each token should read from every other token.

  • Queries compare against keys.
  • Scores become weights.
  • Weights mix value vectors.
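
The three steps above can be sketched for a single head in NumPy. This is a minimal illustration, not a reference implementation: the function names, the random inputs, and the sizes are made up for the example, and the division by sqrt(d_k) is the usual scaled dot-product convention, which this page does not spell out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # queries compare against keys
    weights = softmax(scores, axis=-1)  # scores become weights (rows sum to 1)
    return weights @ V, weights         # weights mix value vectors

# Hypothetical toy inputs: 4 tokens, 8 dimensions.
rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

out, weights = single_head_attention(Q, K, V)
print(out.shape)      # (4, 8): one mixed value vector per token
print(weights.shape)  # (4, 4): one weight per (reader, source) token pair
```

Row i of `weights` says how much token i reads from every token in the sequence, which is the single "attention pattern" this head expresses.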

Multi-head splits the representation

The model projects each token's representation into queries, keys, and values in several smaller head spaces instead of one large attention space.

  • Heads have separate learned projections.
  • They run in parallel.
  • Each sees the same sequence through a different lens.
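
One way to picture the separate projections is as one small weight matrix per head for each of Q, K, and V. The sketch below assumes d_model = 16 split evenly across 4 heads and uses random matrices as stand-ins for learned weights; all names and sizes are illustrative, not taken from any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads  # each head works in a smaller space

X = rng.normal(size=(seq_len, d_model))  # one row per token

# Random stand-ins for learned projections:
# one (d_model, d_head) matrix per head for each of Q, K, V.
W_q = rng.normal(size=(n_heads, d_model, d_head))
W_k = rng.normal(size=(n_heads, d_model, d_head))
W_v = rng.normal(size=(n_heads, d_model, d_head))

# Every head sees the same X, but through its own projection.
Q = np.einsum("sd,hde->hse", X, W_q)  # (n_heads, seq_len, d_head)
K = np.einsum("sd,hde->hse", X, W_k)
V = np.einsum("sd,hde->hse", X, W_v)

print(Q.shape)  # (4, 5, 4)
```

Because the leading axis indexes heads, all heads are computed in one batched operation, which is what "in parallel" amounts to in practice. Since the weights differ per head, `Q[0]` and `Q[1]` are different views of the same tokens, not copies.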

Heads can specialize

One head might track syntax, while another tracks coreference or long-range dependency patterns.

  • Specialization is learned, not assigned.
  • Do not over-read a single head's pattern as proof of what the model is doing.
  • The benefit is multiple relation channels.

Each head produces weighted values

Every head computes its own weighted sum of value vectors for each token position.

  • The outputs share one shape: a head-width vector per token.
  • Each output carries a different context mix.
  • No head is the whole answer.

Outputs concatenate

The per-head outputs are concatenated back into one vector for each token.

  • Concatenation restores the model width.
  • All heads contribute side by side.
  • The next projection blends them.
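
The concatenation step is mostly bookkeeping on array shapes. The sketch below uses random placeholder head outputs and assumes the head width times the number of heads equals the model width (4 x 4 = 16 here); the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, seq_len, d_head = 4, 5, 4

# Placeholder per-head outputs: (n_heads, seq_len, d_head).
head_outputs = rng.normal(size=(n_heads, seq_len, d_head))

# Join the heads side by side along the feature axis.
concat = np.concatenate(list(head_outputs), axis=-1)  # (seq_len, n_heads * d_head)

# The same result via transpose + reshape, a common idiom in batched code.
concat_alt = head_outputs.transpose(1, 0, 2).reshape(seq_len, n_heads * d_head)

print(concat.shape)  # (5, 16)
```

Columns 0 to 3 of `concat` hold head 0, columns 4 to 7 hold head 1, and so on. Nothing has been mixed across heads at this point.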

W_O projects the combined result

The output projection mixes the concatenated heads into the residual stream update.

  • The block receives one update vector per token.
  • The projection learns how to blend heads.
  • Then Add and Norm takes over.
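
Putting the pieces together, the whole path from split to W_O can be sketched end to end. As before, this is an illustration under assumptions: random matrices stand in for learned weights, the sqrt(d_head) scaling is the standard convention rather than something stated on this page, no masking is applied, and only the "Add" half of Add and Norm is shown.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o):
    # X: (seq_len, d_model)
    # W_q, W_k, W_v: (n_heads, d_model, d_head); W_o: (n_heads * d_head, d_model)
    n_heads, _, d_head = W_q.shape
    seq_len = X.shape[0]
    Q = np.einsum("sd,hde->hse", X, W_q)  # split into heads
    K = np.einsum("sd,hde->hse", X, W_k)
    V = np.einsum("sd,hde->hse", X, W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                  # each head mixes its values
    concat = heads.transpose(1, 0, 2).reshape(seq_len, n_heads * d_head)
    return concat @ W_o                                  # W_O blends the heads

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads

X = rng.normal(size=(seq_len, d_model))
W_q = rng.normal(size=(n_heads, d_model, d_head))
W_k = rng.normal(size=(n_heads, d_model, d_head))
W_v = rng.normal(size=(n_heads, d_model, d_head))
W_o = rng.normal(size=(n_heads * d_head, d_model))

update = multi_head_attention(X, W_q, W_k, W_v, W_o)
residual = X + update  # the "Add" step; normalization is omitted in this sketch

print(update.shape)  # (5, 16): one update vector per token
```

Each output column of `concat @ W_o` is a weighted combination of all concatenated head features, so W_O is the first place where information from different heads is blended.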