Positional Encoding: Order via Rotation

See why attention needs position information and how RoPE turns token position into relative rotation.

Read this as: How does the model know token order?
Failure Trap
Assuming self-attention automatically knows sequence order without positional signal.
Decision Rule
Inject position before attention; with RoPE, rotate Q and K so dot products encode relative offsets.
[Figure: six panels. (1) Order missing: the same tokens shuffled are ambiguous, so position is needed. (2) Add waves: sin and cos at many scales, absolute position. (3) Rotate: Q and K turned by m times theta according to position (RoPE). (4) Position pair: tokens m and n get different angles. (5) Relative score: Q dot K depends on n - m, an order-aware offset. (6) Multi-scale: fast and slow turns cover many ranges.]

Self-attention alone has no order

If token representations carry no positional signal, attention sees a bag of token vectors rather than an ordered sequence.

  • Reordering tokens changes meaning in language, so it should change what the model computes.
  • Attention scores need position somewhere.
  • The signal is injected before or inside attention.
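
The "bag of token vectors" claim can be checked numerically. The sketch below is a toy, not a real transformer layer: it assumes a single head with identity projections (Q = K = V = the inputs), and the names `self_attention`, `tokens`, and `perm` are made up for illustration. It shows that shuffling the inputs only shuffles the outputs, so nothing in the result records where a token sat.

```python
import math

def softmax(row):
    # Subtract the max for numerical stability before exponentiating.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs):
    """Single-head attention with Q = K = V = xs (identity projections, assumed for brevity)."""
    d = len(xs[0])
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, xs)) for i in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy token vectors, no position added
perm = [2, 0, 1]                                # a hypothetical reordering
shuffled = [tokens[i] for i in perm]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# Each token receives the same output vector wherever it is placed in the sequence.
same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for i, p in enumerate(perm)
    for a, b in zip(out_shuffled[i], out[p])
)
print(same)  # True
```

Learned projection matrices would not change this: they act on each token independently, so the same equivariance holds until a positional signal is introduced.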

Sinusoidal encodings add position

Classic positional encodings add sine and cosine features at multiple frequencies to token embeddings.

  • Each position has a pattern.
  • Frequencies cover short and long ranges.
  • The vector now carries order information.
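
A minimal sketch of the additive scheme follows. It assumes the common convention of interleaving sine and cosine features with frequencies that decay geometrically from a base of 10000; the function name `sinusoidal_encoding` and the toy `embedding` values are illustrative only.

```python
import math

def sinusoidal_encoding(pos, d_model, base=10000.0):
    """Return the sin/cos encoding for one position (d_model must be even)."""
    enc = []
    for i in range(0, d_model, 2):
        freq = base ** (-i / d_model)      # high frequency for early dims, low for later dims
        enc.append(math.sin(pos * freq))
        enc.append(math.cos(pos * freq))
    return enc

embedding = [0.5, -0.2, 0.1, 0.9]          # the same toy token embedding at two positions

x_at_0 = [e + p for e, p in zip(embedding, sinusoidal_encoding(0, 4))]
x_at_5 = [e + p for e, p in zip(embedding, sinusoidal_encoding(5, 4))]

print(x_at_0)
print(x_at_5)
```

The two printed vectors differ even though the token is the same, which is the order information the attention layers can then read.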

RoPE rotates query and key features

Rotary position embeddings rotate pairs of Q and K dimensions by an angle tied to token position.

  • For a dimension pair with frequency theta, position m maps to angle m times theta.
  • Value vectors are left unrotated; only Q and K are rotated.
  • Rotation happens before attention scores.
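
The rotation step can be sketched as below. This assumes adjacent dimensions are paired, (0, 1), (2, 3), and so on, with per-pair frequencies taken from a base of 10000; some implementations pair dimensions differently. The name `rope_rotate` is hypothetical.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate each consecutive pair of dimensions by pos * theta_i (len(vec) must be even)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)           # one frequency per dimension pair
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])   # standard 2-D rotation
    return out

q = [1.0, 0.0, 0.5, 0.5]

q_at_0 = rope_rotate(q, pos=0)   # angle 0 everywhere, so the vector is unchanged
q_at_3 = rope_rotate(q, pos=3)   # same content, different orientation

norm = lambda v: math.sqrt(sum(x * x for x in v))
print(q_at_0 == q)                                   # True
print(math.isclose(norm(q_at_3), norm(q)))           # True: rotation preserves length
```

In use, the same function would be applied to every query and key before the score matrix is computed, and not to the values.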

Two tokens carry different rotations

Tokens at positions m and n receive different rotations, even if their content vectors are similar.

  • The content remains in the vector.
  • The orientation now encodes location.
  • Pairs can compare relative offset.

The dot product reads relative position

When rotated Q and K vectors are compared, position enters the score only through the relative offset n - m, not through the absolute positions m and n.

  • Attention keeps content matching.
  • The score also carries the signed offset, so both distance and direction are available.
  • This makes relative order available to heads.
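
The reason is that rotations compose: rotating q by m times theta and k by n times theta, then taking the dot product, is the same as rotating k by (n - m) times theta and comparing it with the unrotated q. The sketch below checks this for a single 2-D pair; the frequency `theta=0.1` and the names `rotate_pair` and `rope_score` are arbitrary choices for illustration.

```python
import math

def rotate_pair(x, y, angle):
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def rope_score(q, k, m, n, theta=0.1):
    """Dot product of q rotated to position m and k rotated to position n (one 2-D pair)."""
    qm = rotate_pair(q[0], q[1], m * theta)
    kn = rotate_pair(k[0], k[1], n * theta)
    return qm[0] * kn[0] + qm[1] * kn[1]

q, k = (1.0, 2.0), (0.5, -1.0)

near_start = rope_score(q, k, m=2, n=5)       # offset n - m = 3
far_along = rope_score(q, k, m=40, n=43)      # same offset, both positions shifted by 38
other_offset = rope_score(q, k, m=2, n=6)     # offset n - m = 4

print(math.isclose(near_start, far_along, abs_tol=1e-9))     # True
print(math.isclose(near_start, other_offset, abs_tol=1e-9))  # False
```

Shifting both tokens by the same amount leaves the score unchanged, while changing the gap between them changes it.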

Multiple frequencies cover multiple scales

Different rotation frequencies let the model represent nearby and longer-range position patterns.

  • Short scales help local syntax.
  • Long scales help broader context.
  • The positional signal enters through the attention scores themselves, since it lives in the rotated Q and K.
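
To make "fast turns" and "slow turns" concrete, the sketch below lists one frequency per dimension pair and the number of positions each needs for a full turn. It assumes the common geometric schedule with base 10000 and a toy head dimension of 8; `rope_frequencies` is a hypothetical helper name.

```python
import math

def rope_frequencies(d, base=10000.0):
    """One rotation frequency (radians per position) for each dimension pair."""
    return [base ** (-i / d) for i in range(0, d, 2)]

freqs = rope_frequencies(8)
wavelengths = [2 * math.pi / f for f in freqs]   # positions needed for one full turn

for f, w in zip(freqs, wavelengths):
    print(f"{f:.4f} rad/position, full turn every {w:.0f} positions")
```

With these settings the first pair completes a turn in roughly 6 positions and the last in roughly 6283, so the early pairs distinguish neighboring tokens while the later pairs change slowly enough to separate tokens that are far apart.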