Read this as: How does the model know token order?
- Failure Trap: Assuming self-attention automatically knows sequence order without a positional signal.
- Decision Rule: Inject position before or inside attention; with RoPE, rotate Q and K so their dot products encode relative offsets.
Self-attention alone has no order
If token representations carry no positional signal, attention sees a bag of token vectors rather than an ordered sequence.
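- Word order changes meaning in language, so permuting the tokens should change the result.
- Without a positional signal, permuting the input only permutes the output, so the attention scores need position from somewhere.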
- The signal is injected before or inside attention.
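The bag-of-vectors behaviour can be checked directly. The sketch below is a toy single-head attention in plain Python, under the simplifying assumption that the Q, K, and V projections are the identity; the names `self_attention`, `tokens`, and `perm` are illustrative and not taken from any library. It shows that shuffling the input rows only shuffles the output rows.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    # Toy single-head attention; Q = K = V = X (identity projections, an assumption).
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        out.append([sum(w[j] * X[j][c] for j in range(len(X))) for c in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 2.0], [1.5, -1.0]]  # no positional signal anywhere
perm = [2, 0, 1]
shuffled = [tokens[i] for i in perm]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# Row i of the shuffled output should equal row perm[i] of the original output.
same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for i, p in enumerate(perm)
    for a, b in zip(out_shuffled[i], out[p])
)
print(same)
```

Each token gets the same output vector regardless of where it sits in the sequence, which is why order has to be supplied separately.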
Sinusoidal encodings add position
Classic positional encodings add sine and cosine features at multiple frequencies to token embeddings.
- Each position gets its own pattern of sine and cosine values.
- Frequencies span a geometric range, so some dimensions change quickly with position and others slowly.
- The vector now carries order information.
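As a minimal sketch of the idea above, the following assumes the commonly used formulation with base 10000 and an even embedding size; `sinusoidal_encoding` and `add_position` are hypothetical helper names used only for illustration.

```python
import math

def sinusoidal_encoding(pos, d_model, base=10000.0):
    # Even index 2i holds sin(pos / base^(2i/d_model)); odd index 2i+1 holds the cosine.
    # base=10000 is the conventional choice and is an assumption here.
    pe = []
    for i in range(d_model // 2):
        angle = pos / (base ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def add_position(embedding, pos):
    # Classic scheme: the encoding is added element-wise to the token embedding.
    pe = sinusoidal_encoding(pos, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]

emb = [0.5, -0.2, 0.1, 0.9]   # same token content at two positions
at_0 = add_position(emb, 0)
at_3 = add_position(emb, 3)
print(at_0)
print(at_3)
```

The same embedding yields different vectors at positions 0 and 3, so downstream attention can tell the two occurrences apart.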
RoPE rotates query and key features
Rotary position embeddings rotate pairs of Q and K dimensions by an angle tied to token position.
- For each dimension pair, position m maps to angle m times theta, where theta is that pair's frequency.
- Value vectors are left unrotated.
- Rotation is applied to Q and K before the attention scores are computed.
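The rotation can be sketched in a few lines. This version assumes adjacent dimensions (2i, 2i+1) form a pair and that pair i uses frequency base^(-2i/d) with base 10000; real implementations differ in how they pair dimensions, and `rope_rotate` is a hypothetical name chosen for this example.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    # Rotate each adjacent pair (vec[2i], vec[2i+1]) by angle pos * theta_i.
    # The pairing convention and base value are assumptions of this sketch.
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

q = [1.0, 0.0, 0.5, 0.5]
q_rot = rope_rotate(q, pos=2)

norm_before = math.sqrt(sum(v * v for v in q))
norm_after = math.sqrt(sum(v * v for v in q_rot))
print(q_rot)
print(norm_before, norm_after)
```

Because each pair is only rotated, the vector's length is unchanged; the same function would be applied to both queries and keys, while values are passed through as they are.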
Two tokens carry different rotations
Tokens at positions m and n receive different rotations, even if their content vectors are similar.
- The content remains in the vector.
- The orientation now encodes location.
- Pairs can compare relative offset.
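Why a pair of rotated vectors exposes the offset can be seen in a short worked equation. It treats a single two-dimensional pair with frequency theta; the symbols R, q, and k are notation introduced here for illustration, with q at position m and k at position n.

```latex
R(\alpha) =
\begin{pmatrix}
\cos\alpha & -\sin\alpha \\
\sin\alpha & \cos\alpha
\end{pmatrix},
\qquad
\bigl(R(m\theta)\,q\bigr)^{\top}\bigl(R(n\theta)\,k\bigr)
= q^{\top} R(m\theta)^{\top} R(n\theta)\,k
= q^{\top} R\bigl((n-m)\theta\bigr)\,k .
```

The step uses two facts about rotation matrices: the transpose of R(alpha) is R(-alpha), and rotations compose by adding angles. The absolute positions m and n cancel, leaving only n - m.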
The dot product reads relative position
When rotated Q and K vectors are compared, the score depends on their content and on position only through the relative offset between the two tokens, not on their absolute positions.
- Attention keeps content matching.
- The score also carries distance information.
- This makes relative order available to heads.
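A quick numerical check of this property, as a sketch under the same assumptions as a typical RoPE formulation (adjacent dimension pairs, base 10000): `rope_rotate` and `dot` are illustrative helpers, and the vectors are arbitrary made-up values.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    # Rotate each adjacent pair by pos * theta_i (pairing and base are assumptions).
    d = len(vec)
    out = []
    for i in range(d // 2):
        angle = pos * base ** (-2 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

score_a = dot(rope_rotate(q, 2), rope_rotate(k, 5))     # positions 2 and 5, offset 3
score_b = dot(rope_rotate(q, 40), rope_rotate(k, 43))   # positions 40 and 43, offset 3
score_c = dot(rope_rotate(q, 2), rope_rotate(k, 9))     # positions 2 and 9, offset 7

print(score_a, score_b, score_c)
```

The first two scores agree up to floating-point error because the offset is the same, while the third differs because the offset changed; the content vectors are identical in all three cases.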
Multiple frequencies cover multiple scales
Different rotation frequencies let the model represent nearby and longer-range position patterns.
- Short scales help local syntax.
- Long scales help broader context.
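- Because Q and K are rotated inside attention, the positional signal enters through the attention scores themselves rather than being added to the embeddings.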
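To make the scales concrete, the sketch below lists how many positions each dimension pair needs to complete one full rotation. It assumes the common frequency schedule theta_i = base^(-2i/d) with base 10000; `rope_wavelengths` is a hypothetical helper, and the head size of 8 is chosen only to keep the output short.

```python
import math

def rope_wavelengths(d, base=10000.0):
    # Wavelength of pair i = 2*pi / theta_i, with theta_i = base^(-2i/d) (assumed schedule).
    return [2 * math.pi / (base ** (-2 * i / d)) for i in range(d // 2)]

wavelengths = rope_wavelengths(8)
for i, w in enumerate(wavelengths):
    print(f"pair {i}: one full rotation every ~{w:.1f} positions")
```

The first pair cycles every few tokens, which suits local patterns, while later pairs turn far more slowly and can still distinguish positions that are hundreds or thousands of tokens apart.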