Positional Encoding: Order via Rotation | Explainers

Read this as: How does the model know token order?

Failure Trap: Assuming self-attention automatically knows sequence order without positional signal.
Decision Rule: Inject position before attention; with RoPE, rotate Q and K so dot products encode relative offsets.


Self-attention alone has no order

If token representations carry no positional signal, attention sees a bag of token vectors rather than an ordered sequence.

Word order matters in language, yet shuffling position-free inputs only shuffles the outputs.
Attention scores need position somewhere.
The signal is injected before or inside attention.
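
To make the "bag of token vectors" point concrete, here is a minimal sketch in plain Python. It assumes a single head with identity projections (Q = K = V = X), which is a simplification for illustration rather than how a trained layer is set up. The names `self_attention`, `X`, and `perm` are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Single-head attention with Q = K = V = X (identity projections, an assumption)."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)
        out.append([sum(w[j] * X[j][c] for j in range(len(X))) for c in range(d)])
    return out

# Three toy token vectors with no positional signal.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]
X_shuffled = [X[i] for i in perm]

out = self_attention(X)
out_shuffled = self_attention(X_shuffled)

# Each token receives the same output vector wherever it sits in the sequence.
same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for i, p in enumerate(perm)
    for a, b in zip(out_shuffled[i], out[p])
)
print(same)
```

Shuffling the inputs only shuffles the outputs in the same way, so nothing in the result reveals the original order.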

Sinusoidal encodings add position

Classic positional encodings add sine and cosine features at multiple frequencies to token embeddings.

Each position has a pattern.
Frequencies cover short and long ranges.
The vector now carries order information.
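
A minimal sketch of the additive scheme follows. It assumes the interleaved sine/cosine layout and the base of 10000 used in the original Transformer formulation; the function name `sinusoidal_encoding` and the toy embedding are hypothetical.

```python
import math

def sinusoidal_encoding(pos, d_model, base=10000.0):
    """Return the d_model-dimensional sinusoidal vector for one position."""
    pe = []
    for i in range(0, d_model, 2):
        # Frequency falls geometrically as the dimension index grows.
        freq = 1.0 / base ** (i / d_model)
        pe.append(math.sin(pos * freq))
        pe.append(math.cos(pos * freq))
    return pe[:d_model]

# The same (hypothetical) token embedding placed at two positions.
token_embedding = [0.5, -0.2, 0.1, 0.3]
x0 = [e + p for e, p in zip(token_embedding, sinusoidal_encoding(0, 4))]
x5 = [e + p for e, p in zip(token_embedding, sinusoidal_encoding(5, 4))]
print(x0 != x5)
```

The identical token ends up with a different input vector at each position, which is how order reaches the attention layers in this scheme.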

RoPE rotates query and key features

Rotary position embeddings rotate pairs of Q and K dimensions by an angle tied to token position.

For a pair with frequency theta, position m maps to angle m times theta.
Value vectors are left unrotated.
Rotation happens before attention scores.
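
The rotation can be sketched as below. This assumes adjacent dimensions are paired, an even vector length, and a base of 10000 for the per-pair frequencies; some implementations pair dimensions differently. The name `rope` is hypothetical.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs (vec[i], vec[i+1]) by angle pos * theta."""
    d = len(vec)  # assumed even
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)      # one frequency per pair
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])  # standard 2D rotation
    return out

q = [1.0, 0.0, 0.0, 1.0]
q_at_0 = rope(q, 0)   # position 0 means zero angle, so the vector is unchanged
q_at_7 = rope(q, 7)
print(q_at_0 == q, q_at_7 != q)
```

The same function would be applied to both the query and the key of each token before their dot product is taken.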

Two tokens carry different rotations

Tokens at positions m and n receive different rotations, even if their content vectors are similar.

The content remains in the vector.
The orientation now encodes location.
A query and a key can then be compared by their relative offset.
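
One way to see that content stays while orientation changes is to treat a single feature pair as a complex number, so rotation becomes multiplication by a unit-magnitude factor. The frequency `theta`, the value `content`, and the positions 3 and 7 are hypothetical choices for illustration.

```python
import cmath

theta = 0.5                     # hypothetical rotation frequency for one pair
content = complex(1.0, 2.0)     # one (x, y) feature pair written as x + iy

def rotate(z, pos):
    """Rotate the pair by angle pos * theta."""
    return z * cmath.exp(1j * pos * theta)

at_m = rotate(content, 3)
at_n = rotate(content, 7)

# Magnitudes match the original content; phases differ by position.
print(abs(at_m), abs(at_n))
print(cmath.phase(at_m), cmath.phase(at_n))
```

Both rotated copies keep the magnitude of the original pair, and only their angle records where the token sits.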

The dot product reads relative position

When rotated Q and K vectors are compared, the score depends on their content and on the relative offset m - n, not on either absolute position.

Attention keeps content matching.
The score also carries distance information.
This makes relative order available to heads.
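
A small numeric check of this property, under the same assumptions as a typical RoPE sketch (adjacent dimension pairs, base 10000). The vectors `q` and `k` and the chosen positions are arbitrary illustrative values.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of vec by angle pos * theta, one theta per pair."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        angle = pos * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out

def score(q, k, m, n):
    """Dot product of q rotated to position m and k rotated to position n."""
    return sum(a * b for a, b in zip(rope(q, m), rope(k, n)))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]

s_near = score(q, k, 5, 2)        # offset 3
s_far = score(q, k, 105, 102)     # offset 3, shifted by 100 positions
s_other = score(q, k, 5, 4)       # offset 1

print(math.isclose(s_near, s_far, abs_tol=1e-9))   # same offset, same score
print(math.isclose(s_near, s_other, abs_tol=1e-9)) # different offset, different score
```

Shifting both tokens by the same amount leaves the score unchanged up to floating-point error, while changing the gap between them changes it.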

Multiple frequencies cover multiple scales

Different rotation frequencies let the model represent nearby and longer-range position patterns.

Short scales help local syntax.
Long scales help broader context.
The positional signal travels through attention.
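
To get a feel for the scales involved, the sketch below lists how many positions each pair needs to complete one full turn. It assumes the common geometric frequency schedule with base 10000; `rope_wavelengths` is a hypothetical helper name.

```python
import math

def rope_wavelengths(d, base=10000.0):
    """Positions needed for each dimension pair to complete one full rotation."""
    return [2 * math.pi / base ** (-i / d) for i in range(0, d, 2)]

waves = rope_wavelengths(8)
for pair_index, w in enumerate(waves):
    print(pair_index, round(w))
```

With 8 dimensions, the four pairs span roughly 6, 63, 628, and 6283 positions per turn: the fast pairs distinguish neighboring tokens, and the slow pairs change meaningfully only over long distances.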