Read this as: which attention tensors can be reused from prior tokens?
- Failure Trap: Recomputing the entire prefix for every new token and mistaking that for inherent generation cost.
- Decision Rule: Cache prior keys and values; compute the new query and append fresh K/V for the current token.
The first token computes everything
On the first generated token, the model computes keys and values for every prompt token.
- There is no prior cache yet.
- Attention needs K and V for the prefix.
- The cache is filled as a side effect.
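The first step can be sketched as a loop that projects every prompt token to a key and a value and stores the results. This is a minimal single-layer, single-head illustration in plain Python; the names `prefill`, `matvec`, `W_k`, and `W_v`, and the random weights, are assumptions made for the example and do not come from any particular library.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def prefill(prompt_embeddings, W_k, W_v):
    """Project every prompt token to K and V and return them as the initial cache."""
    cache = {"k": [], "v": []}
    for x in prompt_embeddings:
        cache["k"].append(matvec(W_k, x))
        cache["v"].append(matvec(W_v, x))
    return cache

random.seed(0)
d = 4  # toy embedding size (assumption)
W_k = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
W_v = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
prompt = [[random.gauss(0, 1) for _ in range(d)] for _ in range(3)]

cache = prefill(prompt, W_k, W_v)
print(len(cache["k"]), len(cache["v"]))  # one K and one V per prompt token
```

A real model repeats this in every layer and every head, and usually does it for all prompt tokens in one batched matrix multiply rather than a Python loop.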
Naive decoding repeats old work
Without a cache, generating the second token would recompute keys and values for every token the model has already processed.
- The prefix grows every step.
- K/V for old tokens do not change, because causal attention never lets earlier positions see later ones.
- Recomputing them wastes projection and attention work.
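To make the waste concrete, the sketch below counts how many token-level K/V projections one layer performs with and without a cache. The function names and the counting convention (one unit per token whose K and V are computed) are assumptions chosen for illustration.

```python
def kv_projections_naive(prompt_len, new_tokens):
    """Every step reprojects the whole prefix seen so far."""
    total = 0
    for step in range(new_tokens):
        total += prompt_len + step  # prompt plus tokens generated before this step
    return total

def kv_projections_cached(prompt_len, new_tokens):
    """Each token is projected exactly once and then reused."""
    # The first step projects the prompt; each later step projects one new token.
    return prompt_len + (new_tokens - 1)

print(kv_projections_naive(100, 50))   # 6225
print(kv_projections_cached(100, 50))  # 149
```

The naive count grows quadratically with the number of generated tokens, while the cached count grows linearly.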
The cache stores prior keys and values
The KV cache keeps each layer's key and value tensors for tokens already seen.
- Cache size grows linearly with context length.
- It trades memory for speed.
- It is per request and per layer.
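The memory side of the trade can be estimated with simple arithmetic: two tensors (K and V) per layer, each holding one vector per KV head per token. The helper below is a rough sketch; the function name and the example configuration (32 layers, 32 KV heads, head dimension 128, 16-bit values) are hypothetical and not tied to any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Approximate KV cache size: K and V, per layer, per head, per token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)

# Hypothetical configuration, for illustration only.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
full = kv_cache_bytes(32, 32, 128, seq_len=4096)

print(per_token // 1024, "KiB per token")  # 512 KiB per token
print(f"{full / 2**30:.1f} GiB at 4096 tokens")  # 2.0 GiB at 4096 tokens
```

Because the cache is per request, serving several requests at once multiplies this figure by the batch size.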
The next token reuses old K/V
For the current token, the model computes a fresh query and attends against cached keys and values.
- Old keys and values are read, not rebuilt.
- The new query still sees the prefix.
- The output distribution matches the uncached computation, up to floating-point differences.
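A small check of that equivalence, assuming a single attention head with random toy weights: the cached path reuses stored K/V for the prefix and only projects the current token, while the uncached path rebuilds everything. Since causal attention lets the current token attend to itself, its own K/V are included in both paths. All names here (`attend`, `matvec`, `rand_matrix`) are illustrative.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over lists of keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d = 4
W_q, W_k, W_v = (rand_matrix(d, d) for _ in range(3))
tokens = rand_matrix(5, d)  # 4 prefix tokens plus the current token
x_new = tokens[-1]

# Cached path: prefix K/V already exist from earlier steps.
cached_k = [matvec(W_k, x) for x in tokens[:-1]]
cached_v = [matvec(W_v, x) for x in tokens[:-1]]
q = matvec(W_q, x_new)
out_cached = attend(q,
                    cached_k + [matvec(W_k, x_new)],
                    cached_v + [matvec(W_v, x_new)])

# Uncached path: rebuild K/V for every token, then attend from the last one.
all_k = [matvec(W_k, x) for x in tokens]
all_v = [matvec(W_v, x) for x in tokens]
out_full = attend(matvec(W_q, x_new), all_k, all_v)

match = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
            for a, b in zip(out_cached, out_full))
print(match)  # True
```

Only the query for the current token is needed; queries for earlier tokens are never reused, which is why the cache holds K and V but not Q.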
Append the new token's K/V
After the current token is processed, its own key and value are appended to the cache for the next step.
- The cache extends by one token.
- Future tokens can attend to it.
- Long contexts become memory-heavy.
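The append step suggests a simple data structure: one growing list of keys and one of values per layer. The class below is a hypothetical sketch (`KVCache` and its methods are invented for this example); production systems typically use preallocated or paged tensors instead of Python lists.

```python
class KVCache:
    """Per-request cache holding one key list and one value list per layer."""

    def __init__(self, num_layers):
        self.keys = [[] for _ in range(num_layers)]
        self.values = [[] for _ in range(num_layers)]

    def append(self, layer, k, v):
        """Store the current token's key and value for one layer."""
        self.keys[layer].append(k)
        self.values[layer].append(v)

    def seq_len(self):
        """Number of tokens cached so far (identical across layers)."""
        return len(self.keys[0])

cache = KVCache(num_layers=2)
for token_id in range(3):          # three decode steps
    for layer in range(2):         # every layer appends its own K/V
        k = [float(token_id), float(layer)]   # placeholder vectors
        v = [float(layer), float(token_id)]
        cache.append(layer, k, v)

print(cache.seq_len())  # 3
```

Each decode step extends every layer's lists by exactly one entry, so the memory footprint tracks the context length.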
Incremental work grows linearly
The cache changes decoding from repeated prefix recompute to incremental work per new token.
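- Per-step compute covers one new token attending over the cached context, not the whole prefix again.
- Memory bandwidth becomes important, because every step reads the whole cache.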
- This is why streaming tokens can be fast.
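The difference in per-step attention work can be counted directly. The sketch below tallies query-key dot products for one head in one layer at a given context length; the function names and the counting convention are assumptions for illustration.

```python
def score_dot_products_naive(context_len):
    """Full causal attention: the token at position i scores against i + 1 keys."""
    return context_len * (context_len + 1) // 2

def score_dot_products_cached(context_len):
    """Only the newest query is scored, once against every key in context."""
    return context_len

for n in (128, 1024, 8192):
    print(n, score_dot_products_naive(n), score_dot_products_cached(n))
# 128 8256 128
# 1024 524800 1024
# 8192 33558528 8192
```

With the cache, per-step work grows linearly with context length; without it, each step repeats quadratic work over the whole prefix.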