KV-Cache: Why the 2nd Token Is Faster | Explainers

Read this as: which attention tensors can be reused from prior tokens?

Failure Trap: Recomputing the entire prefix for every new token and mistaking that for inherent generation cost.
Decision Rule: Cache prior keys and values; compute the new query and append fresh K/V for the current token.

The first token computes everything

On the first generated token, the model computes keys and values for every prompt token.

There is no prior cache yet.
Attention needs K and V for the prefix.
The cache is filled as a side effect.
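
As a minimal sketch of this first pass, the toy code below projects every prompt token to a key and a value and keeps the results. It assumes a single layer and a single head; the 2x2 matrices `W_K` and `W_V`, the helper `matvec`, and the function name `prefill` are made up for illustration and do not come from any real model or library.

```python
# Toy single-layer, single-head first pass in pure Python.
# W_K and W_V are hypothetical 2x2 projections, not real model weights.

def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

W_K = [[1.0, 0.0], [0.0, 2.0]]   # hypothetical key projection
W_V = [[0.5, 0.5], [1.0, -1.0]]  # hypothetical value projection

def prefill(prompt_embeddings):
    """Compute K and V for every prompt token and return them as the cache."""
    cache = {"keys": [], "values": []}
    for x in prompt_embeddings:
        cache["keys"].append(matvec(W_K, x))
        cache["values"].append(matvec(W_V, x))
    return cache

prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token embeddings
cache = prefill(prompt)
print(len(cache["keys"]), len(cache["values"]))  # prints: 3 3
```

The cache here is just the by-product of work that attention needed anyway: one key and one value per prompt token.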

Naive decoding repeats old work

Without a cache, token two would recompute keys and values for tokens the model already processed.

The prefix grows every step.
K/V for old tokens do not change, because causal attention never lets them see later tokens.
Recomputing them is wasted work.
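
To put a rough number on the repeated work, the sketch below counts K/V projection calls per layer with and without a cache. The function name `kv_projections` and the prompt and generation lengths are assumptions chosen only for illustration.

```python
def kv_projections(prompt_len, new_tokens, use_cache):
    """Count per-layer K/V projection calls (one call = K and V for one token)."""
    total = 0
    for step in range(new_tokens):
        prefix_len = prompt_len + step  # tokens in the sequence at this step
        if use_cache:
            # The prompt is projected once at step 0; later only the newest token is.
            total += prefix_len if step == 0 else 1
        else:
            # Every token in the prefix is projected again.
            total += prefix_len
    return total

naive = kv_projections(prompt_len=100, new_tokens=50, use_cache=False)
cached = kv_projections(prompt_len=100, new_tokens=50, use_cache=True)
print(naive, cached)  # prints: 6225 149
```

The gap widens as the prefix grows, since the naive count adds the full prefix length at every step.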

The cache stores prior keys and values

The KV cache keeps each layer's key and value tensors for tokens already seen.

Cache size grows linearly with context length.
It trades memory for speed.
It is per request and per layer.
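
A back-of-the-envelope estimate makes the memory side of the trade concrete. The sketch below assumes the cache holds one key and one value vector per token, per KV head, per layer; the function name `kv_cache_bytes` and the model shape are hypothetical and picked only so the arithmetic is easy to follow.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_element=2):
    """Estimate KV-cache size for one request. The leading 2 covers K and V."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_element

# Hypothetical model shape with 16-bit elements, chosen only for illustration.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(size / 2**30, "GiB")  # prints: 2.0 GiB
```

Because the estimate is per request, serving several requests at once multiplies it by the number of concurrent sequences.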

The next token reuses old K/V

For the current token, the model computes a fresh query and attends against cached keys and values.

Old keys and values are read, not rebuilt.
The new query still sees the prefix.
The output distribution matches full recomputation, up to floating-point rounding.
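
The reuse can be checked on a toy example. The sketch below runs single-head scaled dot-product attention for the current token twice: once rebuilding K/V for the whole sequence, and once reading K/V for the prefix from a cache. The 2x2 projections and the embeddings are invented for illustration, and the current token's own key and value are included in both paths so that it can attend to itself.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / norm for i in range(dim)]

# Hypothetical 2x2 projections, not real model weights.
W_Q = [[1.0, 1.0], [0.0, 1.0]]
W_K = [[1.0, 0.0], [0.0, 2.0]]
W_V = [[0.5, 0.5], [1.0, -1.0]]

prefix = [[1.0, 0.0], [0.0, 1.0]]  # embeddings of tokens already seen
current = [1.0, 1.0]               # embedding of the current token
query = matvec(W_Q, current)       # only the current token needs a query

# Path 1: rebuild K and V for the whole sequence.
all_tokens = prefix + [current]
full_out = attend(query,
                  [matvec(W_K, x) for x in all_tokens],
                  [matvec(W_V, x) for x in all_tokens])

# Path 2: read prefix K/V from the cache, project only the current token.
cached_keys = [matvec(W_K, x) for x in prefix]    # built during earlier steps
cached_values = [matvec(W_V, x) for x in prefix]
cached_out = attend(query,
                    cached_keys + [matvec(W_K, current)],
                    cached_values + [matvec(W_V, current)])

print(all(math.isclose(a, b) for a, b in zip(full_out, cached_out)))  # prints: True
```

Both paths feed identical keys and values into the same attention call, which is why the result does not depend on whether the prefix tensors were rebuilt or read back.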

Append the new token's K/V

As the current token is processed, its own key and value are appended to the cache, so it can attend to itself and later steps can reuse them.

The cache extends by one token.
Future tokens can attend to it.
Long contexts become memory-heavy.
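
One way to picture the append is a small per-layer container that grows by one entry per decoding step. The class `LayerKVCache`, the function `decode_step`, and the layer count below are hypothetical; production systems typically use preallocated or paged tensors rather than Python lists, and the placeholder vectors stand in for real projections.

```python
class LayerKVCache:
    """Keys and values for one layer of one request (hypothetical structure)."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        """Store the current token's key and value at the end of the cache."""
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)

num_layers = 4  # assumed layer count, for illustration only
cache = [LayerKVCache() for _ in range(num_layers)]

def decode_step(cache, per_layer_kv):
    """Append the current token's (key, value) pair to each layer's cache."""
    for layer_cache, (key, value) in zip(cache, per_layer_kv):
        layer_cache.append(key, value)

for step in range(3):
    # Placeholder vectors stand in for real projections of the current token.
    fake_kv = [([float(step)], [float(step)]) for _ in range(num_layers)]
    decode_step(cache, fake_kv)

print([len(layer) for layer in cache])  # prints: [3, 3, 3, 3]
```

Every layer's cache grows in lockstep, one entry per generated token, which is where the memory pressure at long context comes from.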

Incremental work grows linearly

The cache replaces repeated recomputation of the prefix with incremental work for each new token.

Per-step attention compute scales linearly with context length instead of quadratically.
Memory bandwidth becomes important, since each step reads the whole cache.
This is why streaming tokens can be fast.
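
The linear-versus-quadratic difference can be sketched by counting query-key dot products for a single decoding step. The function name `attention_scores_per_step` is made up, and the count covers one head in one layer under causal masking; it ignores the feed-forward layers and every other cost.

```python
def attention_scores_per_step(context_len, use_cache):
    """Count query-key dot products for one decoding step, one head, one layer."""
    if use_cache:
        # One new query against every key in the context.
        return context_len
    # Full causal recompute: the query at position i attends to keys 0..i.
    return context_len * (context_len + 1) // 2

for n in (10, 100, 1000):
    print(n, attention_scores_per_step(n, True), attention_scores_per_step(n, False))
# prints:
# 10 10 55
# 100 100 5050
# 1000 1000 500500
```

With the cache, a step at context length n costs n dot products, but it still has to read n cached keys and values, which is why bandwidth rather than arithmetic tends to set the pace.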