Speculative Decoding: Draft, Then Verify

Read this as: how can the target model emit more than one accepted token per pass?

Failure Trap: Thinking the draft model replaces the target model instead of proposing tokens the target verifies.
Decision Rule: Let the draft model guess ahead, then accept only the prefix the target model validates.

One target pass normally emits one token

Autoregressive decoding calls the target model repeatedly, appending one token at a time.

Each emitted token needs its own forward pass of the target model.
Large target models are expensive.
Latency accumulates over long outputs.
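
As a minimal sketch of the loop described above, the following uses a hypothetical `toy_target` function in place of a real model (it simply counts upward) and tallies the calls it makes. The names `greedy_decode` and `toy_target` are illustrative assumptions, not part of any library.

```python
from typing import Callable, List, Tuple

def greedy_decode(next_token: Callable[[List[int]], int],
                  prefix: List[int],
                  n_new: int) -> Tuple[List[int], int]:
    """Plain autoregressive decoding: one model call per emitted token."""
    tokens = list(prefix)
    calls = 0
    for _ in range(n_new):
        tokens.append(next_token(tokens))  # stands in for one forward pass
        calls += 1
    return tokens, calls

def toy_target(tokens: List[int]) -> int:
    # Hypothetical stand-in for the target model: next token = last + 1 (mod 10).
    return (tokens[-1] + 1) % 10

out, calls = greedy_decode(toy_target, [0], 5)
print(out, calls)
```

Five new tokens cost five calls here. That one-to-one ratio is what speculative decoding tries to improve.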

A draft model guesses ahead

A smaller or faster model proposes several candidate tokens from the current prefix.

The draft path is cheap.
It may be wrong.
Its job is proposal, not final authority.
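
The proposal step can be sketched the same way. Here `toy_draft` is a hypothetical cheap model that usually counts upward but makes one deliberate mistake, so the draft span contains a wrong guess. The function names and the draft length `k` are assumptions for illustration.

```python
from typing import Callable, List

def propose(draft_next: Callable[[List[int]], int],
            prefix: List[int],
            k: int) -> List[int]:
    """Let the draft model run k steps ahead of the current prefix."""
    context = list(prefix)
    draft = []
    for _ in range(k):
        token = draft_next(context)  # cheap call; may be wrong
        draft.append(token)
        context.append(token)        # the draft conditions on its own guesses
    return draft

def toy_draft(tokens: List[int]) -> int:
    # Hypothetical draft model: counts upward, but wrongly emits 5 instead of 4.
    nxt = (tokens[-1] + 1) % 10
    return 5 if nxt == 4 else nxt

print(propose(toy_draft, [0], 4))
```

Nothing returned by `propose` is emitted yet. It is only a candidate span for the target model to check.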

The target verifies in parallel

The target model scores every proposed position in a single forward pass, then an acceptance rule compares each draft token against the target's own distribution at that position.

The target model remains the judge.
Verification is parallel over the draft span.
Correctness depends on the acceptance rule.
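
One commonly described acceptance rule for speculative sampling accepts a draft token x with probability min(1, p(x) / q(x)), where p(x) is the probability the target assigns to x and q(x) is the probability the draft assigned to it. The sketch below assumes that rule. The function names are illustrative, and a real system would apply the check to every position of the draft span using probabilities from one target pass.

```python
import random

def accept_probability(p_target: float, q_draft: float) -> float:
    """Probability of keeping a draft token under the min(1, p/q) rule."""
    if q_draft <= 0.0:
        # A token the draft actually sampled must have had q > 0.
        raise ValueError("q_draft must be positive")
    return min(1.0, p_target / q_draft)

def is_accepted(p_target: float, q_draft: float, rng: random.Random) -> bool:
    """Draw one accept/reject decision for a single draft token."""
    return rng.random() < accept_probability(p_target, q_draft)

rng = random.Random(0)
print(accept_probability(0.5, 0.25))  # target likes the token more than the draft did
print(accept_probability(0.1, 0.4))   # draft was overconfident
print(is_accepted(0.1, 0.4, rng))
```

Tokens the target rates at least as likely as the draft did are always kept. Tokens the draft over-rated are kept only some of the time.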

Accepted prefix is emitted

If the target agrees with the draft prefix, multiple tokens can be accepted from one target pass.

Only the consecutive prefix before the first rejection is accepted.
More acceptance means more speedup.
The sequence remains target-model valid.
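
For the simpler greedy case, where a draft token is accepted only if it equals the target's top choice at that position, the prefix rule can be sketched as below. The helper name and the example token lists are assumptions for illustration.

```python
from typing import List

def accepted_prefix_len(draft: List[int], target_choices: List[int]) -> int:
    """Count draft tokens that match the target before the first mismatch."""
    n = 0
    for d, t in zip(draft, target_choices):
        if d != t:
            break  # everything from here on is discarded
        n += 1
    return n

# The target agrees on the first three positions, then disagrees.
print(accepted_prefix_len([1, 2, 3, 5], [1, 2, 3, 4]))

# A later "match" does not count: those target scores were conditioned
# on a draft token that has already been rejected.
print(accepted_prefix_len([1, 9, 3, 4], [1, 2, 3, 4]))
```

The second call shows why acceptance stops at the first mismatch instead of counting every matching position.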

First rejection is corrected

When a draft token is rejected, the target model supplies the correction and decoding resumes from there.

Later draft tokens are discarded.
The target distribution is preserved.
Bad drafts reduce speedup, not correctness.
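
One way the correction is commonly described for sampling-based decoding is that the replacement token is drawn from the normalized positive part of p - q, where p is the target distribution and q the draft distribution at the rejected position. The sketch below assumes that rule together with min(1, p/q) acceptance, and checks on a small made-up example that the combined accept-or-correct step emits tokens with exactly the target's probabilities. All names and numbers are illustrative.

```python
from typing import List

def residual(p: List[float], q: List[float]) -> List[float]:
    """Correction distribution: normalize max(0, p - q)."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    if total == 0.0:
        # p and q are identical, so a rejection cannot occur; fall back to p.
        return list(p)
    return [r / total for r in raw]

def emitted_distribution(p: List[float], q: List[float]) -> List[float]:
    """Overall probability of emitting each token at one position."""
    # Drafted with prob q(x), kept with prob min(1, p/q): mass min(p, q).
    accept_mass = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_prob = 1.0 - sum(accept_mass)
    res = residual(p, q)
    return [a + reject_prob * r for a, r in zip(accept_mass, res)]

p = [0.5, 0.3, 0.2]  # hypothetical target distribution
q = [0.2, 0.6, 0.2]  # hypothetical draft distribution
print(residual(p, q))
print(emitted_distribution(p, q))
```

In this example the draft over-weights the second token, yet the emitted distribution still matches `p` up to floating-point rounding. A poor draft costs rejections, not correctness.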

More accepted tokens per target pass

The win is tokens per target pass: one pass can validate and emit several tokens, so the same output needs fewer target passes.

Speedup depends on draft quality.
The method helps latency and throughput.
It does not make the large model optional.
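
To make the dependence on draft quality concrete, here is a rough estimate under two simplifying assumptions that are not stated in the text above. First, each draft token is accepted independently with the same probability `alpha`. Second, every target pass also contributes one token of its own: the correction after a rejection, or one extra token when the whole draft is accepted. Under those assumptions the expected number of tokens per target pass for a draft of length `k` is (1 - alpha^(k+1)) / (1 - alpha).

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass under the stated assumptions."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)  # every draft token accepted, plus one target token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

for alpha in (0.0, 0.5, 0.8):
    print(alpha, expected_tokens_per_pass(alpha, 4))
```

With `alpha = 0.0` the estimate falls back to one token per pass, which is ordinary autoregressive decoding plus the wasted drafting cost. The estimate ignores the draft model's own runtime, so it is an upper bound on the benefit and not a measured speedup.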