Speculative Decoding: Draft, Then Verify

See how a fast draft model proposes several tokens and the target model verifies them in parallel.

Read this as: How can the target model emit more than one accepted token per pass?
Failure Trap
Thinking the draft model replaces the target model instead of proposing tokens the target verifies.
Decision Rule
Let the draft model guess ahead, then accept only the prefix the target model validates.
[Diagram: Baseline (target model, 1 token per pass) → Draft K (small model guesses K tokens on a cheap path) → Verify (target scores the draft in one parallel pass) → Accept (prefix OK, emit many tokens, advance context) or Reject (first miss: target fixes it, tail discarded) → Speedup (draft is cheap, target verifies, fewer passes, more than 1 token per pass)]

One target pass normally emits one token

Autoregressive decoding calls the target model repeatedly, appending one token at a time.

  • Each accepted token needs a forward pass.
  • Large target models are expensive.
  • Latency accumulates over long outputs.
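
The baseline cost can be sketched in a few lines. This is a toy illustration, not a real model: `toy_target` is a hypothetical stand-in for an expensive forward pass, and the point is only that the pass counter grows one-for-one with the output length.

```python
from typing import Callable, List, Tuple

def greedy_decode(target_next: Callable[[List[int]], int],
                  prompt: List[int], n_new: int) -> Tuple[List[int], int]:
    """Baseline autoregressive loop: one target call per emitted token."""
    tokens = list(prompt)
    passes = 0
    for _ in range(n_new):
        tokens.append(target_next(tokens))  # one forward pass yields one token
        passes += 1
    return tokens, passes

def toy_target(tokens: List[int]) -> int:
    # Hypothetical "model": the next token is (last token + 1) mod 10.
    return (tokens[-1] + 1) % 10

out, passes = greedy_decode(toy_target, [0], 5)
print(out, passes)  # [0, 1, 2, 3, 4, 5] 5
```

Five new tokens cost five target passes here; speculative decoding aims to lower that ratio.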

A draft model guesses ahead

A smaller or faster model proposes several candidate tokens from the current prefix.

  • The draft path is cheap.
  • It may be wrong.
  • Its job is proposal, not final authority.
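
A minimal sketch of the proposal step, assuming a draft model exposed as a plain function from a token list to one next token. The names `propose` and `toy_draft` are hypothetical, and the toy draft is deliberately wrong in one situation to show that proposals carry no authority.

```python
from typing import Callable, List

def propose(draft_next: Callable[[List[int]], int],
            prefix: List[int], k: int) -> List[int]:
    """Run the cheap draft model k times, feeding its own guesses back in."""
    ctx = list(prefix)
    draft = []
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)  # the draft conditions on its own unverified guesses
    return draft

def toy_draft(tokens: List[int]) -> int:
    # Hypothetical draft: usually guesses (last + 1) mod 10,
    # but always guesses 0 right after a 3.
    if tokens[-1] == 3:
        return 0
    return (tokens[-1] + 1) % 10

print(propose(toy_draft, [0], 4))  # [1, 2, 3, 0]
```

The last guess may or may not match what the target would have produced; only verification can tell.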

The target verifies in parallel

The target model scores all proposed positions in one pass, comparing each draft token against its own next-token distribution at that position.

  • The target model remains the judge.
  • Verification is parallel over the draft span.
  • Correctness depends on the acceptance rule.
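
The text does not say which acceptance rule is meant. One widely used choice is the speculative sampling rule, written out below as an illustration. The notation is an assumption: $p$ is the target model's next-token distribution, $q$ is the draft model's, and $x_i$ is the draft token proposed at position $i$ (sampled from $q$).

```latex
\[
  \Pr[\text{accept } x_i]
  \;=\;
  \min\!\left(1,\; \frac{p(x_i \mid x_{<i})}{q(x_i \mid x_{<i})}\right),
  \qquad x_i \sim q(\cdot \mid x_{<i})
\]
```

Under this rule a token the target likes at least as much as the draft does is always accepted, and a token the draft over-weights is accepted only part of the time. A simpler greedy variant accepts $x_i$ only when it equals the target's top-scoring token at that position.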

Accepted prefix is emitted

If the target agrees with the draft prefix, multiple tokens can be accepted from one target pass.

  • Only the consecutive prefix is accepted.
  • More acceptance means more speedup.
  • The emitted sequence is still one the target model itself validated.
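
The "consecutive prefix" rule can be sketched as follows, assuming the greedy-matching variant in which a draft token is accepted only if it equals the target's own choice at that position. The function name and the example token lists are hypothetical.

```python
from typing import List

def accepted_prefix(draft: List[int], target_choice: List[int]) -> List[int]:
    """Keep draft tokens up to, but not including, the first mismatch.

    target_choice[i] is the token the target prefers at draft position i,
    as read off from its single parallel verification pass.
    """
    accepted = []
    for d, t in zip(draft, target_choice):
        if d != t:
            break  # first miss ends acceptance, even if later positions match
        accepted.append(d)
    return accepted

draft = [1, 2, 3, 0, 5]
target_choice = [1, 2, 3, 4, 5]
print(accepted_prefix(draft, target_choice))  # [1, 2, 3]
```

The final position matches in this example but is still dropped: the target's choice there was conditioned on the rejected token before it, so it cannot be trusted.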

First rejection is corrected

When a draft token is rejected, the target model supplies the correction and decoding resumes from there.

  • Later draft tokens are discarded.
  • The target distribution is preserved.
  • Bad drafts reduce speedup, not correctness.
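
One way the correction can preserve the target distribution is the residual resampling step used in speculative sampling: after a rejection, the replacement token is drawn from the normalized positive part of `p - q`. The sketch below assumes that scheme, which the text does not name; `p` and `q` are hypothetical target and draft probability lists over a three-token vocabulary.

```python
import random
from typing import List

def residual_distribution(p: List[float], q: List[float]) -> List[float]:
    """Normalize max(0, p - q): the mass the draft under-proposed."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    if total == 0.0:
        # p == q everywhere; a rejection cannot occur under the min(1, p/q)
        # rule in that case, so this fallback is only a safety net.
        return list(p)
    return [r / total for r in raw]

def correct_after_rejection(p: List[float], q: List[float],
                            rng: random.Random) -> int:
    """Sample the replacement token for the first rejected position."""
    weights = residual_distribution(p, q)
    return rng.choices(range(len(p)), weights=weights, k=1)[0]

p = [0.5, 0.4, 0.1]    # target distribution (assumed values)
q = [0.25, 0.25, 0.5]  # draft distribution (assumed values)
residual = residual_distribution(p, q)  # roughly [0.625, 0.375, 0.0]
token = correct_after_rejection(p, q, random.Random(0))
```

Token 2, which the draft over-proposed, has zero residual mass, so the correction can only pick token 0 or 1.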

More accepted tokens per target pass

The win is more tokens per target pass: the target model can validate and emit several tokens at once, so the same output needs fewer target passes.

  • Speedup depends on draft quality.
  • The method helps latency and throughput.
  • It does not make the large model optional.
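
How much draft quality matters can be estimated with a simple model. Assume, as a simplification, that each draft token is accepted independently with the same probability $\alpha$, that the draft proposes $K$ tokens, and that every target pass emits one extra token (the correction or a bonus token). Both symbols are assumptions introduced here, and the estimate ignores the cost of running the draft model.

```latex
\[
  \mathbb{E}[\text{tokens per target pass}]
  \;=\; \sum_{i=0}^{K} \alpha^{i}
  \;=\; \frac{1 - \alpha^{K+1}}{1 - \alpha}
\]
% Worked example: alpha = 0.8, K = 4
\[
  \frac{1 - 0.8^{5}}{1 - 0.8} \;=\; \frac{0.67232}{0.2} \;\approx\; 3.36
\]
```

With $\alpha = 0.8$ and $K = 4$ this gives about 3.4 tokens per target pass instead of 1. As $\alpha$ falls toward 0 the value approaches 1, which matches the baseline.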