Speculative Decoding

Trading cheap guesses for expensive forward passes

[TIL post about inference optimisation techniques]

A good way to understand speculative decoding is this:

Speculative decoding is “draft first, verify second.”

Instead of asking the large model to generate one token at a time, you use a smaller, faster model to guess several tokens ahead. Then the large model checks those guesses in one pass.

The goal is simple:

“Can we get the same output as the large model, but with fewer expensive large-model calls?”

The simplest mental model

Imagine two models:

Model                              Role               Intuition
Draft model (fast, small)          Proposes tokens    Quickly guesses the next few tokens
Target model (large, expensive)    Verifies tokens    Checks whether those guesses are acceptable

The draft model proposes a chunk of tokens. The target model checks them. If the target model agrees, those tokens are accepted all at once. If it disagrees, generation falls back to the target model's preferred token at the point of disagreement.

Why speculative decoding is needed

Speculative decoding exists because LLM generation is bottlenecked by sequential decoding. During inference, the model usually generates one token at a time, predicting then appending, over and over.

predict token 1
append token 1
predict token 2
append token 2
predict token 3
append token 3

Even if the GPU is powerful, the model cannot freely generate token 10 before it knows tokens 1 to 9. That makes decoding inherently step-by-step. This becomes painful with large models because every token requires another expensive pass through the model.
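Here is a minimal sketch of that step-by-step loop. The `next_token` function and its lookup table are hypothetical stand-ins for one full forward pass of a large model, not a real LLM API. The point is only that each new token needs its own call, and each call depends on the previous append.

```python
from typing import Callable, List

# Hypothetical toy "model": maps the last token to the next one.
# A real LLM would run a full forward pass over the whole context here.
TOY_TABLE = {
    "The": "capital", "capital": "of", "of": "France",
    "France": "is", "is": "Paris", "Paris": ".",
}

def next_token(context: List[str]) -> str:
    return TOY_TABLE.get(context[-1], "<eos>")

def generate(prompt: List[str], max_new_tokens: int,
             step: Callable[[List[str]], str] = next_token) -> List[str]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = step(tokens)      # one expensive pass per new token
        if tok == "<eos>":
            break
        tokens.append(tok)      # the next prediction needs this token first
    return tokens

output = generate(["The", "capital", "of", "France", "is"], max_new_tokens=4)
print(output)
```

The loop body cannot be parallelised across iterations, because iteration i+1 reads the token appended in iteration i.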

Concrete example

Suppose the prompt is "The capital of France is". Normally the large model generates tokens one by one: "Paris", then ".", then the next token, and so on. Each is a separate expensive call.

With speculative decoding, the small draft model might quickly propose "Paris" followed by ".". The large target model then checks the whole proposed sequence in a single forward pass. If it would also have generated those tokens, it accepts both. So instead of spending two large-model decoding steps, you spent one large-model verification step plus two cheap draft steps.

Why this speeds things up

Autoregressive generation is slow because large language models usually decode sequentially. Each new token depends on the previous one. Speculative decoding exploits a useful fact: smaller models are often good enough to guess many obvious next tokens.

For boring or predictable text, like "Thank you for your email. I will get back to you", the draft model may correctly guess several tokens in a row. The large model does not need to laboriously produce each one. It can validate a batch of proposed tokens at once.

[Figure: Sequential decoding takes five expensive target-model passes, one per token (tok 1 to tok 5). In speculative decoding, the draft model cheaply proposes tok 1 to 5 and the target model verifies all five at once. Same five tokens; five slow passes become one.]
The target model is still the authority. It just runs fewer decoding iterations.

The key mechanism

Speculative decoding has three steps. The draft model proposes k tokens. The large model scores those k tokens in parallel. Then you keep the accepted prefix and replace the first rejected token with the large model's choice.

The loop looks like this:

[Figure: The speculative decoding loop: draft, verify, accept or correct. Current text (tokens so far) → draft model (guess k tokens fast) → target model (verify in one pass) → accept or correct. Accepted tokens become the new current text, then the loop repeats.]
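The loop can be sketched in a few lines for the greedy case, where "accept" simply means the drafted token equals the target's top choice. Everything here is an assumption for illustration: `target_next` and `draft_next` are toy functions, not real models, and the verification step is simulated by querying the toy target at each prefix, where a real target model would score all positions in one forward pass. The sketch also appends the target's own next token when every draft is accepted, which is how the commonly described version of the algorithm behaves.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # context -> greedy next token (toy stand-in)

def target_next(ctx: List[int]) -> int:
    # Toy "large" model: the next token is the last token + 1, modulo 10.
    return (ctx[-1] + 1) % 10

def draft_next(ctx: List[int]) -> int:
    # Toy "small" model: agrees with the target, except it guesses wrong after a 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10

def speculative_generate(prompt: List[int], n_new: int, k: int,
                         draft: Model = draft_next,
                         target: Model = target_next) -> Tuple[List[int], int]:
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < n_new:
        # 1. Draft: propose k tokens autoregressively with the cheap model.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. Verify: counted as ONE target pass. A real model would return its
        #    choice at all k + 1 positions from a single forward pass.
        target_passes += 1
        choices = [target(tokens + proposal[:i]) for i in range(k + 1)]
        # 3. Accept the agreeing prefix, then append the target's own token
        #    (the correction on a mismatch, or one extra token if all matched).
        n_ok = 0
        while n_ok < k and proposal[n_ok] == choices[n_ok]:
            n_ok += 1
        tokens += proposal[:n_ok] + [choices[n_ok]]
    return tokens[: len(prompt) + n_new], target_passes

tokens, passes = speculative_generate([2], n_new=8, k=4)
print(tokens, passes)
```

In this toy run the draft goes wrong once (after the 4), the target corrects it, and eight new tokens still cost only two target passes instead of eight. The output is identical to what greedy decoding with the target alone would produce.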

The important detail

Speculative decoding is not just “use a small model and hope.”

The clever part is that the algorithm can be designed so the final output distribution remains equivalent to sampling from the large target model.
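One widely described way to get this guarantee is a rejection-sampling rule. The post does not name a specific scheme, so treat this as an assumption: accept a drafted token x with probability min(1, p(x)/q(x)), where p is the target distribution and q is the draft distribution, and on rejection resample from max(0, p − q) renormalised. The sketch below uses a made-up three-token vocabulary. `accept_or_resample` shows the rule itself, and `output_distribution` computes exactly what distribution the rule emits, so you can check that it equals p.

```python
import random
from typing import List

def accept_or_resample(x: int, p: List[float], q: List[float],
                       rng: random.Random) -> int:
    """Given token x drafted from q, return the token that is actually emitted."""
    if rng.random() < min(1.0, p[x] / q[x]):
        return x  # accepted
    # Rejected: resample from the part of p that q under-covers.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0]

def output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of the emitted token for one accept/resample step."""
    # P(x drafted and accepted) = q[x] * min(1, p[x]/q[x]) = min(p[x], q[x])
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_prob = 1.0 - sum(accepted)
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(residual)
    if z == 0.0:  # p == q, so nothing is ever rejected
        return accepted
    return [a + reject_prob * r / z for a, r in zip(accepted, residual)]

p = [0.6, 0.3, 0.1]   # toy target distribution
q = [0.3, 0.5, 0.2]   # toy draft distribution
result = output_distribution(p, q)
print([round(v, 6) for v in result])
```

However different q is from p, the emitted distribution matches p. A poor draft only lowers the acceptance rate; it does not change what gets sampled.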

In other words:

Fast draft model = acceleration
Large target model = correctness authority

The draft model suggests. The target model decides.

Analogy: junior writer and editor

Think of the draft model as a junior writer and the target model as a senior editor. The junior writer quickly writes a full sentence. The editor reviews it. If it is fine, the whole sentence is accepted. But the moment the editor hits a word they would have written differently, everything before that point is kept, the disagreement is corrected, and the rest is thrown away.

[Figure: The draft proposes "moved to Friday afternoon". The target verifies: "moved" (accept), "to" (accept), "Friday" corrected to "Thursday", "afternoon" (discarded). The first mismatch ends this round.]
The junior writer saves time only when broadly aligned with the editor.
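The same round as a tiny sketch. The `target` list is a hypothetical stand-in for the editor's preferred word at each position, given the junior writer's words before it. That conditioning is why everything after the first mismatch has to go: the target's later choices were made assuming "Friday", which is no longer in the text.

```python
from typing import List

def verify_round(draft: List[str], target: List[str]) -> List[str]:
    """Keep the agreeing prefix, then the target's word at the first mismatch."""
    kept: List[str] = []
    for d, t in zip(draft, target):
        if d != t:
            kept.append(t)   # correct the disagreement and end this round
            break
        kept.append(d)       # agreement: keep the drafted word
    return kept

draft = ["moved", "to", "Friday", "afternoon"]
# Hypothetical target choices, each conditioned on the draft words before it.
target = ["moved", "to", "Thursday", "afternoon"]
print(verify_round(draft, target))  # ['moved', 'to', 'Thursday']
```

The next round starts drafting again from "moved to Thursday".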

Why not just use the small model?

Because the small model is not trusted to produce the final answer. It may be less accurate, less capable, less aligned, more brittle, and worse at reasoning. Speculative decoding uses it only as a proposal engine. The large model still controls the actual distribution of accepted output.

Where the speedup comes from

The speedup depends on the acceptance rate. If the draft proposes five tokens and the target accepts all five, that is a big win. If it accepts only the first one, the gain is small. Everything after the first rejection gets thrown away, so the longer the agreement runs, the more you save.

So speculative decoding works best when:

  • the draft model is much cheaper than the target model
  • the draft model is reasonably aligned with the target model (ideally from the same model family)
  • many next tokens are predictable
  • verification can be done efficiently in parallel

It works less well when:

  • the task is highly creative
  • the target model's distribution is very different from the draft model's
  • the draft model often guesses wrong
  • the overhead of drafting and verification eats the benefit
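To get a rough feel for these trade-offs, here is a back-of-the-envelope estimate. It rests on simplifying assumptions that are not from this post: each drafted token is accepted independently with probability `alpha`, a verification pass costs about the same as one normal target step, and `draft_cost` is the cost of one draft step relative to one target step. Under those assumptions, a round with k drafted tokens yields (1 − alpha^(k+1)) / (1 − alpha) tokens on average, counting the target's own token at the end.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Average tokens produced per target pass, assuming i.i.d. acceptance."""
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus the target's own token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Tokens per round divided by the round's cost in target-pass units."""
    round_cost = k * draft_cost + 1.0  # k draft steps + one verification pass
    return expected_tokens_per_round(alpha, k) / round_cost

for alpha in (0.5, 0.8, 0.95):
    tokens = expected_tokens_per_round(alpha, k=5)
    speedup = estimated_speedup(alpha, k=5, draft_cost=0.05)
    print(f"alpha={alpha}: {tokens:.2f} tokens/round, ~{speedup:.2f}x")
```

With a low acceptance rate or an expensive draft model the estimate drops below 1x, which is the "overhead eats the benefit" case from the list above.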

The important correction

A common misunderstanding is that speculative decoding makes the model smarter. It does not. It is an inference optimisation, not a capability improvement. It does not improve reasoning quality, factuality, alignment, or model knowledge. It aims to produce the output the target model would have produced anyway, only faster.

Better phrasing:

Speculative decoding improves latency and throughput by using cheap guesses to reduce expensive sequential decoding steps.

💡
Speculative decoding uses a fast draft model to guess several future tokens, then uses the large target model to verify and accept as many of those guesses as possible, reducing the number of expensive decoding steps while preserving the target model as the authority.