Jun 02, 2026

Speculative Decoding

Trading cheap guesses for expensive forward passes


A good way to understand speculative decoding is this:

Speculative decoding is “draft first, verify second.”

Instead of asking the large model to generate one token at a time, you use a smaller, faster model to guess several tokens ahead. Then the large model checks those guesses in one pass.

The goal is simple:

“Can we get the same output as the large model, but with fewer expensive large-model calls?”

The simplest mental model

Imagine two models: a small, fast draft model and a large, accurate target model.

The draft model proposes a chunk of tokens. The target model checks them. If the target model agrees, those tokens are accepted all at once. If it disagrees, generation falls back to the target model's preferred token at the point of disagreement.

Why speculative decoding is needed

Speculative decoding exists because LLM generation is bottlenecked by sequential decoding. During inference, the model usually generates one token at a time, predicting then appending, over and over.

predict token 1
append token 1
predict token 2
append token 2
predict token 3
append token 3

Even if the GPU is powerful, the model cannot freely generate token 10 before it knows tokens 1 to 9. That makes decoding inherently step-by-step. This becomes painful with large models because every token requires another expensive pass through the model.
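To make the step-by-step constraint concrete, here is a minimal sketch of plain autoregressive decoding. The names `greedy_decode` and `ToyModel` are made up for illustration, and the "model" just replays a fixed script, so treat it as a picture of the control flow rather than a real inference loop.

```python
from typing import List


class ToyModel:
    """Hypothetical stand-in for a large model: replays a script, counts calls."""

    def __init__(self, script: List[str]):
        self.script = script
        self.calls = 0

    def __call__(self, prefix: List[str]) -> str:
        self.calls += 1  # each call stands for one expensive forward pass
        if len(prefix) < len(self.script):
            return self.script[len(prefix)]
        return "<eos>"


def greedy_decode(next_token, prompt: List[str], max_new_tokens: int) -> List[str]:
    """Generate one token at a time; every step needs the full prefix so far."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # predict
        tokens.append(tok)        # append, then go round again
    return tokens


SCRIPT = ["The", "capital", "of", "France", "is", "Paris", "."]
model = ToyModel(SCRIPT)
output = greedy_decode(model, SCRIPT[:5], max_new_tokens=2)
print(output, "large-model calls:", model.calls)
```

The point to notice is that the number of large-model calls equals the number of new tokens. That one-to-one ratio is what speculative decoding tries to break.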

Concrete example

Suppose the prompt is "The capital of France is". Normally the large model generates tokens one by one: "Paris", then ".", then the next token, and so on. Each is a separate expensive call.

With speculative decoding, the small draft model might quickly propose "Paris" and "." in one go. The large target model then checks the whole proposed sequence in a single forward pass. If it would also have generated those tokens, it accepts both. So instead of spending two large-model decoding steps, you spend one large-model verification step plus a cheap draft step.

Why this speeds things up

Autoregressive generation is slow because large language models decode sequentially: each new token depends on every token before it. Speculative decoding exploits a useful fact: smaller models are often good enough to guess many obvious next tokens.

For boring or predictable text, like "Thank you for your email. I will get back to you", the draft model may correctly guess several tokens in a row. The large model does not need to laboriously produce each one. It can validate a batch of proposed tokens at once.

The target model is still the authority. It just runs fewer decoding iterations.

The key mechanism

Speculative decoding has three steps. The draft model proposes k tokens. The large model scores those k tokens in parallel. Then you keep the accepted prefix and replace the first rejected token with the large model's choice.

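The loop looks like this:

draft k tokens with the small model
score all k drafted tokens with the large model in one pass
keep the longest prefix the large model agrees with
replace the first rejected token with the large model's choice
repeat from the new end of the sequence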
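As a rough Python sketch of that draft, verify, accept loop, here is the greedy variant with toy models. Everything named here (`speculative_greedy`, `replay`, the two scripts) is hypothetical. One important simplification: a real transformer scores all drafted positions in a single forward pass, whereas this toy calls the target once per position and simply counts the whole batch as one pass.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[str]], str]  # hypothetical: prefix -> greedy next token


def replay(script: List[str]) -> Model:
    """Toy model that replays a fixed script based on prefix length."""
    return lambda prefix: script[len(prefix)] if len(prefix) < len(script) else "<eos>"


def speculative_greedy(draft: Model, target: Model, prompt: List[str],
                       max_new_tokens: int, k: int = 4) -> Tuple[List[str], int]:
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    target_passes = 0
    while len(tokens) < limit:
        # 1. The draft model proposes k tokens, one cheap step at a time.
        proposal: List[str] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target gives its own choice at each of the k + 1 positions.
        #    Counted as ONE pass here; the toy loops only for simplicity.
        target_passes += 1
        choices = [target(tokens + proposal[:i]) for i in range(k + 1)]
        # 3. Keep the agreed prefix, then take the target's token at the first
        #    disagreement (or its extra token if all k drafts matched).
        n = 0
        while n < k and proposal[n] == choices[n]:
            n += 1
        tokens.extend(proposal[:n] + [choices[n]])
    return tokens[:limit], target_passes


TARGET = "Thank you for your email . I will get back to you soon .".split()
DRAFT = "Thank you for your email . I will get back to you shortly .".split()
result, passes = speculative_greedy(replay(DRAFT), replay(TARGET), TARGET[:2],
                                    max_new_tokens=12, k=4)
print(result, "target passes:", passes)
```

In this toy run the output matches what the target alone would have produced, but the target is consulted far fewer times than once per token. The single disagreement ("shortly" versus "soon") costs one pass that yields only one token.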

The important detail

Speculative decoding is not just “use a small model and hope.”

The clever part is that, with a suitable accept/reject rule, the final output distribution is exactly the one you would get by sampling from the large target model alone.

In other words:

Fast draft model = acceleration
Large target model = correctness authority

The draft model suggests. The target model decides.
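For sampling rather than greedy decoding, the usual accept/reject rule is: accept a drafted token x with probability min(1, p(x)/q(x)), where q is the draft distribution and p is the target distribution; on rejection, resample from the leftover mass max(0, p - q), renormalised. Below is a small sketch of one such verification step on a toy vocabulary. The function name and the example probabilities are made up for illustration, and the simulation at the end is only an empirical sanity check, not a proof.

```python
import random
from typing import Dict

Dist = Dict[str, float]  # token -> probability


def accept_or_resample(token: str, q: Dist, p: Dist, rng: random.Random) -> str:
    """Verify one token that was drawn from the draft distribution q."""
    # Accept with probability min(1, p(token) / q(token)).
    if rng.random() < min(1.0, p.get(token, 0.0) / q[token]):
        return token
    # Rejected: resample from the residual max(0, p - q), renormalised.
    residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
    candidates = list(residual)
    weights = [residual[t] for t in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Toy distributions (assumed values): the draft is overconfident about "Paris".
q = {"Paris": 0.9, "Lyon": 0.1}
p = {"Paris": 0.6, "Lyon": 0.3, "Nice": 0.1}

rng = random.Random(0)
n = 20000
counts = {t: 0 for t in p}
for _ in range(n):
    drafted = rng.choices(list(q), weights=list(q.values()), k=1)[0]
    counts[accept_or_resample(drafted, q, p, rng)] += 1
freqs = {t: counts[t] / n for t in p}
print(freqs)  # should land close to p, even though drafts came from q
```

Even though every candidate starts life as a draft-model sample, the accepted-or-resampled token follows the target distribution. That is the sense in which the target model stays the authority.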

Analogy: junior writer and editor

Think of the draft model as a junior writer and the target model as a senior editor. The junior writer quickly writes a full sentence. The editor reviews it. If it is fine, the whole sentence is accepted. But the moment the editor hits a word they would have written differently, everything before that point is kept, the disagreement is corrected, and the rest is thrown away.

The junior writer saves time only when broadly aligned with the editor.

Why not just use the small model?

Because the small model is not trusted to produce the final answer. It may be less accurate, less capable, less aligned, more brittle, and worse at reasoning. Speculative decoding uses it only as a proposal engine. The large model still controls the actual distribution of accepted output.

Where the speedup comes from

The speedup depends on the acceptance rate. If the draft proposes five tokens and the target accepts all five, that is a big win. If it accepts only the first one, the gain is small. Everything after the first rejection gets thrown away, so the longer the agreement runs, the more you save.

So speculative decoding works best when:

the draft model is much cheaper than the target model
the draft model is reasonably aligned with the target model (ideally from the same model family)
many next tokens are predictable
verification can be done efficiently in parallel

It works less well when:

the task is highly creative
the target model's distribution is very different from the draft model's
the draft model often guesses wrong
the overhead of drafting and verification eats the benefit
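A back-of-envelope way to see these trade-offs is to assume each drafted token is accepted independently with probability alpha. That is a simplifying assumption; real acceptance varies token by token. Under it, one target pass over k drafted tokens yields 1 + alpha + alpha^2 + ... + alpha^k tokens on average. The sketch below also assumes each draft step costs a fraction c of a target pass; the function names and parameter values are illustrative only.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Average tokens produced per target pass, assuming i.i.d. acceptance."""
    if alpha >= 1.0:
        return float(k + 1)  # all k drafts accepted plus the target's extra token
    # Geometric series: 1 + alpha + ... + alpha^k
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def estimated_speedup(alpha: float, k: int, c: float) -> float:
    """Rough speedup over plain decoding.

    Assumes one round costs k draft steps (c each, in units of a target pass)
    plus one target verification pass, and ignores all other overhead.
    """
    return expected_tokens_per_pass(alpha, k) / (k * c + 1)


for alpha in (0.1, 0.5, 0.8, 0.95):
    print(alpha,
          round(expected_tokens_per_pass(alpha, 5), 2),
          round(estimated_speedup(alpha, 5, c=0.05), 2))
```

With a low acceptance rate or an expensive draft model the estimate drops below 1, which is the "overhead eats the benefit" case from the list above.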

The important correction

A common misunderstanding is that speculative decoding makes the model smarter. It does not. It is an inference optimisation, not a capability improvement. It does not improve reasoning quality, factuality, alignment, or model knowledge. It just produces the output the target model would have produced anyway, faster.

Better phrasing:

Speculative decoding improves latency and throughput by using cheap guesses to reduce expensive sequential decoding steps.

💡

Speculative decoding uses a fast draft model to guess several future tokens, then uses the large target model to verify and accept as many of those guesses as possible, reducing the number of expensive decoding steps while preserving the target model as the authority.

Speculative Decoding

by Anup Jadhav
