Draft Model Selection for Speculative Decoding
We have written about vanilla speculative decoding and the next generation of speculative decoding methods. The papers describe the algorithms, but they tend to gloss over the part that actually decides whether your deployment gets a 3x speedup or a 1.1x speedup: which draft model you pick.
Choosing a draft model looks simple on paper. Pick something smaller than your target. Run it. Verify in bulk. In practice, the choice involves at least four trade-offs, and getting any of them wrong wastes most of the potential gain. This post is the practical guide we wish someone had handed us when we first put speculative decoding into production.
The basic math: speedup is acceptance rate times pass count
The speedup from speculative decoding is roughly:
expected_speedup ≈ (1 + α + α^2 + ... + α^k) / (1 + k·c)
Where α is the per-token acceptance rate, k is the number of draft tokens proposed per round, and c is the cost of one draft forward pass relative to one target forward pass. The numerator is how many tokens you get per target forward pass on average. The denominator accounts for the fact that each round pays for k draft passes on top of the single verification pass.
This formula is worth staring at for a minute, because it makes every selection trade-off concrete. A draft model with 90% acceptance and 5% draft cost crushes a draft model with 70% acceptance and 1% draft cost, even though the second one is much smaller. A draft model with 95% acceptance that costs 30% of a target pass loses to a 75% acceptance draft model that costs 4% of a target pass. There is no single correct answer. The correct answer depends on what acceptance rate you can actually achieve and how cheap your draft is compared to the target.
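If you want to plug your own numbers in, the snippet below is a minimal sketch of that formula. The acceptance rates, relative draft costs, and the choice of k = 5 mirror the hypothetical comparisons above; they are illustrative values, not measurements.

```python
def expected_speedup(alpha: float, c: float, k: int) -> float:
    """Rough speculative decoding speedup estimate.

    alpha: per-token acceptance rate
    c:     cost of one draft forward pass relative to one target forward pass
    k:     draft tokens proposed per round
    """
    # Expected tokens committed per verification pass: 1 + alpha + ... + alpha^k
    tokens_per_pass = sum(alpha ** i for i in range(k + 1))
    # Each round pays for k draft passes plus one target verification pass
    return tokens_per_pass / (1 + k * c)

print(expected_speedup(0.90, 0.05, 5))  # ~3.7x: high acceptance, cheap draft
print(expected_speedup(0.70, 0.01, 5))  # ~2.8x: tiny draft, lower acceptance
print(expected_speedup(0.95, 0.30, 5))  # ~2.1x: great acceptance, expensive draft
print(expected_speedup(0.75, 0.04, 5))  # ~2.7x: modest acceptance, cheap draft
```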
Trade-off 1: size
The most common rule of thumb is "draft should be 10x to 30x smaller than the target." That is roughly correct, but it hides what is really going on.
What you want is a draft that is cheap enough to run that even modest acceptance rates are profitable. On a Llama 3.1 70B target, a Llama 3.2 1B draft typically runs in around 5% of the target's forward pass time, so even an acceptance rate of 60% gives a meaningful speedup. A 7B draft might hit 80% acceptance, but it costs 12 to 15% of the target pass, and the trade-off often comes out worse.
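To make that concrete with illustrative numbers: even at α = 0.6, the 1B draft (c ≈ 0.05, k = 4) gives roughly 2.3 / 1.2 ≈ 1.9x, and at a more typical 75% acceptance it reaches about 2.5x. The 7B draft at 80% acceptance but c ≈ 0.13 comes out around 3.4 / 1.5 ≈ 2.2x, despite proposing better tokens.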
There is also a hard floor. Below about 500M parameters, draft quality on real prompts (especially code, chain of thought, structured output) drops off a cliff. The acceptance rate falls into the 30 to 50% range, the speedup collapses, and you would have been better off without speculation at all. TinyLlama 1.1B is roughly the smallest model worth using as a general-purpose draft for production traffic.
The practical sizing window for general-purpose draft models in 2026 is 1B to 3B parameters when the target is 30B or larger. Below 30B, the draft cost becomes a much bigger fraction of the target pass and you need to be more careful.
Trade-off 2: vocabulary and family alignment
This one bites people who try to mix and match models. Speculative decoding requires the draft and target to share a tokenizer. If they tokenize differently, you have to translate proposed tokens between vocabularies, and the verification step gets messy. Most production deployments avoid this by sticking to drafts and targets from the same model family.
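It is worth a two-minute check that the tokenizers really match before you commit to a pair. A minimal sketch using Hugging Face transformers; the Llama 3.x checkpoints named below are just an example pairing, swap in whatever you actually serve.

```python
from transformers import AutoTokenizer

# Example draft/target pair; substitute your own checkpoints
draft = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
target = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")

# Identical vocabularies are the baseline requirement for cheap verification
same_vocab = draft.get_vocab() == target.get_vocab()
print(f"draft vocab={len(draft)}, target vocab={len(target)}, identical={same_vocab}")

# Even with matching vocabs, sanity-check that representative text tokenizes identically
sample = "def binary_search(arr, target):"
print(draft.encode(sample) == target.encode(sample))
```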
Beyond tokenizer alignment, family alignment also matters for acceptance rate. A Qwen 2.5 1.5B draft for a Qwen 2.5 72B target hits 75 to 85% acceptance on most prompts, because both models were trained on overlapping data with similar objectives. A Llama 3.2 1B draft for a Qwen 2.5 72B target, even after retokenization tricks, tops out around 50 to 60% because the two models disagree about token distributions in subtle but consistent ways.
The general guideline:
- Same family, same generation: best acceptance rate. Use this when available.
- Same family, different generation (e.g. Llama 3.2 draft with Llama 3.1 target): usually fine, expect a few percentage points lower acceptance.
- Different families: only when forced. The drop in acceptance is rarely worth it.
Trade-off 3: distillation
For a long time, the conventional wisdom was that you should distill your draft model from your target. The intuition makes sense. A distilled draft has been trained to mimic the target's exact output distribution, so the acceptance rate should be higher than an off-the-shelf small model.
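For illustration, here is a minimal sketch of what that training objective looks like as a token-level KL distillation step in PyTorch. It assumes you have target (teacher) logits available for the same input ids; the names, shapes, and temperature are illustrative, not a prescribed recipe.

```python
import torch
import torch.nn.functional as F

def distill_step(draft_logits: torch.Tensor,
                 target_logits: torch.Tensor,
                 temperature: float = 1.0) -> torch.Tensor:
    """KL(target || draft) over the vocabulary for each position.

    draft_logits, target_logits: [batch, seq_len, vocab] computed on the same input ids.
    """
    t = temperature
    draft_logp = F.log_softmax(draft_logits / t, dim=-1)
    target_p = F.softmax(target_logits / t, dim=-1)
    # kl_div expects log-probabilities for the input and probabilities for the target
    return F.kl_div(draft_logp, target_p, reduction="batchmean") * (t * t)
```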
In practice, the gain from distillation is real but smaller than people expect, usually 5 to 10 percentage points of acceptance rate. That is enough to be worth doing if you serve at scale, but not enough to bother with for most deployments. The cost is that you now have a custom draft model that needs to be retrained every time your target model changes, which in 2026 is a meaningful operational burden.
The exception is domain-specialized serving. If you serve mostly code, or mostly customer support chats, or mostly structured tool calls, distilling a draft model on traffic from your domain pushes acceptance rates into the 90% range. At that point the trade-off shifts. We have seen production code-completion deployments where a 1B distilled draft hits 92 to 94% acceptance against a 32B target, which is hard to beat with any off-the-shelf model.
Trade-off 4: quantization of the draft
Most people quantize the target model and forget about the draft. This is a mistake. The draft model's forward pass cost shows up directly in the speedup formula, and quantizing the draft (FP8 or INT4) cuts that cost roughly in half with minimal acceptance rate loss.
The reason quantization is safer on the draft than on the target is that you do not actually need the draft to be accurate. You need it to propose tokens that the target will accept. Even if INT4 quantization shaves a few points off the draft's standalone perplexity, the verification step catches any divergent tokens, so the only cost is a slightly lower acceptance rate. In our experience that cost is usually 2 to 4 percentage points, while the latency savings are 30 to 50%.
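As an illustrative calculation with the formula above: at k = 5, a draft at 10% of the target pass with 80% acceptance estimates to about 3.69 / 1.50 ≈ 2.5x; halve the draft cost to 5% and drop acceptance to 77%, and the estimate moves to about 3.44 / 1.25 ≈ 2.8x.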
If your target is FP8 or BF16 and your draft is also full precision, you are leaving easy speedup on the table. Quantize the draft.
How to actually measure your setup
Two numbers tell you almost everything:
- Acceptance rate (α): the fraction of draft tokens that survive verification. Measure on real production traffic, not on benchmark prompts. Acceptance rate on MMLU-style multiple choice can be 20 percentage points higher than acceptance rate on free-form chat. Use what you actually serve.
- Mean accepted length per round: how many tokens you commit per target forward pass on average. This is the metric your latency depends on. With k draft tokens proposed, mean accepted length is (1 - α^(k+1)) / (1 - α). The marginal benefit of more draft tokens decreases fast as α drops.
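Both numbers fall out of the per-round accept counts that most serving stacks already log. A minimal sketch, assuming you can extract a list of (proposed, accepted) pairs per speculation round from your own logs:

```python
def summarize_rounds(rounds: list[tuple[int, int]]) -> dict:
    """rounds: (tokens_proposed, tokens_accepted) for each speculation round."""
    proposed = sum(p for p, _ in rounds)
    accepted = sum(a for _, a in rounds)
    return {
        # Fraction of draft tokens that survived verification
        "acceptance_rate": accepted / proposed,
        # Tokens committed per target pass: accepted drafts plus the one token
        # the target itself contributes at each verification step
        "mean_accepted_length": (accepted + len(rounds)) / len(rounds),
    }

# Illustrative numbers, not measurements
print(summarize_rounds([(5, 5), (5, 3), (5, 4), (5, 1)]))
```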
If your acceptance rate is below 65%, you have probably picked the wrong draft model. Go look at where rejections happen. Are they early in the response (which suggests a tokenizer or prompting mismatch) or late (which suggests the draft is fine for short patterns but loses coherence on longer continuations)?
If your acceptance rate is above 90% and you are still not seeing the speedup you expected, your draft pass is too expensive. Quantize it, shrink it, or look at whether you have set k too high (proposing 8 tokens when you only ever accept 3 wastes draft compute).
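The draft cost c is easy to measure directly with a wall-clock micro-benchmark. A minimal sketch; `run_draft` and `run_target` are hypothetical hooks that each execute one forward pass in whatever stack you serve with, and they must block until the device work completes for the timing to mean anything.

```python
import time

def relative_draft_cost(run_draft, run_target, iters: int = 50) -> float:
    """Estimate c: wall-clock time of one draft pass over one target pass.

    run_draft / run_target: zero-arg callables that each execute one forward
    pass on representative input and block until the GPU work finishes.
    """
    def bench(fn) -> float:
        fn()  # warm-up
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        return (time.perf_counter() - start) / iters

    return bench(run_draft) / bench(run_target)
```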
Picking k, the number of proposed tokens
The right value of k depends on α. As a rough guide:
- α around 60%: k = 3 or 4
- α around 75%: k = 4 to 6
- α around 85%: k = 6 to 8
- α above 90%: k = 8 or higher, sometimes a tree structure helps more than a linear chain
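If you want to sanity-check this guide against your own numbers, a short sweep over k with the speedup estimate from earlier makes the diminishing returns visible. The acceptance rates and the 5% draft cost below are placeholders; note that the idealized formula ignores per-round overheads (sampling, scheduling, KV-cache bookkeeping), so it tends to suggest slightly larger k than the guide above.

```python
def expected_speedup(alpha: float, c: float, k: int) -> float:
    tokens_per_pass = sum(alpha ** i for i in range(k + 1))  # expected tokens per target pass
    return tokens_per_pass / (1 + k * c)                     # k draft passes + 1 target pass

c = 0.05  # placeholder: draft pass at ~5% of a target pass
for alpha in (0.60, 0.75, 0.85, 0.92):
    best_k = max(range(1, 17), key=lambda k: expected_speedup(alpha, c, k))
    print(f"alpha={alpha:.2f}  best k={best_k}  "
          f"speedup≈{expected_speedup(alpha, c, best_k):.2f}x")
```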
If you are using EAGLE-2 or Sequoia, the algorithm picks the tree structure for you based on confidence, so you mostly stop worrying about k as a fixed parameter. For vanilla speculative decoding with a draft model, picking k is still a manual tuning step.
When to skip speculative decoding entirely
Speculative decoding helps in latency-bound serving (low concurrency, single-user requests, voice agents, autocomplete). It helps less, and sometimes hurts, in throughput-bound serving (large batch sizes, offline inference, batch jobs).
The reason is that speculative decoding fundamentally trades extra compute for fewer sequential dependencies. When you are batching 64 requests, the GPU is already saturated on compute for every forward pass. Adding speculation does not buy you parallelism you did not already have, and the verification overhead can actually slow things down.
Rule of thumb: if your time-to-first-token matters more than your tokens-per-dollar, speculative decoding is probably worth it. If you are running offline summarization on millions of documents and tokens-per-dollar is the only metric, it usually is not.
What this looks like at General Compute
The reason draft model selection matters so much on GPU is that the target forward pass is slow. When a single decode step takes 70 milliseconds, every additional token you can squeeze out of that pass is worth real money. The whole speculative decoding ecosystem exists because GPUs are bandwidth-bound on autoregressive workloads.
General Compute serves on inference-optimized ASICs. The target forward pass is already fast, which changes the math on speculation. The savings per accepted token are smaller in absolute terms, but the latency floor is lower to start with, and techniques like speculative decoding still compound on top. In practice we see customers run smaller drafts (often 1B class) and lean harder on prefix caching and disaggregated prefill, because once the target is fast, the marginal value of speculation is bounded by how much draft cost you can amortize.
If you are picking a draft model right now, the short version is: same family, 1B to 3B parameters, FP8 quantized, and measure acceptance rate on real traffic before you tune anything else. Get those four right and you will capture most of the available speedup.
Sign up at generalcompute.com and get $5 in free credit to try inference where speculative decoding stops being the only thing keeping latency bearable.