Speculative Decoding: Getting 3x Speedups Without Changing the Model
LLMs generate text one token at a time. Each token requires a full forward pass through the model, and each pass is bottlenecked by memory bandwidth (how fast you can read the model's weights from memory), not by compute (how fast you can do the math). This means the GPU sits mostly idle during generation, waiting on memory.
Speculative decoding attacks this problem with a simple idea: use a small, fast model to guess multiple tokens ahead, then verify all those guesses in a single pass through the large model. When the guesses are right (and they often are), you get multiple tokens for the cost of one large-model pass.
The best part: the output is mathematically identical to what the large model would have produced on its own. No quality tradeoff.
Why Standard Decoding Is Slow
To understand why speculative decoding helps, you need to understand why normal decoding is inefficient.
During generation, each forward pass through a large model (say, 70 billion parameters) requires reading all those parameters from GPU memory. On an A100, reading 70B parameters in FP16 means moving about 140GB of data through a memory bus that tops out at around 2TB/s. That's roughly 70ms just for the memory transfer, regardless of how fast the math is.
The actual matrix multiplications for a single token use only a small fraction of the GPU's compute capacity. The arithmetic intensity (ratio of compute to memory access) is very low during decoding. The GPU's tensor cores are mostly idle, waiting for data to arrive from memory.
This means that processing one token and processing several tokens in parallel cost almost the same wall-clock time: the bottleneck is reading the model weights, and you read those weights once regardless of how many tokens you process. (This is also why the prefill phase, where the entire input prompt is processed in parallel, is much more efficient per token than decoding.)
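As a sanity check on those numbers, the transfer-time estimate is one line of arithmetic. The figures below are the article's round numbers (70B parameters, FP16, ~2 TB/s), not measured values:

```python
# Back-of-envelope decode-time estimate for a 70B model in FP16.
# Real systems also read the KV cache and lose some bandwidth to
# kernel inefficiency, so this is a lower bound per step.

PARAMS = 70e9           # 70B parameters
BYTES_PER_PARAM = 2     # FP16
BANDWIDTH = 2e12        # ~2 TB/s HBM bandwidth (A100-class)

weight_bytes = PARAMS * BYTES_PER_PARAM            # 140 GB per forward pass
transfer_time_ms = weight_bytes / BANDWIDTH * 1e3  # → ~70 ms

print(f"{transfer_time_ms:.0f} ms per decode step just to stream weights")
# At one token per pass, throughput caps out near 1000/70 ≈ 14 tok/s,
# no matter how fast the tensor cores are.
```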
Speculative decoding exploits exactly this property.
How It Works
The algorithm uses two models: a small, fast "draft" model and the full-size "target" model you actually want to serve.
Step 1: Draft. The small model (something like a 1-2B parameter model from the same family) generates K candidate tokens autoregressively. Because the draft model is tiny, this is very fast, maybe 5-10ms for K=5 tokens.
Step 2: Verify. Feed all K draft tokens into the target model in a single forward pass. The target model processes them in parallel (like a mini-prefill), producing probability distributions for each position. This single pass costs about the same as generating one token normally.
Step 3: Accept or reject. For each draft token, compare the draft model's probability with the target model's probability using a specific acceptance criterion:
- Accept the token with probability min(1, p_target(token) / p_draft(token))
- If rejected, resample from a corrected distribution: normalize(max(0, p_target - p_draft))
Step 4: Return. The output for this step is all accepted tokens plus one more: the resampled replacement if a draft was rejected, or, if all K drafts were accepted, a fresh token sampled from the target model's distribution at position K+1.
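The four steps above can be sketched in a few lines of Python. This is a toy: the draft and target "models" are just precomputed `{token: probability}` dicts passed in directly, and the names (`speculative_step`, `residual`, and so on) are made up for illustration. A real implementation runs the verify step as one batched forward pass through the target model.

```python
import random

def sample(dist, rng):
    """Sample a token from a {token: prob} dict."""
    r, cum = rng.random(), 0.0
    for tok, p in dist.items():
        cum += p
        if r < cum:
            return tok
    return tok  # guard against floating-point rounding

def residual(p_target, p_draft):
    """normalize(max(0, p_target - p_draft)), the corrected distribution."""
    diff = {t: max(0.0, p_target[t] - p_draft.get(t, 0.0)) for t in p_target}
    z = sum(diff.values())
    return {t: v / z for t, v in diff.items()}

def speculative_step(draft_dists, target_dists, rng=random):
    """One draft-verify-accept step over a toy vocabulary.

    draft_dists[i] and target_dists[i] are the two models' next-token
    distributions at draft position i; target_dists carries one extra
    entry, the distribution at the bonus position reached when every
    draft is accepted.
    """
    out = []
    for q, p in zip(draft_dists, target_dists):
        tok = sample(q, rng)                            # Step 1: draft
        if rng.random() < min(1.0, p[tok] / q[tok]):    # Step 3: accept?
            out.append(tok)
        else:
            out.append(sample(residual(p, q), rng))     # resample and stop
            return out
    out.append(sample(target_dists[-1], rng))           # Step 4: bonus token
    return out
```

Because a rejection resamples from the corrected residual distribution, the token emitted at each position is distributed exactly as if the target model had sampled it directly, which is the guarantee described above.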
The acceptance/rejection scheme is the mathematical core. It guarantees that the final output distribution is exactly equal to sampling from the target model alone. This isn't an approximation or a heuristic. It's a provable guarantee. You get identical quality with fewer target model forward passes.
How Much Faster Is It?
The speedup depends on how well the draft model matches the target model's distribution. When the draft model predicts the same tokens the target model would have chosen (which happens frequently for common patterns, boilerplate code, and predictable text), most tokens get accepted and you get close to K+1 tokens per target model pass.
In practice, typical acceptance rates are 70-85% for well-matched draft/target pairs (like using Llama 3 8B to draft for Llama 3 70B). This translates to 2-3x wall-clock speedups on generation.
The speedup is roughly (average_tokens_accepted + 1) / (K * cost_draft/cost_target + 1): tokens gained per step, divided by the cost of that step in units of one target-model pass. Since the draft model is 10-50x smaller, the cost ratio is small and the denominator stays close to 1.
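Plugging illustrative numbers into that formula reproduces the 2-3x range. The helper names and the assumption that each draft token is accepted independently with the same probability are simplifications for this sketch:

```python
def expected_accepted(alpha, K):
    """Expected number of accepted drafts if each of K drafts is accepted
    independently with probability alpha (a simplifying assumption):
    a truncated geometric series."""
    return alpha * (1 - alpha**K) / (1 - alpha)

def speedup(alpha, K, cost_ratio):
    """(accepted + 1) tokens per step, over the step's cost in units of
    one target-model forward pass (K draft passes plus one verify pass)."""
    return (expected_accepted(alpha, K) + 1) / (K * cost_ratio + 1)

# 70% acceptance, K=5 drafts, draft model ~20x cheaper per pass:
print(round(speedup(0.70, 5, 0.05), 2))  # → 2.35
```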
The Two Original Papers
Speculative decoding was independently discovered by two teams at almost the same time:
Leviathan et al. (Google, November 2022) published "Fast Inference from Transformers via Speculative Decoding" and demonstrated the technique on T5-XXL, showing 2-3x acceleration with no quality degradation. They formally proved the output distribution equivalence.
Chen et al. (DeepMind, February 2023) published "Accelerating Large Language Model Decoding with Speculative Sampling" and validated the approach on Chinchilla 70B in distributed settings, showing 2-2.5x speedups. They called their version "speculative sampling" and provided a slightly different but equivalent mathematical formulation.
Both papers arrived at the same core idea independently, which usually means the idea is fundamental. And it has proven to be exactly that. Speculative decoding is now supported in every major serving framework (vLLM, TensorRT-LLM, SGLang) and used by most inference providers.
Where Speculative Decoding Shines
The technique works best when:
- The draft model is a good predictor of the target. Models from the same family work well (Llama 8B drafting for Llama 70B). The more the distributions align, the higher the acceptance rate.
- The output is somewhat predictable. Code generation, structured output (JSON), and formulaic text have high acceptance rates. Creative, high-temperature generation has lower rates.
- You care about latency, not just throughput. Speculative decoding helps individual request latency. Under very high load, the extra compute for the draft model can actually reduce overall throughput. It's a latency optimization, not a throughput optimization.
- The model is large enough that decoding is memory-bound. For very small models (7B and under), decoding is already fast enough that the overhead of running a draft model doesn't pay off.
The Hardware Angle
Speculative decoding is, at its core, a workaround for the memory bandwidth bottleneck of GPU-based inference. The entire technique exists because reading 70B+ parameters from HBM is slow, and the GPU's compute capacity goes to waste during that read.
General Compute runs entirely on inference-optimized ASICs instead of NVIDIA GPUs, and the memory bandwidth equation on these chips is fundamentally different. The bottleneck between memory and compute is much narrower for inference workloads, which means the baseline decoding speed is already closer to what speculative decoding tries to achieve on GPUs. And when we apply speculative decoding on top of that, the gains compound on an already-fast baseline.
The result is inference speed that GPU-based systems can't match even with perfect speculative decoding implementations. Sign up at generalcompute.com and get $5 in free credit to see for yourself.
Papers and References
- Fast Inference from Transformers via Speculative Decoding (Leviathan et al., 2022; ICML 2023)
- Accelerating Large Language Model Decoding with Speculative Sampling (Chen et al., 2023)