Medusa, EAGLE, and Sequoia: The Next Generation of Speculative Decoding
The original speculative decoding papers (covered in our previous post) showed that you could get 2-3x speedups by using a small draft model to guess tokens ahead, then verifying them in bulk. But they had practical limitations: you needed to find, deploy, and serve a separate draft model alongside your target model, and the speedup was capped by how well that draft model matched the target's predictions.
In 2024, three papers pushed speculative decoding significantly further. Medusa added extra prediction heads directly to the target model. EAGLE found that predicting in feature space (the model's internal representations) is easier than predicting tokens. And Sequoia figured out the optimal tree structure for verifying multiple candidate continuations at once.
Medusa: No Draft Model Needed
Medusa (Cai et al., January 2024) takes a different approach to speculation. Instead of running a separate draft model, it adds multiple lightweight "heads" on top of the target model itself. The model's original LM head still predicts the next token (position t+1), while each Medusa head predicts a token further out: head 1 predicts the token at position t+2, head 2 predicts t+3, and so on.
These heads are small (they add less than 2% to the model's total parameter count) and can be trained on a relatively small amount of data. Since they sit on top of the target model and share its internal representations (the rich understanding the model has built up through all its layers), they have much better information to work with than a separate small model would.
The clever part is how verification works. Medusa doesn't just check one linear sequence of candidates. It constructs a tree of possible continuations (for example, if head 1 predicts tokens A or B, and head 2 predicts tokens C or D, you get a tree with branches AC, AD, BC, BD) and uses tree-structured attention to verify multiple branches in a single forward pass.
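The heads-plus-tree idea above can be sketched in a few lines. This is a toy illustration, not Medusa's implementation: each head is shrunk to a single residual SiLU block reusing a shared LM head (the paper's heads have their own output projections), and the model state is random stand-in data.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, top_k = 16, 100, 2

# Stand-ins for the frozen target model's last hidden state and LM head.
hidden = rng.normal(size=d_model)
lm_head = rng.normal(size=(d_model, vocab))

# Each "Medusa head" here is a tiny residual block over the shared hidden
# state -- the extra parameters are small relative to the base model.
head_weights = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(2)]

def head_topk(h, W):
    z = h @ W
    h2 = h + z / (1.0 + np.exp(-z))      # residual + SiLU
    logits = h2 @ lm_head
    return list(np.argsort(logits)[-top_k:][::-1])  # top-k token ids

# Top-k candidates per future position -> Cartesian tree of branches,
# all of which get verified in one forward pass via tree attention.
cands = [head_topk(hidden, W) for W in head_weights]
branches = [(a, b) for a in cands[0] for b in cands[1]]
print(len(branches))  # 2 heads x top-2 each -> 4 branches
```

With more heads or a larger top-k the tree grows quickly, which is why the verification pass prunes it to a fixed set of high-probability branches in practice.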
Results: 2.2-3.6x speedup on various models without needing a separate draft model at all. Medusa-1 only trains the extra heads (leaving the base model frozen), while Medusa-2 jointly fine-tunes the heads and the base model for even higher acceptance rates.
Tradeoff: You need to train the Medusa heads for each model you want to serve, which adds a preparation step that vanilla speculative decoding doesn't require.
EAGLE: Predicting Features Instead of Tokens
EAGLE (Li et al., January 2024) started from a simple observation: predicting what token comes next is hard (that's the whole reason we need large language models in the first place). But predicting what the model's internal features (its hidden state vectors, the numerical representations it builds as it processes text) will look like at the next position is much easier, because features change more smoothly and predictably than the discrete token distribution.
EAGLE trains a lightweight autoregressive head that operates on the target model's second-to-top-layer features. Given the current feature vector, it predicts the next feature vector, which is then projected to a token distribution for verification. The trick is that it also uses the token embedding from one step ahead as additional input, which resolves a lot of the uncertainty.
Like Medusa, EAGLE uses tree-structured verification to check multiple candidates in one forward pass.
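The feature-space drafting loop can be sketched as follows. This is a simplification under stated assumptions: the real EAGLE head is a small transformer decoder layer, shrunk here to one linear layer over random stand-in weights; `eagle_draft_step` is an illustrative name, not the library's API.

```python
import numpy as np

rng = np.random.default_rng(1)
d, vocab = 8, 50

# Stand-ins for the frozen target model's pieces.
embed = rng.normal(size=(vocab, d))      # token embedding table
lm_head = rng.normal(size=(d, vocab))    # output projection

# The lightweight draft head: here a single linear layer mapping
# [current feature ; embedding of the token one step ahead] -> next feature.
W_draft = rng.normal(size=(2 * d, d)) * 0.1

def eagle_draft_step(feature, token):
    """Predict the next second-to-top-layer feature, then project it
    through the frozen LM head to draft the following token."""
    x = np.concatenate([feature, embed[token]])
    next_feature = x @ W_draft
    draft_token = int(np.argmax(next_feature @ lm_head))
    return next_feature, draft_token

# Autoregress in feature space for a few steps to build a draft sequence.
feature, token = rng.normal(size=d), 3
drafts = []
for _ in range(4):
    feature, token = eagle_draft_step(feature, token)
    drafts.append(token)
print(drafts)  # four draft token ids for the target model to verify
```

Note the key design choice the sketch preserves: the draft head never sees raw tokens alone; it always conditions on the continuous feature plus the already-sampled token embedding, which is what resolves the sampling uncertainty.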
Results: EAGLE achieves 2.7-3.5x latency speedup on Llama 2 Chat 70B with a provable guarantee that the output distribution is identical to standard decoding. This makes it faster than Medusa while maintaining the lossless property.
EAGLE-2 (June 2024) improved on this by making the draft tree structure context-dependent. Instead of using a fixed tree shape for every input, EAGLE-2 dynamically constructs the tree based on the confidence of each prediction, allocating more branches where the model is uncertain and fewer where it's confident. This increased the average number of accepted tokens per step without any additional training.
EAGLE-3 (March 2025) went further by abandoning feature prediction entirely in favor of direct token prediction, combined with multi-layer feature fusion. Earlier EAGLE versions hit diminishing returns when trained on more data. EAGLE-3's architecture scales better, continuing to improve with more training examples.
Sequoia: Hardware-Aware Optimal Trees
Sequoia (Chen et al., February 2024) approached the problem from a systems perspective. Both Medusa and EAGLE use tree-structured verification, but how do you pick the best tree shape?
Sequoia uses dynamic programming (an algorithmic technique for finding optimal solutions by breaking problems into subproblems) to find the tree topology (number of branches, depth at each level) that maximizes the expected number of accepted tokens, given the draft model's token probabilities. The optimal tree shape depends on the draft model's accuracy and the available compute budget.
Critically, Sequoia also makes the tree structure hardware-aware. The optimal tree for an A100 (high memory bandwidth, moderate batch capacity) is different from the optimal tree for an L40 (less bandwidth, different compute characteristics) or a CPU-offloaded setup. Sequoia's optimizer automatically adapts to the target hardware.
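The dynamic program can be sketched under a simplified acceptance model (my assumption, not Sequoia's exact formulation): suppose the verifier accepts a node's rank-i draft child, and rejects all higher-ranked siblings, with a fixed probability `M[i]`, independent of depth. The node `budget` plays the role of the hardware-dependent compute budget.

```python
from functools import lru_cache

# Illustrative (made-up) marginal acceptance probabilities per child rank.
# In practice these would be estimated empirically for a draft/target pair.
M = (0.55, 0.15, 0.07, 0.03)

@lru_cache(maxsize=None)
def best(rank, budget):
    """Max expected accepted tokens from one node's children, considering
    ranks >= `rank`, with `budget` draft nodes left to spend."""
    if rank == len(M) or budget == 0:
        return 0.0
    value = 0.0                          # option: add no more children
    for b in range(budget):              # a child costs 1 node; its subtree gets b
        with_child = M[rank] * (1.0 + best(0, b)) + best(rank + 1, budget - 1 - b)
        value = max(value, with_child)
    return value

# Optimal expected accepted tokens for draft-tree budgets of 1..8 nodes:
print([round(best(0, n), 3) for n in range(1, 9)])
```

The subproblem structure is exactly the "break into subproblems" idea from the text: the best tree of `n` nodes is built from the best subtrees of fewer nodes, and the returns diminish as the budget grows, which is why the hardware's sweet spot matters.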
Results: Up to 4.04x speedup on an A100 for Llama 2 7B. And for offloaded inference (where the model partially lives in CPU memory or NVMe storage because it doesn't fit entirely in GPU memory), Sequoia achieves up to 9.96x speedup, bringing Llama 2 70B to 0.56 seconds per token on an L40 GPU that couldn't practically serve the model otherwise.
Sequoia also introduced a novel sampling and verification method that works well at higher temperatures (where the model's output is more random and creative). This was a weakness of earlier speculative decoding methods, which tended to see lower acceptance rates with high-temperature sampling.
How They Compare
| Method | Reported Speedup | Needs Draft Model? | Lossless? | Extra Training? |
|---|---|---|---|---|
| Vanilla Speculative Decoding | 2-3x | Yes (separate model) | Yes | No |
| Medusa | 2.2-3.6x | No (heads on target) | Not with typical acceptance | Yes (heads) |
| EAGLE | 2.7-3.5x | No (feature predictor) | Yes | Yes (predictor) |
| EAGLE-2 | Higher than EAGLE | No | Yes | Same as EAGLE |
| Sequoia | Up to 4x (9.96x offloaded) | Yes | Yes | No |
The general trend: each new method finds a smarter way to speculate. Medusa eliminated the separate draft model. EAGLE made predictions more accurate by working in feature space. EAGLE-2 made the verification tree adaptive. Sequoia optimized the tree shape for specific hardware.
Prompt Lookup Decoding: The Zero-Overhead Approach
Worth mentioning alongside these methods: prompt lookup decoding (Apoorv Saxena, November 2023) is the simplest form of speculation. It doesn't use a model at all. Instead, it looks for n-gram matches (repeating sequences of tokens) between the input prompt and recently generated text. When it finds a match, it uses the tokens that followed that pattern in the prompt as draft candidates.
This is surprisingly effective for tasks where the output is likely to repeat parts of the input: summarization, question answering with context, code editing, and structured output. It achieves 2-4x speedup on these tasks with literally zero model overhead. It's now built into HuggingFace Transformers and vLLM.
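The whole mechanism fits in a short function. This is a simplified sketch of the idea, with a hypothetical name (`prompt_lookup_draft`), not the HuggingFace or vLLM implementation:

```python
def prompt_lookup_draft(tokens, ngram_size=3, num_draft=5):
    """Find the most recent earlier occurrence of the last `ngram_size`
    tokens and propose the tokens that followed it as draft candidates."""
    if len(tokens) < ngram_size:
        return []
    pattern = tokens[-ngram_size:]
    # Scan backwards (most recent match first), excluding the trailing
    # occurrence of the pattern itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == pattern:
            follow = tokens[start + ngram_size:start + ngram_size + num_draft]
            if follow:
                return follow
    return []

context = "the cat sat on the mat . the cat sat".split()
print(prompt_lookup_draft(context))  # ['on', 'the', 'mat', '.', 'the']
```

If the verifier rejects the drafts, you've lost nothing but the lookup, which is why this counts as zero-overhead: there is no draft model to run at all.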
Why ASICs Compound These Gains
All of these techniques share a common foundation: they get more useful tokens out of each target model forward pass. On GPUs, each forward pass is memory-bandwidth-bound (the GPU spends most of its time waiting to read model weights from memory), so you're fundamentally limited by how fast the memory bus can deliver data.
General Compute is the only neocloud built entirely on inference-optimized ASICs instead of NVIDIA GPUs. On these chips, the memory bandwidth equation is fundamentally different. The baseline forward-pass latency is already much lower, and when speculative decoding techniques generate multiple tokens per pass, each of those "free" tokens arrives faster.
A technique that gives you 3x more tokens per forward pass on a GPU with 70ms per pass saves you 140ms. The same technique on an ASIC with 20ms per pass still saves you 40ms, but your absolute latency is dramatically lower. The gains from speculative decoding and the gains from inference-optimized hardware multiply together.
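The arithmetic above, made concrete (the 70ms and 20ms figures are the text's illustrative numbers, not benchmarks):

```python
def saved_ms(pass_ms, tokens_per_pass):
    """Time saved generating `tokens_per_pass` tokens in one verified
    forward pass instead of one pass per token."""
    return pass_ms * tokens_per_pass - pass_ms

print(saved_ms(70, 3))  # 140 ms saved per pass at GPU-like latency
print(saved_ms(20, 3))  # 40 ms saved per pass at ASIC-like latency
```

The absolute savings shrink on faster hardware, but per-token latency (pass time divided by tokens per pass) drops multiplicatively: the speculation speedup and the hardware speedup stack.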
Sign up at generalcompute.com and get $5 in free credit to see what compounded inference optimization feels like.
Papers and References
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads (Cai et al., 2024 -- ICML 2024)
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty (Li et al., 2024 -- ICML 2024)
- EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees (Li et al., 2024 -- EMNLP 2024)
- EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test (Li et al., 2025)
- Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding (Chen et al., 2024 -- NeurIPS 2024)
- Prompt Lookup Decoding (Saxena, 2023)