What Is Speculative Decoding? How It Makes LLMs 3x Faster

Large language models generate text one token at a time. To produce the next token, the model reads everything written so far, runs a full forward pass through every layer, and emits a single token. Then it does the whole thing again for the token after that. This is the central reason LLM inference feels slow: a 70B model might take 20 to 30 milliseconds per token, and a 500-token response is 500 sequential passes through the network with no way to skip ahead.

Speculative decoding is a technique that breaks this sequential bottleneck without changing the model's output. The idea is to guess several tokens at once with a cheap method, then check all of those guesses in a single pass of the expensive model. When the guesses are right, you get multiple tokens for the price of one forward pass. When they are wrong, you still get the one token that normal generation would have produced, and you lose almost nothing. In practice this commonly produces 2x to 3x faster generation, and sometimes more, while matching the target model's output: the identical text under greedy decoding, and the identical output distribution when sampling.

This guide explains how speculative decoding works, why it does not degrade quality, the main variants you will encounter (including Medusa and Eagle), and what kind of speedups are realistic.

The bottleneck: why decoding is sequential

It helps to be precise about where the time goes. LLM inference has two phases. The first is prefill, where the model processes your entire prompt in parallel and builds up the key-value cache. Prefill is compute-bound and reasonably efficient because all the prompt tokens are available at once and can be processed together.

The second phase is decode, where the model generates the response token by token. Decode is memory-bandwidth-bound. For each token, the hardware has to read the full set of model weights out of memory to compute one forward pass, and the arithmetic involved is small relative to the cost of moving those weights. This is the key observation that makes speculative decoding work: when you generate a single token, the GPU spends most of its time moving weights and very little of its compute capacity is actually used. There is idle arithmetic headroom on every decode step.

A normal decode step processes one token position. But a forward pass can score many positions at once for almost the same memory cost, because the weights only need to be loaded once regardless of how many positions you push through them. That is exactly what prefill exploits. Speculative decoding is, in a sense, a way to make the decode phase behave a little more like prefill by giving the model several candidate positions to evaluate in one shot.
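A back-of-the-envelope estimate makes this concrete. The sketch below treats one forward pass as taking whichever is slower, reading the weights once or doing the arithmetic for every position. The hardware figures in the usage lines (70B parameters at one byte each, 3 TB/s of memory bandwidth, 1 PFLOP/s of compute) are illustrative assumptions rather than measurements, and the model ignores KV-cache reads and attention cost.

```python
def decode_pass_ms(params, bytes_per_param, positions, mem_bw_bytes_s, flops_s):
    """Rough time in ms for one forward pass over `positions` token positions."""
    # Weights are read from memory once per pass, however many positions there are.
    memory_s = params * bytes_per_param / mem_bw_bytes_s
    # Roughly 2 FLOPs per parameter per position (one multiply, one add).
    compute_s = 2 * params * positions / flops_s
    # The pass takes as long as the slower of the two.
    return 1000 * max(memory_s, compute_s)


# Illustrative (assumed) numbers: 70B params, 1 byte each, 3 TB/s, 1 PFLOP/s.
one_position = decode_pass_ms(70e9, 1, 1, 3e12, 1e15)
five_positions = decode_pass_ms(70e9, 1, 5, 3e12, 1e15)
```

Under these assumptions both calls come out at about 23 ms, because the arithmetic for five positions (well under 1 ms) is hidden behind the time it takes to stream the weights.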

The core idea: draft and verify

Speculative decoding uses two models, or two mechanisms, working together.

The draft model is small and fast. It might be a 1B model paired with a 70B target, or a 7B model paired with a 405B target. Its job is to quickly propose a short sequence of candidate tokens, say the next four or five tokens, by running its own fast autoregressive generation. Because it is small, generating those candidates is cheap.

The target model is the large, accurate model whose output you actually want. Instead of generating tokens one at a time, the target model takes the draft's proposed tokens and verifies all of them in a single forward pass. This is the part that saves time: scoring five candidate positions in one pass costs roughly the same as generating one token normally, because the dominant cost is loading the weights, which happens once.

Here is the loop in plain terms:

1. The draft model proposes the next K tokens (for example, 5).
2. The target model runs one forward pass over those K positions, producing its own probability distribution at each position.
3. You compare the draft's tokens against what the target model would accept. Accept the longest prefix the target agrees with, and reject the rest.
4. The target model's own prediction for the first rejected position gives you one guaranteed-correct token for free. If all K tokens were accepted, the same pass also yields the target's prediction for the position after the last draft token.
5. Repeat from the new position.

If the draft proposed 5 tokens and the target accepts the first 3, you have produced 4 correct tokens (the 3 accepted plus 1 from the target's correction) in a single target forward pass instead of 4 passes. If the draft is wrong immediately, you still get 1 correct token from the target, which is exactly what you would have gotten with normal decoding. The downside case costs you only the small overhead of running the draft model.
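The accounting in that paragraph can be checked with a few lines of Python. This is a minimal sketch of the greedy case only; the function name and the plain integer token IDs are made up for illustration.

```python
def accept_greedy(draft_tokens, target_choices):
    """Tokens produced by one verify pass under greedy decoding.

    target_choices has len(draft_tokens) + 1 entries: the target's top choice
    at each drafted position, plus one for the position after the last draft.
    """
    accepted = []
    for drafted, wanted in zip(draft_tokens, target_choices):
        if drafted != wanted:
            break
        accepted.append(drafted)
    # The target's own choice at the first mismatch (or at the extra final
    # position when every draft token matched) is always kept.
    accepted.append(target_choices[len(accepted)])
    return accepted


# 5 drafted tokens, the target agrees with the first 3.
partial = accept_greedy([5, 6, 7, 8, 9], [5, 6, 7, 1, 2, 3])
# The draft is wrong immediately: still one token from the target.
worst_case = accept_greedy([5, 6], [9, 6, 7])
```

Here `partial` holds four tokens (three accepted plus the target's correction) and `worst_case` holds one, matching the two cases described above.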

Why the output quality is identical

The most common worry about speculative decoding is that a small, less accurate draft model will pollute the output. It does not, and this is worth understanding because it is what makes the technique safe to deploy.

The trick is in the acceptance step. The original speculative sampling algorithm (introduced in work from DeepMind and Google in 2023) uses a rejection sampling scheme that is mathematically proven to produce tokens from exactly the same distribution as the target model alone. A draft token is accepted with probability min(1, p/q), where p and q are the probabilities that the target and the draft assign to that token. When it is rejected, the next token is resampled from an adjusted distribution: the part of the target distribution that the draft under-covers, max(0, p - q), renormalized. The net effect is that the sequence of tokens you produce is statistically identical to sampling directly from the target model.
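The claim can be checked on a toy vocabulary. The sketch below assumes the standard acceptance rule, where a token drawn from the draft distribution q is accepted with probability min(1, p/q) against the target distribution p, and a rejection is resampled from max(0, p - q) after renormalizing. The function names are illustrative. `exact_output_distribution` adds up the probability of every path to each token, so the check needs no random trials.

```python
import random


def residual_distribution(p, q):
    """Normalized max(0, p - q): what to resample from after a rejection."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    if total == 0.0:  # p and q are identical, so a rejection never happens
        return list(p)
    return [r / total for r in raw]


def speculative_sample(p, q, rng=random):
    """Draw one token: propose from q (draft), then accept or resample against p."""
    tokens = range(len(p))
    x = rng.choices(tokens, weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    return rng.choices(tokens, weights=residual_distribution(p, q))[0]


def exact_output_distribution(p, q):
    """Probability of each token coming out of speculative_sample, in closed form."""
    accept = [qi * min(1.0, pi / qi) if qi > 0 else 0.0 for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accept)
    residual = residual_distribution(p, q)
    return [a + reject_mass * r for a, r in zip(accept, residual)]


target_p = [0.5, 0.3, 0.2]
draft_q = [0.2, 0.2, 0.6]  # a deliberately poor draft
result = exact_output_distribution(target_p, draft_q)
```

Even with a draft that puts most of its mass on the wrong token, `result` matches `target_p` up to floating-point error. The poor draft only raises the rejection probability, which costs speed rather than quality.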

So the draft model never decides the output. It only proposes candidates that the target model either ratifies or overrules. A bad draft model does not make the output worse; it just makes fewer of its guesses accepted, which reduces the speedup. A good draft model gets more guesses accepted and gives you a bigger speedup. Quality is pinned to the target model in all cases. (For greedy decoding the argument is even simpler: you accept a draft token only if it matches the target's argmax, so the result is bit-for-bit what greedy decoding on the target would produce.)

This property is what separates speculative decoding from approximations like aggressive quantization or distillation, which trade some accuracy for speed. Speculative decoding trades nothing on quality. It is a pure latency optimization.

What determines the speedup

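The speedup depends on one number above all: the average number of tokens produced per target forward pass, often called the average accepted length. It is closely tied to the acceptance rate, the fraction of drafted tokens the target accepts, and the two terms are sometimes used loosely for the same idea. A few factors drive it.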

How well the draft model predicts the target. If the draft model frequently agrees with the target, more of its tokens are accepted and you get longer accepted runs. A draft model trained or distilled to mimic the target performs much better than an unrelated small model. This is why the best results often come from draft and target models in the same family.

The predictability of the text. Some sequences are easy. Boilerplate, common phrases, code with rigid syntax, and repetitive structure are all highly predictable, so the draft model nails long runs and acceptance is high. Novel or surprising content is harder to predict, so acceptance drops. This means the speedup you see varies with the workload, and code generation often benefits more than open-ended creative text.

The draft length K. Proposing more tokens per round raises the ceiling on how many you can accept, but it also raises the cost when the draft is wrong, because you spent draft compute on tokens that got rejected. There is an optimal K for a given draft/target pair and workload, usually somewhere between 3 and 8. Some systems tune K dynamically based on recent acceptance.

Because of these factors, real speedups range widely. A typical, well-matched setup lands in the 2x to 3x range for latency on a single stream. Easy workloads and strong draft models can push past that. Hard workloads or poorly matched drafts give you less.
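A rough formula ties these factors together. It rests on a simplifying assumption: each drafted token is accepted independently with the same probability `alpha`. In that case the expected number of tokens per target pass with draft length `K` is (1 - alpha^(K+1)) / (1 - alpha). Dividing by the relative cost of the round gives a speedup estimate, where `draft_cost` is the time of one draft step as a fraction of one target pass. Real acceptance is not independent across positions, so treat this as a planning aid rather than a prediction.

```python
def expected_tokens_per_pass(alpha, K):
    """Expected tokens produced per target pass, assuming i.i.d. acceptance."""
    if alpha >= 1.0:
        return float(K + 1)  # every draft token accepted, plus the bonus token
    return (1 - alpha ** (K + 1)) / (1 - alpha)


def estimated_speedup(alpha, K, draft_cost):
    """Tokens per round divided by the round's cost in units of one target pass."""
    return expected_tokens_per_pass(alpha, K) / (1 + K * draft_cost)


tokens = expected_tokens_per_pass(0.8, 5)
speedup = estimated_speedup(0.8, 5, 0.05)
```

With an assumed acceptance probability of 0.8, five drafted tokens, and a draft step costing 5% of a target pass, this gives about 3.7 tokens per pass and an estimated speedup just under 3x. Sweeping `K` for a fixed `alpha` and `draft_cost` shows the estimate rising and then falling, which is the optimal-K effect described above.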

The variants: self-drafting and tree attention

The classic two-model approach works, but it has a practical cost: you have to host, load, and run a second model. That motivated a family of variants that avoid a separate draft model or get more out of each verification pass.

Medusa

Medusa removes the separate draft model entirely. Instead, it attaches several small extra prediction heads to the target model itself. The original model has one head that predicts the next token; Medusa adds extra heads that predict the token two positions ahead, three positions ahead, and so on, all from the same hidden state. These heads are lightweight. In the basic recipe they are trained while the base model stays frozen; a second recipe fine-tunes the heads and the base model together.

To generate, the Medusa heads each produce several candidate tokens for their position, and those candidates are arranged into a tree of possible continuations. The model then verifies the whole tree in one forward pass using a specially constructed attention mask (tree attention) so that many candidate sequences are checked simultaneously. The longest valid path through the tree is accepted. Medusa also describes a relaxed "typical acceptance" rule for sampling, which accepts more tokens but gives up the exact-distribution guarantee. Because there is no separate model to run, Medusa is simpler to deploy and avoids the memory overhead of a second network, at the cost of a short training step to learn the extra heads.
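The attention mask is the piece that lets one pass check many branches at once. A minimal sketch, assuming the candidate tree is given as a list of parent indices (a representation chosen here for illustration, not Medusa's actual data structure): each candidate token may attend to itself and its ancestors, and never to tokens on a sibling branch. Attention to the already-accepted context is unrestricted and is left out of the mask.

```python
def tree_attention_mask(parents):
    """mask[i][j] is True when candidate node i may attend to candidate node j.

    parents[i] is the index of node i's parent in the candidate tree, or -1
    for a node that hangs directly off the already-accepted context.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # walk up to the root, marking the node and its ancestors
            mask[i][j] = True
            j = parents[j]
    return mask


# Two candidates for the next token (nodes 0 and 1); node 0 has two children
# (nodes 2 and 3) and node 1 has one child (node 4).
mask = tree_attention_mask([-1, -1, 0, 0, 1])
```

In this example node 2 sees only nodes 0 and 2, and node 4 sees only nodes 1 and 4, so each root-to-leaf path is scored as if it were the only continuation.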

Eagle

Eagle (and its successors Eagle-2 and Eagle-3) refines the drafting step. The original Eagle does the autoregressive prediction at the feature level rather than the token level: instead of predicting raw tokens, its lightweight draft module predicts the target model's internal feature representations (the hidden states just before the output head), which turn out to be more predictable and regular than tokens. It also feeds in the token that was actually sampled at the previous step, which resolves the ambiguity about which continuation the features should follow. This produces more accurate drafts and therefore higher acceptance rates. Eagle-2 makes the draft tree dynamic, and Eagle-3 moves back to predicting tokens directly while drawing on features from several layers of the target. Combined with tree-style verification, Eagle has reported some of the highest speedups among speculative methods, often in the 3x to 4x range on common benchmarks, while keeping the same lossless guarantee. The draft module is small and trained on top of a frozen target model, similar in spirit to Medusa but with a more capable drafting mechanism.

Lookahead and n-gram methods

Another branch skips learned drafting altogether. Lookahead decoding generates and verifies n-grams in parallel using a fixed algorithm with no draft model and no extra training. Simpler prompt-lookup methods just copy candidate spans from the prompt itself, which works surprisingly well for tasks like summarization, code editing, or retrieval-augmented generation where chunks of the output are likely to appear verbatim in the input. These are easy to bolt on because they require no additional weights, though they generally accept fewer tokens than a well-trained draft model.
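Prompt lookup is simple enough to sketch in full. The version below is a minimal illustration with made-up names and defaults, not any framework's implementation. It takes the last few tokens generated so far, finds the most recent earlier place where the same n-gram occurred, and proposes whatever followed it as the draft. The target then verifies that draft exactly as it would verify a draft model's output.

```python
def prompt_lookup_draft(tokens, ngram_size=2, K=3):
    """Propose up to K draft tokens by matching the trailing n-gram earlier in `tokens`.

    `tokens` is the prompt plus everything generated so far.
    """
    if len(tokens) < ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan right to left so the most recent earlier match wins; the starting
    # index excludes the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            return tokens[start + ngram_size:start + ngram_size + K]
    return []  # no match: fall back to normal decoding for this step


history = "the quick brown fox jumps . the quick".split()
draft = prompt_lookup_draft(history)
```

Here the trailing "the quick" also appears at the start of the sequence, so the draft is the three words that followed it there. When nothing matches, the function returns an empty draft and the step costs nothing extra.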

A minimal mental model in code

You do not usually implement this yourself, since serving frameworks handle it, but a sketch of the verify loop makes the mechanics concrete:

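def argmax(row):
    # Index of the largest logit in one row
    return max(range(len(row)), key=row.__getitem__)

def speculative_step(target, draft, context, K):
    # `target` and `draft` are placeholder model objects, not a real library API.
    # `context` is a list of token IDs. This is the greedy variant; with sampling,
    # step 3 becomes the rejection sampling test and a rejected position is
    # resampled from the adjusted distribution instead of taking the argmax.

    # 1. Draft proposes K new tokens autoregressively (cheap)
    draft_tokens = draft.generate(context, max_new_tokens=K)

    # 2. Target scores all K positions in ONE forward pass.
    #    Assumed to return one logits row per input position, where row j
    #    predicts the token that follows input position j.
    target_logits = target.forward(context + draft_tokens)  # one pass
    first = len(context) - 1  # row that predicts draft_tokens[0]

    # 3. Accept the longest prefix the target agrees with
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if argmax(target_logits[first + i]) == tok:  # greedy acceptance test
            accepted.append(tok)
        else:
            break

    # 4. The target's own prediction at the first rejection (or after the last
    #    draft token, if all K were accepted) is a free token
    accepted.append(argmax(target_logits[first + len(accepted)]))

    return accepted  # 1..K+1 tokens from a single target pass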

The important line is the single target.forward call covering all K positions. That one pass is where the saving comes from, because its cost is dominated by loading weights once, not by the number of positions evaluated.

When speculative decoding helps and when it does not

Speculative decoding is most valuable for latency-bound, low-batch workloads: a single user waiting on a response, a voice agent that needs to start speaking quickly, a coding assistant completing a function. In these cases the GPU has plenty of spare compute on each decode step, and speculative decoding puts that idle compute to work checking guesses.

It helps less, and can occasionally hurt, when the system is already running at large batch sizes for maximum throughput. When a server is batching many requests together, the decode step is no longer underutilized; the spare compute that speculative decoding relies on is already being used to process other requests. Verifying speculative tokens then competes for the same resources, and the extra draft work can reduce total tokens-per-second across the batch. Production systems often enable speculation adaptively, using it when batch sizes are small and backing off when the server is saturated.
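An adaptive policy of this kind can be very small. The sketch below is hypothetical: the function name, the thresholds, and the scaling rule are all invented for illustration and are not taken from any serving framework. It turns speculation off when the batch is large or recent drafts have mostly been rejected, and otherwise scales the draft length with the recent acceptance rate.

```python
def choose_draft_length(batch_size, recent_accept_rate,
                        max_k=5, batch_limit=32, min_accept_rate=0.4):
    """Return how many tokens to draft this round; 0 means plain decoding.

    All thresholds are illustrative defaults, to be tuned per deployment.
    """
    if batch_size >= batch_limit:
        return 0  # server is saturated: no idle compute left to spend on guesses
    if recent_accept_rate < min_accept_rate:
        return 0  # drafts are mostly rejected: the draft work is wasted
    # Draft more aggressively when recent drafts have been landing.
    return max(1, round(max_k * recent_accept_rate))


single_user = choose_draft_length(batch_size=1, recent_accept_rate=0.8)
saturated = choose_draft_length(batch_size=64, recent_accept_rate=0.9)
```

With these defaults a single stream with good acceptance drafts four tokens per round, while a saturated server skips speculation entirely.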

The other practical cost is engineering. Two-model setups need extra memory for the draft model and careful management of two KV caches. Medusa and Eagle need a training step to fit the extra heads or draft module. None of this is exotic anymore, since most major serving stacks support one or more of these methods, but it is real work to set up and tune.

Where this fits

Speculative decoding is one of the more satisfying optimizations in LLM serving because it improves latency without asking you to give anything up on output quality. The text is the same; it just arrives faster. That is a different bargain than quantization or distillation, and it composes cleanly with them: you can quantize the target model and still run speculative decoding on top.

For interactive products, where the latency of each token is something a user actually feels, this matters a great deal. Voice agents, real-time coding tools, and agentic systems that chain many model calls all live or die on per-token speed, and a 2x to 3x reduction in generation latency changes what those products can do.

General Compute runs inference infrastructure built for exactly these latency-sensitive workloads, with speculative methods and other decode optimizations applied under the hood, behind an OpenAI-compatible API so you can point an existing app at it without rewriting anything. If your application is the kind where token latency is the thing users feel, it is worth benchmarking against your current setup. You can read more in the docs and run the numbers on the workload you actually serve.