Lookahead Decoding: Parallel Token Generation Without Draft Models
Speculative decoding has become the default way to accelerate autoregressive generation. The recipe is well known: run a small draft model to propose several tokens, then have the big model verify them in parallel. When the draft is good, you get multiple tokens per forward pass. When it is bad, you fall back to one token per pass. The catch is the draft model itself. You have to train or pick one, host it, keep its tokenizer aligned, and pay the memory cost of a second set of weights.
Lookahead decoding, introduced by Fu et al. at LMSYS in late 2023, gets you parallel token generation without a draft model. It uses the target model itself to fill multiple token positions per step, then verifies the guesses in the same forward pass. There is no second model, no separate training, no tokenizer alignment problem. You drop it into an existing serving stack and decoding runs faster on the same weights.
The idea sits on top of a classical numerical method called Jacobi iteration. It turns out that autoregressive decoding is structurally similar to a fixed-point problem, and Jacobi iteration is the textbook way to solve those in parallel. Lookahead decoding adapts the technique to language models, adds an n-gram cache to recycle work across steps, and folds the result into a single forward pass per iteration. This post walks through how it works, why it is faster than naive Jacobi, where it stops helping, and how it compares to draft-model approaches.
Why Decoding Is Sequential in the First Place
The standard autoregressive loop is sequential because each token depends on every token before it. Position t cannot be sampled until position t-1 has been chosen, embedded, and propagated through every transformer layer. There is no way to parallelize across the time dimension during generation, even though the prefill phase can process the entire prompt in one pass.
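The loop itself is only a few lines; the sequential dependency is the append inside it. A minimal sketch, with a hypothetical `model_step(tokens)` callable standing in for a full forward pass:

```python
def greedy_decode(model_step, prompt, max_new_tokens):
    # Standard autoregressive loop: one model call per generated token.
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # Position t cannot be computed until t-1 has been appended.
        tokens.append(model_step(tokens))
    return tokens
```

Every technique in this post is an attempt to get more than one token out of each trip around that loop.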
Speculative decoding sidesteps this by guessing. If you can produce a plausible sequence of k tokens cheaply (with a draft model), you can run the target model once on those k positions, check whether each guess matches what the model would have sampled, and accept the longest matching prefix. The verification is cheap because it is a single batched forward pass. The expensive part is the guessing, which is why the draft model has to be small.
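Under greedy decoding, the acceptance rule is a few lines. Assume `draft` holds the k proposed tokens and `target` holds the target model's greedy picks at those k positions plus one bonus position, all obtained from one batched forward pass (names are illustrative):

```python
def accept_tokens(draft, target):
    # target has len(draft) + 1 entries: the target's greedy pick at each
    # drafted position, plus its pick for the position after the last draft.
    n = 0
    while n < len(draft) and draft[n] == target[n]:
        n += 1
    # Longest matching prefix, plus the target's own token at the first
    # mismatch (or the bonus token if every guess matched).
    return draft[:n] + [target[n]]
```

Even a total miss yields one token (the target's correction), which is why the scheme never decodes slower than one token per pass.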
Lookahead decoding takes a different angle. Instead of guessing with a separate model, it uses the target model to refine its own guesses across iterations. Each step generates new candidate tokens for several future positions, and over several iterations those candidates converge to the true autoregressive sequence. The key insight is that the convergence can happen in parallel inside a single forward pass.
Jacobi Iteration Applied to Decoding
Jacobi iteration is a method for solving systems of equations in parallel. Given a fixed-point equation x = f(x), you start with an initial guess and apply f to all components at once, getting a new guess. You repeat until the guesses stop changing. The appeal is that every component update is independent, so the work parallelizes well.
Autoregressive decoding can be cast as a fixed-point problem, provided the decoding rule is deterministic. Define a window of n future token positions. Under greedy decoding, the "true" tokens for those positions satisfy y_i = argmax(model(x, y_{<i})) for each i, which is exactly a fixed-point equation. If you start with random or repeated guesses for y_1 through y_n, you can apply the model in parallel to all positions and get updated guesses. After enough iterations, the guesses converge to the actual autoregressive output.
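A toy version of the iteration, with a deterministic `step(context)` callable standing in for a greedy model call (in the real method all n positions are evaluated in one batched forward pass, not a Python loop):

```python
def jacobi_decode(step, prefix, n, max_iters=100):
    y = [0] * n  # arbitrary initial guesses for the n future positions
    for it in range(1, max_iters + 1):
        # Every position is updated from the PREVIOUS iterate at once;
        # the updates are independent, which is what parallelizes.
        y_new = [step(prefix + y[:i]) for i in range(n)]
        if y_new == y:  # fixed point: matches the autoregressive output
            return y, it
        y = y_new
    return y, max_iters
```

With a toy step like `lambda ctx: ctx[-1] + 1`, each position depends only on its immediate predecessor, so the window needs a full n update rounds to settle — the slow-convergence regime described next.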
The naive version of this (Jacobi decoding, proposed independently a few years earlier) runs the model n times in a row, each iteration refining all n positions. The cost per iteration is one forward pass over n positions, so the total cost is roughly the same as standard decoding when the sequence has fully converged. The wins come from positions that converge early. If the first three tokens settle after one pass instead of three, you got two tokens for free.
In practice, naive Jacobi only helps a little. The convergence is slow, and most positions need many iterations to lock in. The expected speedup on real workloads is closer to 1.1x than to anything dramatic.
The Lookahead Trick
Lookahead decoding is what you get when you take Jacobi and squeeze more parallelism into each forward pass. It packs two branches into a single call to the model:
The lookahead branch generates new guesses for a sliding window of W future positions across N parallel "trajectories." Each trajectory is its own Jacobi iteration in flight. With W positions and N trajectories, the lookahead branch contributes N times W token positions to the forward pass.
The verification branch takes a set of candidate n-grams (short token sequences) collected from previous iterations and checks them against what the model would actually generate. If a candidate n-gram matches, the corresponding tokens get appended to the output without further computation.
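Under greedy decoding, the acceptance logic looks like this; `argmax_next(tokens)` is a hypothetical callable returning the model's greedy next token, and in the real method every candidate is scored inside the same forward pass rather than in a loop:

```python
def verify(candidates, argmax_next, context):
    # Return the longest run of tokens, across all candidate n-grams,
    # that the target model fully agrees with given the current context.
    best = []
    for gram in candidates:
        ctx, accepted = list(context), []
        for tok in gram:
            if argmax_next(ctx) != tok:
                break
            accepted.append(tok)
            ctx.append(tok)
        if len(accepted) > len(best):
            best = accepted
    return best
```

A full match appends the whole n-gram at once; a partial match still banks the agreeing prefix.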
The combination runs in one forward pass over a carefully constructed attention mask. The mask makes the lookahead branch positions attend to the right context for their trajectory, and makes the verification branch positions attend to the candidate n-gram they are checking. The model effectively does several jobs at once: it advances Jacobi iteration on N trajectories, while simultaneously trying to land an n-gram match.
The n-gram pool is the part that turns lookahead from "Jacobi with extra steps" into something genuinely fast. As the lookahead branch generates new tokens for future positions, those tokens form short n-gram sequences. The pool collects them. On subsequent iterations, the verification branch pulls candidates from the pool that match the recent context. Many of these candidates are correct, especially in code, structured output, or repetitive natural language, where short token sequences recur within a generation.
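A minimal pool sketch, keyed on the token that precedes each n-gram; the data layout here is an assumption for illustration, not the LMSYS implementation's:

```python
from collections import defaultdict

class NGramPool:
    def __init__(self, n=4, max_per_key=8):
        self.n = n                    # length of stored n-grams
        self.max_per_key = max_per_key
        self.pool = defaultdict(list)  # context token -> list of n-grams

    def add(self, tokens):
        # Slide over freshly generated lookahead tokens and record the
        # n-gram that followed each token.
        for i in range(len(tokens) - self.n):
            key, gram = tokens[i], tuple(tokens[i + 1 : i + 1 + self.n])
            grams = self.pool[key]
            if gram not in grams:
                grams.append(gram)
                del grams[:-self.max_per_key]  # keep only the newest few

    def candidates(self, last_token):
        # Candidates to hand to the verification branch this step.
        return self.pool.get(last_token, [])
```

Lookups key off the most recently committed token, so a pattern only has to recur once in the generation before it becomes a cheap candidate.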
Why It Works on Language Models
The interesting empirical fact is that language model generations contain a lot of local repetition at the token level. Numbers, identifiers, common phrases, code constructs, and formatting tokens all reappear within a single output. The n-gram pool captures these and feeds them back as cheap candidates.
The other piece is that early positions in a Jacobi iteration window tend to converge fast. Once you have committed to a token at position t, the token at position t+1 is much easier to predict than it was when t was still uncertain. The lookahead branch exploits this: it keeps several trajectories in flight, and the model often gets a few of them right enough that they contribute usable n-grams to the pool.
The combination, lookahead branch plus verification branch plus n-gram pool, gets multiple tokens per step on average. The original LMSYS paper reports 1.5x to 2.3x speedups on standard decoding benchmarks for models from 7B to 70B, with no model modifications and no draft model.
What the Forward Pass Looks Like
Concretely, a lookahead decoding step takes input that is much wider than a normal decode step. A normal decode step processes 1 token (the most recent one) and reads the KV cache for the prefix. A lookahead step processes something like (N times W) plus (G times M) tokens, where N is the number of trajectories, W is the lookahead window size, G is the number of n-gram candidates, and M is the n-gram length.
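With illustrative settings in the range the reference implementation uses (these exact numbers are just an example, not tuned values), the per-step position count works out to:

```python
N, W = 5, 5  # parallel trajectories, lookahead window size
G, M = 4, 4  # candidate n-grams, n-gram length
positions = N * W + G * M  # lookahead branch + verification branch
print(positions)  # 41 token positions, versus 1 for a standard decode step
```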
Because the model is bandwidth-bound during decode, you are usually leaving compute on the table. Lookahead spends some of that idle compute. The forward pass is more expensive than a standard decode step in raw FLOPs, but the wall clock cost barely moves because the bottleneck is loading weights from HBM, not the matmul itself. As long as the extra positions fit within the time it takes to stream the weights through, they are nearly free.
There is a ceiling. If you push N and W high enough, you saturate the tensor cores and the forward pass starts costing real time. The optimal configuration depends on the model, the GPU, and the batch size. The LMSYS reference implementation tunes these parameters per model. On an A100 with a 7B model, typical settings are W around 5, N around 5, G around 3 to 5.
The attention mask is the trickiest part of the implementation. Each lookahead trajectory needs to see only its own previous-step guesses, plus the committed prefix. Each verification candidate needs to see only the prefix plus the candidate itself. Building the mask correctly, and getting FlashAttention or whatever kernel you are using to respect it, takes some plumbing. The original LMSYS code release is the cleanest reference.
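A shape-level sketch of the mask, assuming a layout of [committed prefix | trajectory blocks | candidate blocks]. Real kernels encode this far more compactly, and the true lookahead-branch structure (which mixes guesses from several past Jacobi iterations) is more intricate than this block-diagonal simplification:

```python
import numpy as np

def lookahead_mask(prefix_len, n_traj, window, gram_lens):
    # True = query position (row) may attend to key position (column).
    blocks = [prefix_len] + [window] * n_traj + list(gram_lens)
    total = sum(blocks)
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for size in blocks:
        rows = slice(start, start + size)
        mask[rows, :prefix_len] = True  # every block sees the committed prefix
        # Causal within the block; blocks never see each other.
        mask[rows, rows] = np.tril(np.ones((size, size), dtype=bool))
        start += size
    return mask
```

The invariant to test for is isolation: a trajectory must never attend into another trajectory or into a candidate, or the branches contaminate each other's logits.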
Comparison to Speculative Decoding
The two approaches solve the same problem with different mechanics. Speculative decoding outsources guessing to a small model. Lookahead outsources guessing to past iterations of the target model itself.
The practical tradeoffs:
Speculative decoding gets larger speedups (often 2x to 4x) when a good draft model is available. The draft has to be aligned with the target in tokenizer and behavior, and training or finetuning a draft adds engineering cost. Memory cost is real, since the draft model lives in GPU memory alongside the target.
Lookahead decoding has a lower ceiling on speedup (typically 1.5x to 2.3x on standard benchmarks) but requires no draft model and no training. You can apply it to a freshly downloaded checkpoint without any preparation. Memory overhead is tiny: the n-gram pool is a small in-memory data structure.
Speculative decoding wins on raw speed when you have already paid the draft model setup cost. Lookahead wins on operational simplicity, especially for models where no good draft exists, for short workloads where draft training is not worth it, or for serving fleets that want a single uniform decode path.
The two are not mutually exclusive. You can run lookahead inside the verification step of a speculative loop, or use lookahead as a fallback when the draft is unavailable for a particular model. In practice most production stacks pick one or the other based on what the team is willing to maintain.
Where the Wins Are Largest
Lookahead helps most on workloads with predictable local structure. Code generation, especially in languages with rigid syntax, sees the largest gains because the n-gram pool catches common patterns like indentation, keywords, and bracket pairs. Structured output (JSON, SQL, function calls) similarly benefits from short repeating sequences.
Natural language with high entropy, like creative writing or open-ended chat, sees smaller gains. The n-gram pool fills up with one-off phrases that rarely match, and most of the speedup comes from the lookahead branch alone. You might see 1.3x to 1.5x rather than 2x.
Long generations help more than short ones. The pool needs a few hundred tokens to warm up. For very short outputs (under 50 tokens), the overhead of building the lookahead context can outweigh the savings on the first few iterations.
Batch size matters too. Lookahead decoding works best at batch size 1 to 4. As the batch size grows, the GPU stops being bandwidth-bound and the extra positions in the lookahead branch start costing real time. By batch size 16 or 32, the speedup is often gone. This makes lookahead a good fit for low-latency serving (where small batches are the norm) and a poor fit for throughput-maximizing batch inference.
Implementation in Serving Stacks
By early 2026, lookahead decoding has been picked up by several serving frameworks. vLLM has an experimental lookahead path. SGLang ships it as a configurable decoding mode. Custom inference stacks at major providers commonly include some form of lookahead, often blended with speculative decoding.
The integration work is mostly about three things: building the right attention mask, managing the n-gram pool efficiently across requests, and tuning W, N, and G per model. Continuous batching adds complications because requests at different stages of generation need different lookahead configurations, and the masks have to compose correctly.
The simplicity advantage holds up in production. There is no draft model lifecycle to manage, no tokenizer alignment to verify, and no separate set of weights to update when the target model changes. A team running ten different open models can enable lookahead on all of them without doing per-model draft training.
At General Compute, we use a mix of techniques on different parts of our stack. Speculative decoding does the heavy lifting where we have well-tuned drafts. Lookahead decoding fills the gap on newer or less common models where investing in a draft is not justified yet. The combination keeps decode latency low across a wide range of workloads. If you are building something latency-sensitive on top of inference, our API exposes the result of these optimizations directly. Take a look at the docs to see what the per-token latencies look like for the models you are using.