Cascade Inference: Using Small Models to Route to Big Ones
Most LLM workloads have a large head of queries that a 7B model could handle perfectly well, mixed with a long tail of genuinely hard queries that need a 70B or frontier model. Serving everything with the biggest model is the simplest design and also the most wasteful one. A cascade turns that waste into savings by trying a small model first, checking whether the answer is good enough, and only promoting to a bigger model when the small one falls short.
The canonical reference here is FrugalGPT by Chen, Zaharia, and Zou at Stanford, published in 2023. They showed that a well-tuned cascade across multiple commercial APIs could match GPT-4 quality on several benchmarks while cutting cost by up to 98 percent. The paper laid out three ideas: prompt adaptation (compressing or pruning prompts), LLM approximation (caching and distillation), and LLM cascade (the pattern we will focus on here). The cascade was the most interesting of the three because it generalizes to any heterogeneous fleet of models, whether those are open-weight models of different sizes, third-party APIs at different price points, or a mix.
The basic structure is simple. What makes cascades work or fail in production is the quality of the scoring function that decides whether an answer is good enough to accept, and the latency you pay when the cascade misses and has to escalate. This post walks through how cascades work, how they compare to upfront routing, where they help, where they hurt, and how to build one without creating more problems than you solve.
The Cascade Pattern
A cascade is a pipeline of models ordered from cheapest to most expensive. Each stage tries to answer the query. After each stage, a scoring function inspects the answer and decides whether to accept it or pass the query to the next stage. If no stage accepts, you either return the best answer collected so far or escalate to the final stage unconditionally.
The FrugalGPT setup uses three commercial APIs as the stages. In an open-weight setting, the same pattern might look like Qwen 2.5 3B, Qwen 2.5 7B, Qwen 2.5 72B, with the tiny model catching simple lookups, the mid-size model handling most reasoning, and the large model reserved for the queries where both smaller models fail.
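In code, the control flow is a short loop. The sketch below is a minimal illustration, not FrugalGPT's implementation: `stages` is assumed to be a list of callables ordered cheapest to most expensive, and `score` is whatever scoring function you choose, with a hypothetical accept threshold.

```python
# Minimal cascade sketch. `stages` and `score` are placeholders for your
# own model calls and scoring function; the threshold is illustrative.

def run_cascade(query, stages, score, threshold=0.8):
    """Try each stage in order; accept the first answer that scores well."""
    for stage in stages[:-1]:
        answer = stage(query)        # call the model for this stage
        if score(query, answer) >= threshold:
            return answer            # accepted: stop escalating
    # The final stage is unconditional: return its answer even if the
    # scorer would reject it, rather than failing the request.
    return stages[-1](query)
```

Note the final stage bypasses scoring entirely; this is the graceful-fallback behavior discussed later, where returning the best available attempt beats returning nothing.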
The scoring function is the interesting part. A few options are common:
A separate small model trained as a judge. The judge sees the query and the candidate answer and outputs a confidence score. This works, but it adds its own cost and latency, and you have to make sure the judge is cheaper than the next stage in the cascade; otherwise you would be better off just running the next stage directly.
The log-probability of the generated answer from the model that produced it. A model that is confident in its answer assigns higher probability to the tokens it generated. This is cheap because you already have the logits. It is also noisy, since confidence does not always correlate with correctness, but it works well enough in many cases.
A verifier head or separate reward model. Some teams train a small classifier on (query, answer) pairs with ground-truth labels of "correct" or "incorrect" for their domain. This is the most accurate option when you have labeled data, and the worst option when you do not.
Heuristic checks for specific failure modes. For structured output tasks, you can parse the answer as JSON or SQL and reject anything that fails to parse. For code generation, you can run the code against unit tests. These checks are extremely cheap and very accurate for the narrow failures they catch, but they do not generalize to open-ended quality judgment.
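The two cheapest options above, log-probabilities and format checks, can each be sketched in a few lines. The threshold range in the comment is an illustrative assumption that must be tuned per model and task, not a published recommendation:

```python
import json

def mean_logprob_score(token_logprobs):
    """Average log-probability of the generated tokens, as returned by
    most serving stacks when logprobs are requested. Higher means more
    confident; a workable accept threshold often lands somewhere around
    -0.3 to -0.5, but this is an assumption to validate on your data."""
    if not token_logprobs:
        return float("-inf")
    return sum(token_logprobs) / len(token_logprobs)

def json_parse_check(answer):
    """Heuristic check for structured-output tasks: reject anything that
    is not valid JSON. Cheap and precise for this narrow failure mode."""
    try:
        json.loads(answer)
        return True
    except ValueError:
        return False
```

In practice these compose: accept only when the answer parses and the mean log-prob clears the threshold.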
The scoring function has to be substantially cheaper than the next stage, or the cascade stops saving money. A rule of thumb: the judge should cost at most 10 to 20 percent of the cost of running the next stage, including its own latency.
Routing as an Alternative
Cascades do their routing after generation. An alternative is to route before generation: classify the incoming query and send it to the right model directly, skipping the smaller stages when you already know they are going to fail.
RouteLLM (Ong et al., 2024) is the best-known example of this. They train a classifier on a mix of labeled query-model-outcome data and preference data, and use the classifier to decide whether a query goes to a strong model or a weak model. The classifier itself is small, often a fine-tuned encoder-only model such as BERT, and runs in a few milliseconds.
Routing has two advantages over cascades. First, latency is bounded. A cascade that escalates from small to medium to large pays the sum of those generation times in the worst case. A router that picks "large" upfront pays only the generation time of the large model plus the router itself. Second, routing does not require a scoring function that can judge answer quality, which is often the hardest component to build.
The disadvantage is that routing has to predict the right model from the query alone, without seeing the answer. That is a harder problem than judging an answer after the fact. Routing classifiers typically max out around 75 to 85 percent accuracy on the "easy vs hard" distinction, and the errors on the margin cost you quality.
In practice, many production systems combine both: a coarse router upfront that filters out queries that obviously need the big model (long context, complex reasoning chains, multi-step tool calls), followed by a cascade over the remaining queries that uses small models as the default and escalates when they fall short.
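A rough shape for that hybrid, with everything hypothetical: `looks_obviously_hard` stands in for whatever coarse filter you build (the markers and token cutoff here are made-up examples), and the model callables and threshold are placeholders.

```python
# Hybrid pattern: a cheap pre-filter sends obviously hard queries straight
# to the big model; everything else enters the cascade. All names and
# heuristics here are illustrative placeholders.

def looks_obviously_hard(query, context_tokens=0):
    """Crude upfront filter: very long context or multi-step phrasing
    skips the small stage entirely."""
    multi_step_markers = ("step by step", "first,", "then,", "tool:")
    return context_tokens > 8000 or any(m in query.lower() for m in multi_step_markers)

def hybrid_route(query, small_model, big_model, score, threshold=0.8, context_tokens=0):
    if looks_obviously_hard(query, context_tokens):
        return big_model(query)      # router path: no small-stage attempt
    answer = small_model(query)
    if score(query, answer) >= threshold:
        return answer                # cascade path: small model accepted
    return big_model(query)          # cascade path: escalate
```

The pre-filter only needs to catch the obvious cases; the scoring function still backstops everything that slips through.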
What the Trade-offs Look Like
The cost savings from a cascade depend entirely on the distribution of queries and the quality of the scoring function.
If 80 percent of your queries can be answered correctly by the small model, and your scoring function accepts 95 percent of the answers the small model gets right, then 76 percent of your queries are answered by the small model alone. The remaining 24 percent escalate. If the middle stage handles half of those (12 percent) and the big stage handles the rest (12 percent), you pay roughly (100 times small_cost) + (12 times mid_cost) + (12 times big_cost), since every query runs the small stage first. If the big model is 20x the cost of the small model and the mid model is 5x, this works out to roughly 4 units of cost per query (about 4.6 if the big-bound queries also pass through the mid stage on the way), versus 20 units if you ran every query through the big model. That is about 80 percent savings.
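The same arithmetic, written out per 100 queries. This follows the simpler accounting where big-bound queries skip the mid stage; judge cost is ignored.

```python
# Worked version of the cost arithmetic above, per 100 queries.
# Relative costs: small = 1 unit, mid = 5x, big = 20x.
small_cost, mid_cost, big_cost = 1.0, 5.0, 20.0

handled_by_mid = 12   # half of the 24 escalated queries
handled_by_big = 12   # the other half

cascade_cost = (
    100 * small_cost              # every query runs the small stage first
    + handled_by_mid * mid_cost
    + handled_by_big * big_cost   # assumes big-bound queries skip the mid stage
)
big_only_cost = 100 * big_cost

per_query = cascade_cost / 100               # 4.0 units per query
savings = 1 - cascade_cost / big_only_cost   # 0.80, i.e. 80 percent
```

Plugging in different stage costs or acceptance rates makes it easy to see when the cascade stops paying for itself.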
The same math shows why cascades break. If the scoring function is wrong, you either escalate too much (losing savings) or accept too often (losing quality). A scoring function at 70 percent precision usually erodes most of the cost benefit. A scoring function at 99 percent precision on the correctness of the small-model answer is hard to build, because it is almost as hard as generating the answer in the first place.
The latency story is worse than the cost story. Every query that escalates pays two or three sequential generations plus two or three scoring evaluations. If your small model generates an answer in 300ms and your judge adds 50ms, a query that escalates to the big model (say 1.5s of generation) ends up at 1.85s total instead of 1.5s. That is a 23 percent latency penalty on the escalated queries. For interactive workloads with a strict tail latency SLO, this latency penalty can outweigh the cost savings.
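The escalation penalty in that paragraph, as arithmetic:

```python
# Latency of an escalated query, in milliseconds, using the numbers above.
small_gen_ms = 300    # small model generates its (ultimately rejected) answer
judge_ms = 50         # scoring function evaluates it
big_gen_ms = 1500     # big model generates the final answer

escalated_ms = small_gen_ms + judge_ms + big_gen_ms   # 1850 ms total
penalty = escalated_ms / big_gen_ms - 1               # ~0.23, a 23 percent penalty
```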
Streaming makes cascades awkward. A cascade cannot easily stream tokens back to the client because it does not know if the answer will be accepted until generation finishes. You can stream from the final stage unconditionally, which works if the tail is small, but it defeats the purpose if the small model was supposed to handle 80 percent of traffic.
Where Cascades Make Sense
Batch workloads with loose latency requirements are the best fit. Document processing, bulk summarization, offline classification, and data pipelines all tolerate the extra latency of escalation and benefit directly from the cost savings. A 10x reduction in inference spend on a nightly batch job is real money.
High-volume, low-diversity workloads also fit well. If you are processing millions of support tickets, most of them are going to look similar to each other. The small model handles the common patterns, and the cascade only escalates on the unusual cases. Scoring functions are easier to build too, because you can train a classifier on a large labeled set from within your own domain.
Retrieval-heavy tasks can benefit when the small model is good enough given good context. If your RAG pipeline is retrieving accurate documents, a 7B model with those documents in context often matches a 70B model answering without them. A cascade here starts with the small model plus retrieved context, and only escalates when the small model hedges or declines to answer.
Where Cascades Hurt
Interactive chat with strict latency requirements is a poor fit. The latency penalty on escalated queries, combined with the streaming problem, makes cascades hard to justify over just picking a good model and serving it directly.
Agentic workloads with many sequential LLM calls also suffer. If each step in a ten-step agent goes through a cascade, the escalation overhead accumulates: at roughly 350ms of wasted small-model and judge time per escalated step, ten escalated steps add 3.5 seconds to the trajectory, which is usually worse than just using a slightly bigger model for every step.
Safety-critical applications where being wrong is expensive do not tolerate the inherent quality risk of accepting small-model answers. Even at 95 percent scoring precision, 5 percent of accepted answers will be wrong. For customer-facing medical advice, legal reasoning, or high-stakes decisions, the cheaper answer is not worth the occasional silent failure.
Practical Implementation Notes
If you are building a cascade, start by measuring the distribution of your queries. Label a few thousand representative queries with "small model correct" and "small model wrong" using a stronger model or human judgment. This gives you an upper bound on how much savings the cascade can produce.
Build the scoring function next. Try the cheapest thing first: log-probs from the generating model, maybe combined with a format check if you have structured output. If those do not hit the precision you need, train a small judge on your labeled data. Do not start with a judge unless you have the data to train it well.
Measure the scoring function on a held-out set, and track precision and recall separately, because they fail in different directions. A judge that rejects 30 percent of good answers costs you money (you escalate too often), while a judge that accepts 10 percent of bad answers costs you quality. Tune the accept threshold to optimize for whichever failure is more costly in your domain.
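One way to tune the threshold, sketched under the assumption that you have a held-out set of (judge_score, was_correct) pairs with ground-truth labels for the small model's answers. The precision floor is an illustrative parameter, not a standard value:

```python
# Pick the lowest accept threshold that meets a precision floor on a
# held-out set, maximizing the accept rate (and therefore the savings).

def tune_threshold(scored, precision_floor=0.95):
    """scored: list of (judge_score, was_correct) tuples.
    Returns (threshold, precision, accept_rate), or None if no
    threshold meets the floor (escalate everything in that case)."""
    for t in sorted({s for s, _ in scored}):
        accepted = [ok for s, ok in scored if s >= t]
        if not accepted:
            break
        precision = sum(accepted) / len(accepted)
        if precision >= precision_floor:
            return t, precision, len(accepted) / len(scored)
    return None
```

Scanning thresholds from low to high finds the most permissive accept rule that still meets the quality bar; flip the objective if recall is the binding constraint in your domain.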
Fall back gracefully. If the final stage of the cascade produces an answer that the scoring function also rejects, return it anyway rather than failing the request. The alternative is to have queries that no model in the pipeline can satisfy, which is almost always worse than returning the best available attempt.
Monitor the escalation rate over time. Query distributions drift, and a cascade that was saving 70 percent last quarter might only be saving 40 percent now. If the escalation rate creeps up, the scoring function probably needs retraining, or the small model needs fine-tuning on the new query types.
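A rolling-window tracker is enough to catch this drift. The class below is a minimal sketch; the window size, baseline, and tolerance are hypothetical values to replace with your own.

```python
from collections import deque

class EscalationMonitor:
    """Rolling escalation-rate tracker. Flags drift when the rate over
    the last `window` requests exceeds baseline + tolerance; all the
    default numbers here are illustrative assumptions."""

    def __init__(self, window=1000, baseline=0.24, tolerance=0.10):
        self.outcomes = deque(maxlen=window)
        self.baseline = baseline
        self.tolerance = tolerance

    def record(self, escalated):
        self.outcomes.append(bool(escalated))

    def rate(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0

    def drifted(self):
        """True when the rolling rate has crept above the tolerated band,
        a signal to retrain the scorer or fine-tune the small model."""
        return self.rate() > self.baseline + self.tolerance
```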
Commercial Products and Where the Field Is Going
Several companies have built routing and cascade products as commercial offerings. Martian, Unify, and Not Diamond each ship some variant of "one API that routes across many underlying models." RouteLLM (from the LMSYS team) is an open-source reference that teams can self-host. Major inference providers have also started shipping built-in routing features that pick the right-sized model for each query.
The research direction is converging on learned routers that are trained end-to-end on the joint objective of cost and quality, rather than hand-tuned pipelines. Recent work also explores using a single model with early-exit layers as an implicit cascade, getting some of the same savings without the operational overhead of managing multiple models.
At General Compute, we serve a wide range of model sizes on our ASIC infrastructure. Customers building cascades often run the small and large stages of their pipeline on our API and use the cost savings to afford a stronger final-stage model than they could otherwise justify. If you are thinking about a cascade for your own workload, the latency we deliver at each stage makes the trade-offs easier: a faster small model means less total latency on escalated queries, which shifts the break-even point in favor of the cascade. Take a look at the docs if you want to see per-token speeds across sizes.