# General Compute — full text

> Concatenated markdown for every page on www.generalcompute.com. Index and
> per-page links: https://www.generalcompute.com/llms.txt

# General Compute

> GPUs were built for graphics. We were built for inference. The world's fastest inference provider. Purpose-built ASICs, 1,000+ tokens per second, up to 7x faster than GPU-based competitors. $200 free credit on signup.

## Key facts

- API base URL: `https://api.generalcompute.com/v1`
- Protocol: OpenAI-compatible HTTPS
- Agent signup: https://docs.generalcompute.com/agent-signup
- Docs: https://docs.generalcompute.com
- Contact: founders@generalcompute.com

## For autonomous agents

Agents can sign up on their own — there is no human-only gate.

- Full signup walkthrough: https://docs.generalcompute.com/agent-signup
- OpenClaw integration: https://docs.generalcompute.com/openclaw
- Machine-readable skills: https://www.generalcompute.com/.well-known/agent-skills/index.json
- API catalog (RFC 9727): https://www.generalcompute.com/.well-known/api-catalog

## Performance

- Time to first token: under 300 ms
- Throughput: 1,000+ tokens per second
- Up to 7x faster than competitors
- Uptime SLA: 99.9%

## How we're faster and cheaper

**Faster:**

- Purpose-built AI accelerators, not general-purpose GPUs
- Software stack tuned end-to-end for inference
- Disaggregated architecture that scales each resource independently

**Cheaper:**

- $0.035/kWh energy vs $0.13 US average
- 7x more energy-efficient accelerators (17 kW vs 120 kW)
- Air-cooled — no expensive liquid cooling

## Products

- **API access** — REST API, OpenAI-compatible endpoints, single API key
- **Custom deployments** — dedicated infrastructure with SLAs and capacity guarantees
- **Bring your own model** — deploy your weights on our optimized stack

## Quick start

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.generalcompute.com/v1",
    api_key="your-gc-api-key",
)

response = client.chat.completions.create(
    model="gpt-oss-120b",
    stream=True,
    messages=[{"role": "user", "content": "Hello"}],
)
```

## Links

- Blog: https://www.generalcompute.com/blog
- Benchmarks: https://www.generalcompute.com/benchmarks
- Team: https://www.generalcompute.com/team
- Terms of Service: https://www.generalcompute.com/terms
- Privacy Policy: https://www.generalcompute.com/privacy
- Sitemap: https://www.generalcompute.com/sitemap.xml

---

# Agents portal

This page exists specifically for automated consumers.

## Resources

- Markdown version of this site: request any page with `Accept: text/markdown`
- Machine-readable skills: https://www.generalcompute.com/.well-known/agent-skills/index.json
- API catalog: https://www.generalcompute.com/.well-known/api-catalog
- Sitemap: https://www.generalcompute.com/sitemap.xml
- Docs: https://docs.generalcompute.com

## Machine rules

- Respect robots.txt. We allow AI crawlers on all public marketing pages.
- Set a descriptive User-Agent string so we can debug traffic quickly.
- Sign up programmatically at https://docs.generalcompute.com/agent-signup.

---

# General Compute developer resources

Everything you need to integrate with General Compute: the OpenAI-compatible API, the OpenAPI specification, authentication, webhooks, the MCP server, and official SDKs. Agents can sign up programmatically with no human gate.

## Documentation

- API documentation: https://docs.generalcompute.com
- API reference: https://www.generalcompute.com/api-reference
- OpenAPI specification: https://docs.generalcompute.com/openapi.json (landing: https://www.generalcompute.com/openapi)
- Authentication & API keys: https://docs.generalcompute.com/authentication (landing: https://www.generalcompute.com/auth)
- Webhooks: https://www.generalcompute.com/webhooks
- MCP server: https://www.generalcompute.com/mcp
- Agent signup: https://docs.generalcompute.com/agent-signup

## SDKs

- Python: `pip install generalcompute` — https://pypi.org/project/generalcompute/
- Node: `npm install generalcompute`

## Machine-readable resources

- https://www.generalcompute.com/llms.txt
- https://www.generalcompute.com/llms-full.txt
- https://www.generalcompute.com/.well-known/agent-skills/index.json
- https://www.generalcompute.com/.well-known/api-catalog
- https://www.generalcompute.com/.well-known/mcp
- https://www.generalcompute.com/sitemap.xml

## Quick start

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.generalcompute.com/v1",
    api_key="your-gc-api-key",
)

response = client.chat.completions.create(
    model="gpt-oss-120b",
    stream=True,
    messages=[{"role": "user", "content": "Hello"}],
)
```

---

# General Compute API reference

The General Compute API is OpenAI-compatible. Point any OpenAI SDK at the General Compute base URL and the rest of your code keeps working.

## Endpoint

- Base URL: `https://api.generalcompute.com/v1`
- Authentication: `Authorization: Bearer `
- Protocol: HTTPS, JSON request and response bodies
- Default model: gpt-oss-120b

## Example request

```bash
curl https://api.generalcompute.com/v1/chat/completions \
  -H "Authorization: Bearer $GC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss-120b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

## Links

- Full docs: https://docs.generalcompute.com
- OpenAPI spec: https://www.generalcompute.com/openapi
- Authentication: https://www.generalcompute.com/auth
- Developer hub: https://www.generalcompute.com/developers

---

# General Compute OpenAPI specification

The full OpenAPI 3 specification for the General Compute inference API lives on the docs domain. Use it to generate clients, validate requests, or feed it to an agent.

## Where to fetch it

- Canonical URL: https://docs.generalcompute.com/openapi.json
- Format: OpenAPI 3.1 JSON
- Content-Type: application/json

## Fetch with curl

```bash
curl https://docs.generalcompute.com/openapi.json \
  -H "Accept: application/json" \
  -o generalcompute-openapi.json
```

## Links

- API reference: https://www.generalcompute.com/api-reference
- Full docs: https://docs.generalcompute.com
- Developer hub: https://www.generalcompute.com/developers

---

# General Compute authentication

Every request to the General Compute API is authenticated with a Bearer token. Keys are issued per account, can be rotated at any time, and never expire on their own.

## Header format

```
Authorization: Bearer
```

Keep your API key server-side. Never embed it in client-side JavaScript.

## How to get a key

- Humans: https://docs.generalcompute.com/authentication
- Agents (programmatic, no human gate): https://docs.generalcompute.com/agent-signup
- New accounts get $200 in free inference credit.
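
Once a key is issued, one way to keep it server-side is to read it from the process environment when building request headers. The sketch below is illustrative only: the helper name `auth_headers` is hypothetical, and the `GC_API_KEY` variable name is borrowed from the curl example in the API reference, not mandated by the API.

```python
import os


def auth_headers(env=os.environ):
    """Build request headers from the GC_API_KEY environment variable.

    `env` is injectable so the helper can be exercised without touching
    the real process environment (an assumption made for this sketch).
    """
    key = env.get("GC_API_KEY", "").strip()
    if not key:
        # Fail loudly instead of sending an unauthenticated request.
        raise RuntimeError("GC_API_KEY is not set")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
```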

## Links

- API reference: https://www.generalcompute.com/api-reference
- Developer hub: https://www.generalcompute.com/developers

---

# General Compute webhooks

General Compute can push events to a URL you control. Use webhooks to react to billing changes, account updates, and long-running inference jobs without polling.

## Event categories

- Account: key created, rotated, revoked.
- Billing: invoice generated, credit threshold reached.
- Inference: long-running job completed or failed.

## Signing and retries

- Every delivery is signed with an HMAC-SHA256 header. Verify before trusting.
- Failed deliveries retry with jittered exponential backoff for up to 24 hours.
- Receivers must respond HTTP 2xx within 10 seconds or the delivery is failed.

## Links

- Full docs: https://docs.generalcompute.com
- Developer hub: https://www.generalcompute.com/developers

---

# General Compute MCP server

The Model Context Protocol gives MCP-aware agents a structured way to connect to General Compute.

## Endpoint

- Canonical URL: https://mcp.generalcompute.com
- Transport: streamable HTTP (JSON-RPC 2.0)
- Authentication: `Authorization: Bearer `

## Discovery descriptor

A small JSON descriptor lives at https://www.generalcompute.com/.well-known/mcp so agent-discovery tooling can resolve the canonical endpoint, transport, and docs URL in one fetch.

## Links

- Full docs: https://docs.generalcompute.com
- Developer hub: https://www.generalcompute.com/developers
- Authentication: https://www.generalcompute.com/auth

---

# General Compute SDKs

Official client libraries for the General Compute inference API. Both wrap the OpenAI-compatible HTTPS API and add typed helpers.

## Python

Install:

```bash
pip install generalcompute
```

Usage:

```python
from generalcompute import GeneralCompute

client = GeneralCompute(api_key="your-gc-api-key")

response = client.chat.completions.create(
    model="gpt-oss-120b",
    messages=[{"role": "user", "content": "Hello"}],
)
```

Package: https://pypi.org/project/generalcompute/

## Node

Install:

```bash
npm install generalcompute
```

Usage:

```javascript
import GeneralCompute from "generalcompute";

const client = new GeneralCompute({ apiKey: process.env.GC_API_KEY });

const response = await client.chat.completions.create({
  model: "gpt-oss-120b",
  messages: [{ role: "user", content: "Hello" }],
});
```

Package: https://www.npmjs.com/package/generalcompute

## Links

- API reference: https://www.generalcompute.com/api-reference
- Developer hub: https://www.generalcompute.com/developers
- Full docs: https://docs.generalcompute.com

---

# OpenClaw integration

If you are OpenClaw (or any OpenAI-compatible coding agent), you can swap your inference provider to General Compute and keep your existing tool calls working.

- Walkthrough: https://docs.generalcompute.com/openclaw
- Skill definition: https://www.generalcompute.com/.well-known/agent-skills/openclaw-connect/SKILL.md

## Settings

- `OPENAI_BASE_URL=https://api.generalcompute.com/v1`
- `OPENAI_API_KEY=`
- `temperature=0.2`, `max_tokens=900`, `stream=true`

Retry HTTP 529 with jittered exponential backoff (rare, >500 RPS only).

---

# Benchmarks

Real-world LLM inference benchmarks. Model: GPT-OSS-120B. All measurements include network overhead.

## Head-to-head vs Together AI

- Time to first token: **738 ms** (General Compute) vs **1,899 ms** (Together AI) — 2.6x faster
- End-to-end latency: **1.76 s** vs **8.05 s** — 4.6x faster

## Methodology

- Identical prompts sent to all providers simultaneously.
- Workload categories: short (50 tokens) and long (1,000 tokens).
- Metrics: TTFT, end-to-end latency, and pure generation rate.
- All measurements include network overhead.

## Links

- Full interactive benchmarks: https://www.generalcompute.com/benchmarks

---

# Infrastructure

We operate purpose-built inference clusters co-located with cheap renewable energy.

- Purpose-built AI accelerators (ASICs), not repurposed GPUs.
- Energy: $0.035/kWh vs US average of $0.13/kWh.
- 7x more energy-efficient accelerators (17 kW vs 120 kW per unit).
- Air-cooled — no liquid cooling overhead.
- Disaggregated architecture: each resource scales independently.

Reach out for custom deployments: founders@generalcompute.com.

---

# Coding agent workload profile

- Median completion: 8.6 s/request.
- Cost delta: 60% cheaper vs hyperscalers.
- 7x faster than Together AI on 1k-token multi-step patches.
- Input cost: $0.45 / 1M tokens. Output cost: $0.60 / 1M tokens.

## Recommended settings

- `temperature=0.2`, `max_tokens=900`, `stream=true`.
- Retry HTTP 529 with jittered exponential backoff (rare, >500 RPS only).

Works with Codex, Claude Code, OpenCode, Cursor, Aider, and any autonomous coding agent.

---

# Voice AI workload profile

- Median latency: 520–800 ms.
- Bidirectional streaming via WebRTC (beta).
- Silence detection threshold: 120 ms (adjust for non-English flows).
- Short contexts (<2k tokens) produce better conversational pacing.

Fall back to the text-only endpoint during rare voice-infra maintenance windows.

---

# Roadmap

- Active site: Paraguay hydro surplus.
- Next: Iceland and Quebec scouting.

## Playbook

1. Locate markets with stranded renewable power.
2. Deploy ASIC clusters, connect via redundant fiber, expose through the same API.
3. Repeat per geography; keep latency low by routing traffic regionally.

---

# Team

Primary operators to contact if you need rate-limit changes or escalation.

- Jason Goodison — infrastructure and roadmap — jason@generalcompute.com
- Sales hotline — +1 (415) 915-9370 (PT business hours)
- General contact — founders@generalcompute.com

---

# Demo

The interactive demo requires JavaScript and authenticated access.

- Live demo: https://www.generalcompute.com/demo
- Programmatic benchmarking: email founders@generalcompute.com for a token.

---

# Blog

39 posts on inference performance, model serving, and agent infrastructure.

- [Streaming for Agents: Why Partial Results Change the UX](https://www.generalcompute.com/blog/streaming-for-agents-why-partial-results-change-the-ux) — Streaming in agentic pipelines is not the same as streaming chat tokens. Partial tool calls, pipelined steps, and early cancellation change what the user experiences. (2026-05-16)
- [Parallel Tool Execution: How Fast Inference Enables Concurrent Agent Actions](https://www.generalcompute.com/blog/parallel-tool-execution-how-fast-inference-enables-concurrent-agent-actions) — Why running multiple tool calls in parallel changes the latency math of an agent, and how inference speed determines whether the parallelism is worth doing. (2026-05-15)
- [Agent Memory Systems: Balancing Context Length vs Retrieval Latency](https://www.generalcompute.com/blog/agent-memory-systems-balancing-context-length-vs-retrieval-latency) — How agents reconstruct memory between turns, and the latency trade-offs between long context, RAG, summarization, and KV cache reuse. (2026-05-12)
- [Building a Code Agent: Why Each Step Needs Sub-Second Inference](https://www.generalcompute.com/blog/building-a-code-agent-why-each-step-needs-sub-second-inference) — A practical breakdown of the latency budget inside a code agent, step by step, and why every link in the chain needs to land under a second to keep the loop usable.
  (2026-05-11)
- [ReAct, Reflexion, and Chain-of-Thought: The Inference Cost of Reasoning Patterns](https://www.generalcompute.com/blog/react-reflexion-and-chain-of-thought-the-inference-cost-of-reasoning-patterns) — Popular agent reasoning patterns are described as prompt techniques, but they are inference cost multipliers. Here is how ReAct, Reflexion, and Chain-of-Thought actually shape the bill and the latency. (2026-05-10)
- [Multi-Agent Architectures and the Inference Cost Explosion](https://www.generalcompute.com/blog/multi-agent-architectures-and-the-inference-cost-explosion) — Orchestrator and worker patterns make multi-agent systems easy to design and expensive to run. Here is where the inference cost actually goes, and what it means for the infrastructure underneath. (2026-05-09)
- [Tool Calling Latency: The Bottleneck No One Talks About](https://www.generalcompute.com/blog/tool-calling-latency-the-bottleneck-no-one-talks-about) — Function calling looks simple on paper, but the latency budget of a tool-using LLM is dominated by short structured generations that most serving stacks are not optimized for. This is what actually makes tool calls feel slow. (2026-05-08)
- [The Agentic Inference Tax: Why Agents Need 10x Faster Models](https://www.generalcompute.com/blog/the-agentic-inference-tax) — Agents make many sequential LLM calls per task, and each one pays the full latency of decoding. This post walks through how that compounds and why fast inference changes which agents are even viable. (2026-05-07)
- [Compiler-Level Optimizations for Inference: TorchInductor, Triton, XLA](https://www.generalcompute.com/blog/compiler-level-optimizations-for-inference) — How modern ML compilers turn Python model code into fused, fast kernels. A practical look at TorchInductor, Triton, and XLA, and the tradeoffs each one makes for inference.
  (2026-05-06)
- [Draft Model Selection for Speculative Decoding](https://www.generalcompute.com/blog/draft-model-selection-for-speculative-decoding) — Picking a draft model is the most consequential decision when deploying speculative decoding. A practical guide to acceptance rates, sizing, and the tradeoffs that decide whether you actually get a speedup. (2026-05-05)
- [The Attention Sink Phenomenon: Why the First Token Matters](https://www.generalcompute.com/blog/the-attention-sink-phenomenon-why-the-first-token-matters) — How attention concentrates on the first few tokens of every sequence, why naive sliding-window caching breaks long-context generation, and how StreamingLLM uses sink tokens to serve effectively unbounded streams. (2026-05-04)
- [Mixture of Experts at Inference Time](https://www.generalcompute.com/blog/mixture-of-experts-at-inference-time) — How MoE routing actually works during serving, why sparse activation makes large models cheaper to run per token, and what changes for the inference stack. (2026-05-03)
- [Tensor Parallelism vs Pipeline Parallelism for Model Serving](https://www.generalcompute.com/blog/tensor-parallelism-vs-pipeline-parallelism-for-model-serving) — How tensor and pipeline parallelism actually differ in production inference, when to use each, and why most serving stacks end up combining them. (2026-05-02)
- [Prefix Caching: Why Repeated Prompts Shouldn't Cost You Twice](https://www.generalcompute.com/blog/prefix-caching-why-repeated-prompts-shouldnt-cost-you-twice) — How prefix caching works in modern LLM serving stacks, why it changes the economics of long system prompts and RAG, and what to watch out for in production.
  (2026-05-01)
- [Distillation for Inference: How Smaller Models Learn From Larger Ones](https://www.generalcompute.com/blog/distillation-for-inference-how-smaller-models-learn-from-larger-ones) — A practical guide to knowledge distillation for production inference: what actually works, what to skip, and how to ship a smaller model without losing the behavior you cared about. (2026-04-30)
- [FP8 Training and Inference: The Precision Sweet Spot](https://www.generalcompute.com/blog/fp8-training-and-inference-the-precision-sweet-spot) — Why 8-bit floating point hits a different point on the accuracy/throughput curve than INT8, how E4M3 and E5M2 are used in practice, and what FP8 actually buys you in production serving. (2026-04-29)
- [Activation-Aware Quantization (AWQ) Deep Dive](https://www.generalcompute.com/blog/activation-aware-quantization-awq-deep-dive) — A close look at how AWQ picks salient weight channels, applies per-channel scaling, and why it consistently beats round-to-nearest 4-bit quantization for LLM inference. (2026-04-28)
- [Mamba and State Space Models: Inference Without Attention](https://www.generalcompute.com/blog/mamba-and-state-space-models-inference-without-attention) — How structured state space models like Mamba achieve constant-time per-token inference, and why the selective scan changes the trade-off space for long-context serving. (2026-04-27)
- [RWKV and Linear Attention: Recurrent Models as an Inference Shortcut](https://www.generalcompute.com/blog/rwkv-and-linear-attention-recurrent-models-as-an-inference-shortcut) — How RWKV and linear attention architectures collapse the per-token cost of generation to O(1), and what that means for serving long-context workloads. (2026-04-26)
- [Dynamic Batching Strategies: From Naive to Continuous to Iteration-Level](https://www.generalcompute.com/blog/dynamic-batching-strategies-from-naive-to-continuous-to-iteration-level) — Batching is the lever that turns idle GPU silicon into served tokens.
  This post walks through the evolution of batching for LLM serving, from one-at-a-time to static batches to request-level dynamic batching to iteration-level continuous batching, and shows where each strategy still leaves throughput on the floor. (2026-04-25)
- [Token Merging and Token Pruning for Faster Transformers](https://www.generalcompute.com/blog/token-merging-and-token-pruning-for-faster-transformers) — Attention cost grows with the square of sequence length. Token merging and token pruning shrink that sequence mid-network, trading a little accuracy for real speedups. Here is how ToMe works, how the idea extends to language models, and where it breaks down. (2026-04-24)
- [S3: Scheduling for Straggler Mitigation in LLM Serving](https://www.generalcompute.com/blog/s3-scheduling-for-straggler-mitigation-in-llm-serving) — In LLM serving, a single long-running request can stall everyone else sharing the same batch. S3 attacks that by predicting output length and scheduling around it. Here is what stragglers actually cost you, and how output-length-aware scheduling helps. (2026-04-23)
- [Chunked Prefill: Overlapping Compute and Communication](https://www.generalcompute.com/blog/chunked-prefill-overlapping-compute-and-communication) — Prefill pins the compute units while decode starves for memory bandwidth. Sarathi-Serve splits prefill into chunks and piggybacks decodes on them, keeping both resources busy in the same batch. Here is how it works and where the limits are. (2026-04-22)
- [Cascade Inference: Using Small Models to Route to Big Ones](https://www.generalcompute.com/blog/cascade-inference-using-small-models-to-route-to-big-ones) — FrugalGPT and its descendants show that most queries do not need the biggest model. We walk through the cascade pattern, routing classifiers, and the engineering trade-offs of sending easy work to cheap models and escalating only when needed.
  (2026-04-21)
- [Lookahead Decoding: Parallel Token Generation Without Draft Models](https://www.generalcompute.com/blog/lookahead-decoding-parallel-token-generation-without-draft-models) — Lookahead decoding from LMSYS speeds up autoregressive generation without requiring a draft model. We walk through the Jacobi iteration trick, the n-gram pool, and what the speedups actually look like in practice. (2026-04-20)
- [Disaggregated Prefill and Decode (Splitwise / DistServe)](https://www.generalcompute.com/blog/disaggregated-prefill-and-decode) — Prefill and decode have different compute profiles and clash when they share a GPU. Splitwise and DistServe separate them onto different hardware pools. We walk through why, how, and when it actually pays off. (2026-04-19)
- [KV Cache Compression: MLA and Beyond](https://www.generalcompute.com/blog/kv-cache-compression-mla-and-beyond) — DeepSeek's Multi-Head Latent Attention cuts the KV cache by an order of magnitude without giving up quality. We walk through MLA, how it compares to MQA and GQA, and the other compression techniques worth knowing. (2026-04-18)
- [Ring Attention: Scaling Context to Millions of Tokens](https://www.generalcompute.com/blog/ring-attention-scaling-context-to-millions-of-tokens) — Ring Attention distributes the attention computation across devices in a ring topology, overlapping KV transfer with compute so context length scales linearly with the number of GPUs. (2026-04-17)
- [Quantization for Inference: GPTQ, AWQ, SmoothQuant, and FP8](https://www.generalcompute.com/blog/quantization-for-inference-gptq-awq-smoothquant-fp8) — Quantization shrinks model weights from 16-bit to 4-bit or 8-bit, cutting memory usage and speeding up inference. Here's how the major techniques work and when to use each one.
  (2026-03-26)
- [Multi-Query and Grouped-Query Attention: Shrinking the KV Cache](https://www.generalcompute.com/blog/multi-query-grouped-query-attention) — MQA and GQA reduce the memory footprint of attention by sharing key-value heads across queries. A simple architectural change that makes inference dramatically faster. (2026-03-25)
- [Continuous Batching: The Orca Paper That Changed LLM Serving](https://www.generalcompute.com/blog/continuous-batching-the-orca-paper) — Before continuous batching, LLM servers wasted GPU cycles waiting for the slowest request in each batch. Orca's iteration-level scheduling fixed this with a 36x throughput improvement. (2026-03-24)
- [Medusa, EAGLE, and Sequoia: The Next Generation of Speculative Decoding](https://www.generalcompute.com/blog/medusa-eagle-sequoia-next-gen-speculative-decoding) — The original speculative decoding papers needed a separate draft model. Medusa, EAGLE, and Sequoia found ways to speculate faster, smarter, and without the extra model. (2026-03-24)
- [SGLang and RadixAttention: Smarter KV Cache Reuse](https://www.generalcompute.com/blog/sglang-and-radix-attention) — SGLang's RadixAttention stores KV cache in a radix tree, enabling automatic prefix sharing across requests. The result is up to 5x higher throughput for multi-turn and structured workloads. (2026-03-24)
- [Speculative Decoding: Getting 3x Speedups Without Changing the Model](https://www.generalcompute.com/blog/speculative-decoding-3x-speedups-without-changing-the-model) — Speculative decoding uses a small draft model to predict multiple tokens ahead, then verifies them all at once. The result is mathematically identical output, 2-3x faster. (2026-03-23)
- [PagedAttention and vLLM: Virtual Memory for LLM Serving](https://www.generalcompute.com/blog/paged-attention-and-vllm) — The PagedAttention paper solved the biggest memory waste problem in LLM serving by borrowing an idea from operating systems.
  Here's how it works and why vLLM became the default serving framework. (2026-03-22)
- [FlashAttention: How Tri Dao Made Attention 4x Faster](https://www.generalcompute.com/blog/flash-attention-how-tri-dao-made-attention-4x-faster) — FlashAttention rewrote the rules of transformer inference by treating attention as a memory problem, not a compute problem. Here's how it works and why it matters. (2026-03-21)
- [Build a Real-Time Voice AI Agent with General Compute](https://www.generalcompute.com/blog/build-a-real-time-voice-ai-agent) — A step-by-step tutorial for building a voice AI agent with sub-500ms response times. Plus: why General Compute is the only provider fast enough to use reasoning models in a voice pipeline. (2026-03-20)
- [How Coding Agents Depend on Inference Speed](https://www.generalcompute.com/blog/how-coding-agents-depend-on-inference-speed) — Coding agents make dozens of sequential LLM calls per task. Every millisecond of inference latency compounds across each step, making speed the single biggest infrastructure bottleneck for AI-powered developer tools. (2026-03-19)
- [Why Inference Speed is the New Moat](https://www.generalcompute.com/blog/why-inference-speed-is-the-new-moat) — Model quality has commoditized. The real competitive advantage in AI is how fast your infrastructure can deliver results. Inference speed is becoming the defining moat for AI-native products. (2026-03-18)

---

# Terms of service — agent summary

- Last updated: March 31, 2026
- Legal contact: legal@generalcompute.ai

## Key clauses

- Usage subject to platform TOS; scraping permitted for evaluation.
- API keys must not be shared or stored client-side.
- General Compute may rate-limit or terminate access for abuse.

Full terms: https://www.generalcompute.com/terms

---

# Privacy — agent summary

- Data contact: privacy@generalcompute.ai
- Scope: website and inference API

## Highlights

- Data types collected: account metadata, API usage metrics, support correspondence.
- No selling of personal data; processors limited to essential services.
- Request data deletion or export via privacy@generalcompute.ai.

Full policy: https://www.generalcompute.com/privacy

---

# Streaming for Agents: Why Partial Results Change the UX

> Streaming in agentic pipelines is not the same as streaming chat tokens. Partial tool calls, pipelined steps, and early cancellation change what the user experiences.

- Author: General Compute
- Published: 2026-05-16
- Tags: agents, streaming, inference, latency, ux, tool-calling
- Canonical: https://www.generalcompute.com/blog/streaming-for-agents-why-partial-results-change-the-ux

---

Streaming, in the chat product sense, is a solved idea. The model emits tokens one at a time, the client appends them to a textarea, and the user reads along while the model is still generating. The win is psychological: total generation time is unchanged, but the wait feels shorter, and a partially rendered answer is sometimes enough to let the user decide the answer is wrong and stop the generation.

Streaming in an agentic pipeline is a different problem. An agent is not just writing prose to a screen. It is calling tools, parsing structured output, deciding what to do next, and sometimes passing intermediate state to another model call. The "user" of the stream is often another piece of code, not a human eye. Once you accept that, a lot of options open up that chat-style streaming never needed: streaming a tool call's arguments while it is still being decoded, pipelining a downstream step against an upstream one, cancelling early when a partial result is already enough, and surfacing structured progress to the human watching the agent run.

This post is about what changes when you stream in an agentic system instead of a chat one, and why partial results are worth the complexity.

## What "streaming" means in different layers

Inference servers stream tokens. That part has not changed.
The OpenAI-style `stream: true` flag still produces a sequence of server-sent events, each carrying a delta. Anthropic's streaming format does the same thing with a different schema. Most other vendors follow one of these two shapes.

What has changed is what an agent does with those events. In a chat product, the consumer of the stream is a renderer. It concatenates deltas, runs them through a markdown parser, and paints them on a screen. The agent layer is invisible because there is no agent layer.

In an agentic system, the stream feeds at least three different consumers, often at the same time:

- A user interface, if there is a human watching, which wants something human-readable to display.
- A tool dispatcher, which is watching for the model to emit a tool call so it can start executing it.
- An orchestrator, which is deciding whether the model's output is good enough to move to the next step or whether it should be cancelled and retried.

Each of these consumers has a different definition of "useful partial result." The renderer wants tokens. The tool dispatcher wants a complete function name and a parseable arguments object. The orchestrator wants enough output to evaluate confidence. A well-built agent treats the stream as a multi-consumer event source, not as a string that is slowly getting longer.

## Partial tool calls

The most interesting use of streaming in agents is starting tool execution before the tool call is fully decoded.

When a model emits a tool call, it does not produce the function name and arguments atomically. It generates them as text, like everything else. The function name comes out token by token, then the arguments, which are usually JSON. With current models and current serving stacks, this can take anywhere from 50 ms to several seconds, depending on argument length.
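
To make that token-by-token shape concrete, here is a minimal sketch of a consumer that merges tool-call fragments as they arrive and fires a callback once a function name has finished decoding. It runs on simulated data and makes no network calls. The dict layout (`index`, `function.name`, `function.arguments`) is modeled on OpenAI-style tool-call deltas but is an assumption here, and `accumulate_tool_calls` and `on_name_ready` are hypothetical names chosen for illustration. Treating the first argument fragment as the signal that the name is complete is a heuristic, not something a server guarantees.

```python
import json

# Simulated stream. The dict shape mirrors OpenAI-style tool-call deltas,
# but it is an assumption made for this sketch, not a captured response.
SIMULATED_DELTAS = [
    {"tool_calls": [{"index": 0, "function": {"name": "web_search", "arguments": ""}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": '{"query": "fastest'}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": ' open source LLM"}'}}]},
]


def accumulate_tool_calls(deltas, on_name_ready):
    """Merge fragments per call index; announce each name once it is complete."""
    calls = {}         # index -> {"name": str, "arguments": str}
    announced = set()  # indices whose name has already been announced
    for delta in deltas:
        for frag in delta.get("tool_calls", []):
            index = frag["index"]
            call = calls.setdefault(index, {"name": "", "arguments": ""})
            fn = frag.get("function", {})
            call["name"] += fn.get("name") or ""
            args = fn.get("arguments") or ""
            # Heuristic: the first argument fragment means the name is done.
            if args and call["name"] and index not in announced:
                announced.add(index)
                on_name_ready(call["name"])
            call["arguments"] += args
    # Calls that never streamed any arguments are announced at end of stream.
    for index, call in calls.items():
        if index not in announced and call["name"]:
            on_name_ready(call["name"])
    return calls


warmed = []  # stands in for "start preparing this tool"
calls = accumulate_tool_calls(SIMULATED_DELTAS, warmed.append)
parsed = json.loads(calls[0]["arguments"])  # only valid once the stream ends
```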
If your agent waits for the full tool call before doing anything, decoding and tool execution run strictly back to back: first the model finishes generating, and only then does the tool start. If the tool itself is slow (a web search, a database query, a code execution sandbox), the user waits for the sum of the two.

There are two patterns that recover some of this time.

The first is **speculative dispatch on function name**. As soon as the function name is decoded but the arguments are still streaming, you can warm up the tool: open a database connection, load the model needed for the tool, fetch credentials. None of this depends on the arguments. By the time the arguments are fully decoded, the tool is already primed.

The second is **partial-argument execution** for tools that allow it. A web search tool whose argument is a query string can start tokenizing and embedding the query as soon as the first few tokens of the query are decoded. If the model decodes `"query": "fastest open source LLM"` token by token, you can begin the search index lookup at `"fastest open source"` and refine when the rest arrives. For tools where a result computed from partial input may be wrong but is cheap to compute and discard, this is worth it. For tools where partial input is destructive (a write, an email send, a payment), do not do this.

Both patterns require the inference server to actually stream tool call deltas. Some serving stacks do, some do not. The OpenAI Chat Completions API has supported tool call deltas for a while now, and Anthropic's streaming format includes incremental input JSON deltas for tool use blocks. If you are running open models behind vLLM or SGLang, check that the tool-calling parser is configured to emit deltas, not whole calls.

## Pipelining agent steps

A multi-step agent looks like a small DAG. Step 1 produces output. Step 2 consumes that output and produces its own. Step 3 consumes step 2's output.
In the simplest implementation, step 2 waits for step 1 to finish, step 3 waits for step 2, and the user waits for the whole chain.

When step 1 streams, you can sometimes start step 2 earlier. The catch is that you need to know which prefix of step 1's output is enough.

Consider a plan-then-execute agent. Step 1 produces a numbered list of subtasks. Step 2 is "for each subtask, dispatch a worker". If step 1 streams its plan, step 2 can start dispatching workers as soon as the first numbered item finishes streaming, without waiting for the whole plan.

This is straightforward when the output structure is line-oriented. It gets harder when the downstream step needs to reason globally about the upstream output. A summarizer that picks the three most important items from a list of ten cannot start until it has seen all ten. A coder that writes a function based on a spec cannot start before the spec is complete. For those cases, streaming saves the user-visible latency for the first step but does not pipeline anything underneath.

The pattern worth borrowing from systems work: treat each agent step as having a "minimum prefix" that downstream consumers depend on. If a downstream consumer can run with a prefix, run it on the prefix. If not, do not pretend that streaming helps; it just gives the user something to look at.

## Streaming structured output

Most production agents output some kind of structured data. JSON, YAML, function arguments, structured tool calls. The naive approach is to wait for the whole blob, parse it, and act. With streaming, you can do better, but partial JSON is its own problem.

Partial JSON is not valid JSON. `{"name": "ali` is not parseable. There are a few approaches that work in practice.

The first is a **partial JSON parser** that builds up the tree as tokens arrive and exposes the latest valid prefix. Libraries like `partial-json` for TypeScript and the equivalent in Python implement this.
When you ask for the parsed object, you get the deepest interpretable structure: missing keys are omitted, unterminated strings are surfaced as-is. You can poll this on every delta and decide whether enough of the structure is present to act.

The second is **constrained decoding** at the model level. If you have control over the inference stack, you can constrain the model to emit valid JSON token by token, with grammar enforcement (xgrammar, llguidance, outlines, lm-format-enforcer). At every step, the output is well-formed, which means the partial JSON parser does not have to handle most edge cases. This also tends to be faster, since the model is not wasting tokens on syntax recovery.

The third is to **stream the keys in a known order**, so the consumer can rely on positional structure. If your function signature guarantees that `query` is always emitted before `filters` and `filters` before `limit`, your consumer can act on `query` the moment the next key appears, without rebuilding the object from scratch.

The combination of constrained decoding and a partial parser is the production-grade choice. It is rarely the default; you have to set it up.

## Cancellation and correction mid-stream

Streaming makes it possible to stop a generation that is going wrong. In chat, this is the user smashing the stop button. In an agent, the orchestrator can do the same thing, automatically, when a partial result is evidence the model is heading off the rails.

A few examples where this is worth doing:

- The model emits a tool name that does not exist. Stop, do not let it finish hallucinating arguments. Retry with a tool-listing hint.
- The model starts emitting a long reasoning chain in a step that was supposed to be a one-shot answer. Stop, retry with a stricter system prompt.
- The model's confidence proxy (token logprobs, a classifier on the partial output, a small validator model) drops below a threshold. Stop, escalate to a larger model.

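
The first case in the list above (an unknown tool name) can be sketched as follows. This is an illustration under stated assumptions, not a real client: `fake_stream` stands in for an SSE connection, `KNOWN_TOOLS` is a hypothetical registry, and closing the generator plays the role of dropping the connection.

```python
from typing import Generator, Iterable

KNOWN_TOOLS = {"search_docs", "search_code"}  # hypothetical tool registry


def fake_stream(tokens: Iterable[str], log: list) -> Generator[str, None, None]:
    """Stand-in for an SSE stream; logs every token the 'server' generated."""
    try:
        for tok in tokens:
            log.append(tok)
            yield tok
    finally:
        log.append("<closed>")


def watch_tool_name(stream: Generator[str, None, None]) -> tuple[str, bool]:
    """Read name tokens until '(' and stop early if no known tool can match."""
    name = ""
    try:
        for tok in stream:
            if tok == "(":
                return name, name in KNOWN_TOOLS
            name += tok
            if not any(tool.startswith(name) for tool in KNOWN_TOOLS):
                return name, False  # cancel: this prefix matches no tool
        return name, name in KNOWN_TOOLS
    finally:
        stream.close()  # analogous to closing the connection


log: list = []
tokens = ["search", "_web", "(", '{"q": ', '"x"}', ")"]
name, ok = watch_tool_name(fake_stream(tokens, log))
print(name, ok, log)
```

Here the watcher gives up after two tokens, so the argument tokens are never generated. How quickly a real server stops after a disconnect is a separate question that has to be measured.
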
This is only useful if cancellation is cheap. With most inference servers, you cancel by closing the SSE connection, and the server stops generating shortly after. The exact behavior depends on the server: vLLM and SGLang both honor client disconnects in recent versions, but the latency from disconnect to actual stop varies. Production agents with cancellation logic should measure this on their own stack rather than trusting docs.

Cancellation also matters for cost. A model that is generating 2000 tokens of reasoning before noticing it has the wrong tool is wasting both wall time and money. An orchestrator that watches the stream and cancels at token 200 saves both.

## Streaming progress to humans

A long-running agent that takes 30 seconds to a few minutes per task has a UX problem that chat does not. The user is staring at a spinner. They do not know if the agent is making progress, stuck in a loop, or about to produce something useful.

Streaming the agent's internal state to the user is a partial fix. The pattern that has emerged in coding agents and research agents is to surface a structured progress event for each agent step: which tool is being called, what the current plan looks like, what files have been touched. This is not the same as token streaming; the events are higher-level. But they typically piggyback on the same underlying connection, and they are only possible because the model is streaming its decisions, not batching them into one final answer.

If you are building an agent UI, the question to ask is not "should I stream tokens?" but "what is the smallest useful unit of progress I can show the user?" Sometimes that is a token. Sometimes it is a tool name. Sometimes it is "step 3 of 7 complete." A mix of all three, with the right one chosen for the right step, is what feels responsive.

## Where inference speed matters most

Faster per-token inference helps streaming agents in three different places.

First, it shortens the absolute time from the start of a step to the first useful partial result. If your agent uses partial JSON parsing to start a downstream tool early, faster decoding means that downstream tool starts earlier in wall time.

Second, it makes cancellation cheaper. If a step that turns out to be wrong takes 500 ms instead of 5 seconds, the cost of a cancelled generation drops by an order of magnitude. Cancellation-based retry strategies are only viable when retries are fast.

Third, it changes the design space of multi-step agents. When each LLM call is a fraction of a second, you can afford more steps, more tool calls, more validation passes, all happening with streaming pipelines between them. The agent stops looking like a sequence of slow blocking calls and starts looking like a real pipeline, with each stage running concurrently with the next.

This is the angle that matters for production work. Streaming is not a UI trick to make a chat product feel faster. It is the substrate that makes agentic pipelines compose well, and the faster the underlying inference, the more aggressive your pipeline design can be.

If you are running agents on a stack that streams tokens at hundreds or low thousands per second, the patterns in this post are options. On a stack that runs at tens of tokens per second, most of them collapse back to "wait for the model to finish." That is the part worth measuring before you design your agent around streaming.

General Compute serves open models with very high tokens-per-second on an OpenAI-compatible API, including streaming tool call deltas. If you are building an agent and want to test what your pipeline looks like when streaming is fast enough to pipeline against, the API is at [generalcompute.com](https://generalcompute.com).


---

# Parallel Tool Execution: How Fast Inference Enables Concurrent Agent Actions

> Why running multiple tool calls in parallel changes the latency math of an agent, and how inference speed determines whether the parallelism is worth doing.

- Author: General Compute
- Published: 2026-05-15
- Tags: agents, tool-calling, parallelism, inference, latency
- Canonical: https://www.generalcompute.com/blog/parallel-tool-execution-how-fast-inference-enables-concurrent-agent-actions

---

The standard mental model of an agent is a loop: the model emits a tool call, the tool runs, the result comes back, the model emits the next call. That loop is sequential by construction. Each step waits on the one before it. If the model takes a second to think and the tool takes half a second to run, eight steps cost twelve seconds before anyone sees an answer.

Most modern model APIs let the model emit several tool calls in a single response. The agent runtime is then free to dispatch those calls concurrently and collect the results before going back to the model. This is called parallel tool execution, and it sounds like a free win. In practice the win depends almost entirely on how fast your inference is, because the model's decision to fan out is itself an inference call, and the rejoin step is another inference call. Fast inference is what makes the fan-out worth doing.

This post walks through where the latency actually lives in a parallel tool-calling agent, the design patterns that show up in production, and the failure modes that kill the speedup if you are not careful.

## The shape of a parallel tool call

A model that supports parallel tool calls returns a response that looks roughly like this:

```json
{
  "tool_calls": [
    { "name": "search_docs", "arguments": { "query": "rate limits" } },
    { "name": "search_code", "arguments": { "query": "RateLimiter" } },
    { "name": "get_user_settings", "arguments": { "user_id": "u_123" } }
  ]
}
```

The runtime sees three independent calls.
Nothing in their arguments depends on the others, so it dispatches them at the same time. When all three finish, the runtime appends their results to the conversation and sends it back to the model. The model now has three pieces of evidence at once instead of having had to ask for them across three round trips.

The sequential version of this same task would have looked like:

```
Turn 1: model asks for docs
Turn 2: docs returned, model asks for code
Turn 3: code returned, model asks for settings
Turn 4: settings returned, model writes the answer
```

Four model calls instead of two, and three sequential tool waits instead of one. If each model call takes 800 milliseconds and the tools take 300 milliseconds each, the sequential version costs around 4.1 seconds (four model calls plus three tool waits) and the parallel version costs around 1.9 seconds (two model calls plus one overlapped tool wait). The savings come from two places: fewer round trips to the model, and tool latencies that overlap instead of stacking.

## Where the speedup actually comes from

People often describe parallel tool calls as if the win comes from the tools themselves running faster. That is half the story. The bigger win, in most agent workloads, is collapsing the number of model calls.

Every tool call in a sequential agent is bracketed by an inference call on each side. The model has to read the prior tool result, decide what to do next, and emit a new call. That decision step is pure inference latency, and it does not get faster if you make the tools faster. A loop of eight sequential tool calls is eight inference decisions. A fan-out of eight parallel tool calls is one inference decision plus one rejoin call. Two inference passes instead of eight.

This is why inference speed matters so much for parallel agents. If inference is slow, you pay through the nose for each model decision and the fan-out savings get diluted. If inference is fast, the model decisions are cheap enough that you can afford to plan the parallelism on the fly and rejoin quickly.

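
As a rough check on the arithmetic, the two regimes can be written as a tiny latency model. This is a sketch under simplifying assumptions (a fixed per-call model latency, no network overhead), and both function names are made up for illustration.

```python
def sequential_latency(model_s: float, tool_latencies: list[float]) -> float:
    """One model call before each tool, plus a final call to write the answer."""
    return (len(tool_latencies) + 1) * model_s + sum(tool_latencies)


def parallel_latency(model_s: float, tool_latencies: list[float]) -> float:
    """One planning call, tools overlapping, then one rejoin call."""
    return 2 * model_s + max(tool_latencies)


# Figures from the three-tool example: 800 ms per model call, 300 ms per tool.
tools = [0.3, 0.3, 0.3]
seq = sequential_latency(0.8, tools)
par = parallel_latency(0.8, tools)
print(f"sequential: {seq:.1f} s, parallel: {par:.1f} s")
```

Changing `model_s` shows how much of each total is inference rather than tool time, which is the point the section above makes.
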
The rejoin call is interesting on its own. When eight tool results come back at the same time, the model has to read all eight before producing the next step. That prompt is now longer than it would have been in the sequential case, because in the sequential case the model only ever read one tool result at a time. Prefill cost on that combined prompt is part of the latency budget for the rejoin step. Fast prefill matters here in the same way fast decode matters for the planning step.

## What the model has to do to fan out

The model cannot fan out into parallel calls by accident. It has to recognize that the calls are independent. That is a skill that varies a lot across models and across prompts.

The clearest case is the one where the user asks for several things that obviously do not depend on each other. "Find the docs page for rate limits, look up our existing rate limiter implementation, and tell me the user's quota." A capable model will pattern match this as three independent retrieval calls and emit them in one response. Less capable models will still emit them sequentially even though the API supports parallelism, because their training distribution did not contain enough examples of fan-out tool use.

The less clear case is when the model has to plan the parallelism. The user says "fix this bug." The agent has to decide whether to first look at the failing test, then look at the file, then look at git blame, or whether to ask for all three at once. A smart, fast model will fan out because the three look-ups are independent and the cost is the same either way. A weaker model will play it safe and ask for one at a time.

This is one of the places where the underlying model's training matters. Anthropic, OpenAI, and several of the open models have leaned into parallel tool calling in their post-training. Models that have not been trained for it will technically support it through the API but will rarely use it.


## The dependency problem

Parallel tool calls only work when the tools are actually independent. If call B's arguments depend on call A's result, you cannot run them at the same time. The agent has to recognize that dependency and serialize those two calls.

Sometimes dependencies are obvious. "Get the user, then update the user." Sometimes they are not. "Search for an error, then check whether it is in our logs." If the search returns specific error IDs and the log lookup needs those IDs, the second call cannot start until the first finishes. A model that fans those out will end up with a useless second call that queries on stale or empty inputs.

Most production agent frameworks let the model express the dependency. The model emits a call with a placeholder that the runtime fills in from the first result. Or the runtime parses the model's plan and notices that one call references the output of another. Or, more commonly, the runtime just trusts the model: if the model emits two calls in one response, the runtime assumes they are independent. If they were not, the model would have emitted them sequentially.

The trust model is fine when the model is right, and bad when it is wrong. Wrong fan-out shows up as silent failure: a tool call ran on the wrong input, returned a plausible but irrelevant result, and the agent kept going as if everything was fine. This failure mode is hard to detect without good logging because nothing throws an error.

## Tool latency variance and the straggler problem

Parallel tool execution is bound by the slowest call in the batch. If you fan out six calls and five return in 100 milliseconds while one takes two seconds, the model is waiting two seconds before it can rejoin. The average latency went down. The tail latency did not.

This is a familiar problem from distributed systems. Stragglers dominate the latency of fan-out workloads, and the more calls you fan out, the worse the tail gets.
The fix in distributed systems is hedging: send duplicate requests after a timeout, take whichever returns first. Hedging works for idempotent tool calls, like reads, and is dangerous for non-idempotent ones, like writes.

There are softer mitigations that show up in agent runtimes. Speculative dispatch starts secondary calls before the model has confirmed they are needed, based on a guess from the agent runtime. Result streaming sends partial tool results back to the model as soon as the first call returns, so the model can start reasoning while the others finish. Tool call timeouts bound the worst case at the cost of returning incomplete data to the model. Each of these has its own complexity cost.

The simplest improvement is on the tool side. Tools that have predictable latency distributions, narrow tail variance, and clear timeout behavior are friendlier to parallel execution than tools that vary wildly. If you control the tools, this is worth optimizing for. Capping individual tool latency at 1 to 2 seconds, and returning a graceful error past that, makes parallel agent execution far more predictable.

## Cost dynamics

Parallel tool calls do not change the total inference cost much. The model still has to read every tool result, just bundled together instead of one at a time. The total prefill tokens across the agent are similar in the two regimes, sometimes slightly higher in the parallel case because the rejoin prompt has to fit all the results at once.

Where parallel execution shifts cost is on the tool side. If a tool has a per-call overhead, you pay it more often when you fan out aggressively. If a tool is rate limited, parallelism can saturate the rate limit faster. If a tool is paid per call, you might end up issuing redundant calls that the agent would have skipped if it had been forced to read each result before deciding.

This last failure mode is worth watching for. Sequential agents implicitly prune their own work.
They see the first result, realize they do not need the second, and skip it. Parallel agents commit to a batch of calls before seeing any of the results. The model fanning out three calls might have only needed one if it had been forced to wait. The cost of that waste is the price of the saved latency, and whether it is worth paying depends on your unit economics.

## How fast inference changes the trade

In a slow-inference regime, parallel tool execution is appealing because each model call is expensive. Fewer model calls is a big win and the engineering complexity is worth it. But the planning step itself is slow, which means the model spends a lot of time deciding what to fan out, and the rejoin step is slow, which means the agent stalls on every batch.

In a fast-inference regime, parallel tool execution becomes more powerful in a different way. The planning step is cheap, so the model can afford to plan complex fan-outs and revise them. The rejoin step is cheap, so the model can quickly process a batch of results and immediately fan out the next wave. Multi-wave parallel agents become viable: fan out, rejoin, fan out again, rejoin again, all within the latency budget that a sequential agent would have spent on the first wave alone.

The other thing fast inference unlocks is the ability to fall back to sequential when parallelism is risky. If your inference is fast enough that sequential is cheap, you do not have to lean as hard on parallelism for ambiguous cases. You only fan out when you are confident the calls are independent, and you serialize otherwise. The agent ends up safer and the user does not notice the difference.

Slow-inference parallel agents are forced to fan out aggressively just to keep the latency budget under control. Fast-inference parallel agents fan out when it is right and serialize when it is right. That flexibility is the real product of fast inference, not just raw speed.

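
The fan-out with per-call timeouts described in the straggler discussion above can be sketched with `asyncio`. The tools are simulated with sleeps, and `call_tool`, `with_timeout`, and `fan_out` are hypothetical names; a real runtime would call actual tools and choose its own timeout.

```python
import asyncio


async def call_tool(name: str, delay_s: float) -> str:
    """Hypothetical tool: sleeps to simulate I/O, then returns a result."""
    await asyncio.sleep(delay_s)
    return f"{name}: ok"


async def with_timeout(name: str, delay_s: float, timeout_s: float) -> str:
    """Bound one call's latency and return a graceful error past the limit."""
    try:
        return await asyncio.wait_for(call_tool(name, delay_s), timeout_s)
    except asyncio.TimeoutError:
        return f"{name}: timed out after {timeout_s}s"


async def fan_out(calls: dict[str, float], timeout_s: float) -> list[str]:
    """Dispatch all calls concurrently; results come back in call order."""
    tasks = [with_timeout(name, delay, timeout_s) for name, delay in calls.items()]
    return await asyncio.gather(*tasks)


results = asyncio.run(fan_out(
    {"search_docs": 0.001, "search_code": 0.001, "get_user_settings": 1.0},
    timeout_s=0.2,
))
print(results)
```

The slow call is cut off at the timeout, so the batch rejoins after roughly 0.2 seconds instead of 1 second, at the cost of handing the model an error string for that tool. This is only appropriate for calls that are safe to abandon.
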

## Implementation notes

A few things to watch for when building or operating a parallel tool-calling agent.

Match the tool to the parallelism. Tools that are pure reads, idempotent, and side-effect-free are safe to fan out aggressively. Tools that write, mutate state, or have side effects should be reviewed before the agent is allowed to issue them in parallel. A user-confirmation step is reasonable for the write tools.

Measure rejoin latency. The total wall-clock latency of a parallel agent step is max(tool latencies) plus the rejoin inference call. If you only look at average tool latency, you will miss where the time actually goes.

Watch for redundant calls. Sequential agents naturally avoid them. Parallel agents do not. Add observability that counts parallel calls per turn, the size of fan-outs, and the fraction of calls whose results were ignored.

Cap the fan-out width. Models will sometimes emit very wide fan-outs, ten or twenty calls in a single response, when prompted aggressively. Past a certain width, the straggler problem and the rejoin prompt cost outweigh the savings. A cap somewhere between four and eight is a reasonable default unless you have a specific workload that benefits from more.

## Closing thought

Parallel tool execution is one of the most useful agent latency optimizations available, but it only pays off when inference is fast enough that the planning and rejoin steps do not consume the savings. If you are designing an agent for a serving stack that struggles to keep model calls under a second, parallelism will help but the ceiling will be low. If you are designing for fast inference, parallelism becomes a tool you can apply selectively and aggressively, and the agent ends up feeling responsive in a way that sequential designs cannot match.

General Compute's inference stack is built to keep the planning and rejoin steps fast enough that fan-out is worth doing.
If you are building agents and finding that parallel tool calling is not paying off, try the [API](https://generalcompute.com) and measure the rejoin latency against what you have now. The difference is usually where the speedup lives.

---

# Agent Memory Systems: Balancing Context Length vs Retrieval Latency

> How agents reconstruct memory between turns, and the latency trade-offs between long context, RAG, summarization, and KV cache reuse.

- Author: General Compute
- Published: 2026-05-12
- Tags: agents, memory, rag, kv-cache, inference, latency
- Canonical: https://www.generalcompute.com/blog/agent-memory-systems-balancing-context-length-vs-retrieval-latency

---

An agent has no memory between turns. Each call to the model is stateless. Whatever the agent "remembers" about a session, a user, or its own past actions has to be reconstructed and placed into the prompt every time the model runs. How you reconstruct that memory is the agent memory system, and the design space has real latency consequences.

The three dominant approaches are long context (put everything back into the prompt), retrieval-augmented generation (store memory in an index and pull out the relevant slices), and summarization (compress history into shorter blobs the model can reread). A fourth, less-discussed option keeps the KV cache itself warm across calls. Each one trades a different axis: prefill cost, retrieval latency, information loss, and complexity. The right choice depends on the access pattern of the agent, not on which technique is currently fashionable.

This post walks through the four approaches that show up in real systems, attaches realistic latency numbers to each, and explains the hybrid pattern most production agents end up using.

## What "memory" actually means in an agent

The word memory in this context covers three distinct things, and the literature often blurs them.

Conversation memory is what the agent has said and done in this session: prior tool calls, prior responses, the user's recent messages.
It grows monotonically until something compacts it.

User memory is what the agent knows about the person it is helping. Preferences, prior projects, name, tone, things they have asked before across sessions. This is durable and typically lives in a database.

Working memory is what the agent has temporarily loaded for the current task: the file it is editing, the documentation page it pulled, the API spec it needs to follow. This is short-lived and per-task.

A memory system has to handle all three, and each one has a different latency profile. Conversation memory wants fast prefill caching. User memory wants fast vector retrieval. Working memory wants fast tool calls. A single design that ignores the difference will get one of them wrong.

## Approach 1: Long context

The simplest approach is to send everything back to the model on every call. The conversation history, the relevant files, the user profile, all of it. Modern models with 128K or 200K context windows can technically hold a lot.

The cost is prefill. A 50,000 token context, processed cold, takes between 1.5 and 4 seconds depending on the backend, the model size, and whether prefill is chunked. If your agent runs 20 model calls in a task, paying full prefill on each is enough to ruin the experience by itself.

This is where prefix caching helps. If the front of the context is stable across calls (system prompt, fixed memory blocks, the conversation up to turn N), a serving stack with prefix caching only pays prefill cost on the new portion. Done well, this turns a 50K token prefill into a 2K token prefill, with the corresponding latency reduction.

The catch is that not every backend implements prefix caching well. Some only cache the system prompt. Some invalidate the cache on small differences. Some claim to cache but show a small fraction of the theoretical speedup. If your agent strategy depends on prefix caching, test it specifically with your prompt structure rather than trusting the marketing copy.

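
One way to reason about prompt structure before testing against a real backend is to measure how much of consecutive prompts is shared from the front. The sketch below is a toy model: it treats whitespace-separated words as tokens (a rough stand-in for a real tokenizer), and `build_prompt` and `shared_prefix_len` are hypothetical helpers.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading tokens two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def build_prompt(stable: list[str], turns: list[str], volatile: str,
                 volatile_first: bool) -> list[str]:
    """Join prompt parts and split on whitespace as a rough token stand-in."""
    parts = [*stable, *turns]
    parts = [volatile, *parts] if volatile_first else [*parts, volatile]
    return " ".join(parts).split()


STABLE = ["You are a coding agent.", "User prefers Python."]
turn_1 = ["user: fix the bug"]
turn_2 = turn_1 + ["assistant: looking at tests"]

for volatile_first in (False, True):
    p1 = build_prompt(STABLE, turn_1, "time=10:01", volatile_first)
    p2 = build_prompt(STABLE, turn_2, "time=10:02", volatile_first)
    reusable = shared_prefix_len(p1, p2)
    print(f"volatile_first={volatile_first}: {reusable} of {len(p2)} tokens reusable")
```

With the volatile field at the end, the whole earlier prompt is a reusable prefix. With it at the front, nothing is, which is the kind of small difference that can silently defeat a prefix cache.
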
Long context also has a quality ceiling. Models attend to the middle of long contexts less reliably than to the ends. The "lost in the middle" finding from Liu et al. is well replicated and shows up in agent workloads as silently degraded recall. If your agent's reasoning depends on a fact buried at position 30,000 of a 60,000 token context, it may not reliably use that fact.

## Approach 2: Retrieval-augmented memory

The RAG approach treats memory as an external store. You embed pieces of memory (past conversation turns, documents, user facts) into a vector database. At each turn, you query the database for the K most relevant items and inject them into the prompt. The model sees a much shorter context because only the relevant slices are present.

This sounds clean. In practice, two latencies matter.

The first is retrieval itself. A well-tuned vector search over a few million items returns in 20 to 80 milliseconds. A poorly tuned one, or one that uses a heavy cross-encoder reranker, can take 300 to 500 milliseconds. Add network round trips and the overhead of constructing the query embedding (another model call, usually) and you can spend 200 to 800 milliseconds on retrieval before the main model has even started.

The second is the lost prefix cache. Because the retrieved chunks change between calls, the prompt structure changes, which busts the cache. You save on total prefill tokens but pay prefill cost on what is left. If the retrieved memory adds 8K tokens of fresh prefill on every turn, you have not saved as much as you think compared to a long context with prefix caching.

RAG also has a quality problem that is specific to agents: relevance is measured against the current query, but agents need information that may be relevant to the next query, several steps ahead. A retrieval system tuned for single-turn QA will under-retrieve for an agent doing multi-step reasoning.
Tuning retrieval for agent workloads is its own field, and the latency cost grows fast when you add reranking, hybrid retrieval, or multi-query expansion.

## Approach 3: Summarization and compaction

The third approach periodically compresses the agent's history into shorter summary blocks. Once the conversation gets past some token threshold, an asynchronous job (or an inline call) summarizes the oldest turns and replaces them with a summary. The agent sees a stable context that grows slowly even when the underlying conversation runs for hours.

The latency profile here is interesting. Each summarization call is not free: it is itself a model call, usually with a long input and a moderate output. Done synchronously, it adds 2 to 4 seconds at the point of compaction. Done asynchronously, it adds nothing to the immediate response but creates a queue you have to manage and a window where the agent has both the raw history and the pending summary in memory.

Summarization is lossy. The summary contains less than the original, by design. The art is in choosing what to preserve. Most production systems preserve tool call results, decisions, and user statements, while compressing reasoning chains. This works most of the time and breaks the times it does not. If the agent's later step needed a specific fact that was discarded, the agent fails in a confusing way.

For agents running over long sessions, summarization is unavoidable. Context windows do not scale faster than user expectations for session length. The question is not whether to compact, but when and how aggressively, and which other approaches to combine it with.

## Approach 4: KV cache as memory

A less-discussed option treats the KV cache itself as the memory medium. The prefill cost is the price you pay to load memory into the model's working state. If you can keep the KV cache resident, subsequent calls can skip prefill on the cached portion entirely.

This is what prefix caching does within a single conversation, but the same mechanism can extend further. Some inference stacks (SGLang's RadixAttention, vLLM's prefix caching, certain custom serving paths) maintain cross-request KV caches. If two requests share a prefix, the second reuses the first one's cache. For agent workloads where many sessions share a long system prompt and a memory block, this is a five to ten times speedup on TTFT for steps after the first.

The cost is memory pressure on the inference backend. KV caches are not small. A single 50K token cache on a 70B model can occupy a few gigabytes of GPU memory. Keeping many of them resident requires either a lot of headroom or a smart eviction strategy. Most public inference providers do not expose this level of control, so you cannot decide which sessions stay warm. Custom inference stacks can.

There is also a fragility to KV-cache-as-memory. Any change in the prefix (a single edit to the system prompt, a different memory block, a reordered tool list) invalidates the cache. Production systems that depend on this savings have to be careful about prompt stability, and that constraint propagates into how the memory system can update itself.

## The hybrid pattern most agents converge on

Production agent systems usually combine all four approaches.

A short window of recent turns sits in the model's context verbatim. This is the conversation memory layer. Prefix caching handles the repeated work.

Older turns are summarized and stored as compressed blocks. The summary is included in the system prompt up to some token budget. This is the compaction layer.

Durable user memory and project knowledge live in a vector store. The agent retrieves a few relevant chunks per turn. This is the retrieval layer.

The KV cache, where available, holds the stable parts warm across calls. This is the infrastructure layer.

Each piece exists because the others have failure modes.
RAG misses things, summaries lose details, long context degrades in the middle, and KV cache costs memory. The combination is more robust than any one piece, and it is also more complex.

The latency budget for a single agent turn under this design looks roughly like this:

| Component | Latency |
|-----------|---------|
| Vector retrieval | 50 to 200 ms |
| Prompt assembly | 5 to 20 ms |
| Prefill on uncached portion | 100 to 400 ms |
| Decode | 1 to 5 s |
| Total | about 1.2 to 5.6 s |

The decode usually dominates the total. If your retrieval system is slow, or your prefix cache is not actually hitting, or your summarization is happening synchronously, the budget breaks at the link that is leaking. Running an agent in production is largely the work of finding and fixing those leaks.

## What inference speed does to the design space

Faster inference changes the relative weights of these approaches. With slow inference, summarization is painful because each summarization call is itself slow, so you avoid it and lean on retrieval. With fast inference, you can summarize more aggressively because the cost is lower, and you can also tolerate longer contexts because the prefill is faster.

The same logic applies to the cost side. If a model call costs ten cents in latency-adjusted user value, you avoid extra calls. If it costs one cent, you spend liberally on memory hygiene calls (summarization, reranking, multi-query retrieval) because they make the final answer better. Memory design becomes an optimization problem in dollars per quality point, and the optimum shifts when inference gets cheaper or faster.

This is one of the reasons fast inference matters more for agents than for chat. A chat application has one inference call per user turn, and the user is forgiving. An agent has many calls per task, and the user is watching the wall clock. Speeding up the model by a factor of three does not just speed up the response. It changes what is feasible to do inside the loop.

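
To make the factor-of-three point concrete, the per-turn budget from the table earlier in this post can be tallied in a few lines. This is a back-of-the-envelope sketch: it assumes only prefill and decode scale with inference speed, and the 20 calls per task is an illustrative figure, not a measurement.

```python
# Per-turn budget in milliseconds as (low, high), taken from the latency table.
BUDGET_MS = {
    "vector_retrieval": (50, 200),
    "prompt_assembly": (5, 20),
    "prefill_uncached": (100, 400),
    "decode": (1000, 5000),
}
# Assumption: only these rows get faster when inference gets faster.
MODEL_BOUND = {"prefill_uncached", "decode"}


def turn_total_ms(budget: dict, speedup: float = 1.0) -> tuple[float, float]:
    """Sum the low and high ends, dividing model-bound rows by `speedup`."""
    lo = hi = 0.0
    for name, (low, high) in budget.items():
        scale = speedup if name in MODEL_BOUND else 1.0
        lo += low / scale
        hi += high / scale
    return lo, hi


baseline = turn_total_ms(BUDGET_MS)
faster = turn_total_ms(BUDGET_MS, speedup=3.0)

CALLS_PER_TASK = 20  # illustrative
for label, (lo, hi) in (("baseline", baseline), ("3x faster", faster)):
    print(f"{label}: {lo * CALLS_PER_TASK / 1000:.1f} to "
          f"{hi * CALLS_PER_TASK / 1000:.1f} s per task")
```

Once the model-bound rows shrink, retrieval becomes a larger share of each turn, which is one reason to treat retrieval latency as its own metric.
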
## Picking the right combination for your agent A few rules of thumb from looking at real systems. If the agent's sessions are short (under five turns) and the user expects near-instant responses, lean on long context with aggressive prefix caching. Skip RAG until you have evidence you need it. Skip summarization entirely. The complexity is not worth it at that session length. If the agent runs for many turns per session (a coding agent, a long support conversation, a research assistant), you need compaction. Run summarization asynchronously when you can, synchronously at well-defined breakpoints when you cannot. Budget for two to four seconds of summarization latency every N turns and design the UX around it. If the agent draws on a large durable knowledge store (documentation, user records, prior conversations across sessions), you need retrieval. Spend the time to tune retrieval for your access pattern, including a reranker if the budget allows. Treat retrieval latency as a first-class metric, not an afterthought. If the agent serves many users with shared structure (the same system prompt, the same tool definitions, the same boilerplate), invest in a serving stack with cross-request prefix caching. The savings compound across users. The trap to avoid is adopting all four approaches reflexively because some blog post said agents need them. Each one has a latency cost. Each one adds failure modes. The simplest design that meets your access pattern is the right one, and you can add complexity when you measure that you need it. ## Where General Compute fits We run inference for agent workloads where memory churn is part of the access pattern. Prefix caching, fast prefill on growing contexts, and steady decode throughput are not optional in this setting. They are the things that decide whether the agent's memory layer can be designed for quality or has to be designed around the limits of a slow backend. 
If you are building an agent and the memory system is the part that feels expensive or fragile, the inference layer underneath it is doing more to shape the design than people usually credit. Our API is OpenAI-compatible and tuned for agent-shaped workloads. If your latency budget on memory operations is breaking the experience, that is the kind of problem worth bringing to us. --- # Building a Code Agent: Why Each Step Needs Sub-Second Inference > A practical breakdown of the latency budget inside a code agent, step by step, and why every link in the chain needs to land under a second to keep the loop usable. - Author: General Compute - Published: 2026-05-11 - Tags: coding-agents, inference, latency, developer-tools, agents - Canonical: https://www.generalcompute.com/blog/building-a-code-agent-why-each-step-needs-sub-second-inference --- A code agent is a loop. You give it a goal, it reads files, runs commands, edits code, runs tests, and reads the output of those tests. Each cycle has at least one model call in it, often several. The user perceives the agent as fast or slow based on the total wall clock between the moment they hit enter and the moment the agent stops emitting tokens. That number is the sum of every step inside the loop, and the loop usually runs many times. This is the part of agent design that people gloss over when they sketch the architecture on a whiteboard. The boxes look small. The arrows look short. You can fit a whole agent on one slide. What the slide does not show is that an average task touches the model fifteen to forty times, that each model touch has a prefill and a decode and a structured output pass, and that the user experiences the cumulative result rather than any single call. This post walks through the actual steps of a code agent, attaches a latency budget to each, and explains why anything slower than roughly a second per step pushes the total task time into a range that breaks the interactive feel. 
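The per-step estimates in the sections that follow all reduce to the same back-of-envelope model: time to first token plus output tokens divided by decode throughput. A minimal sketch of that arithmetic, with hypothetical function names and the post's own example figures as inputs:

```python
def step_latency_s(ttft_s: float, output_tokens: int, decode_tps: float) -> float:
    """Estimate one model call: time to first token plus decode time."""
    if decode_tps <= 0:
        raise ValueError("decode_tps must be positive")
    return ttft_s + output_tokens / decode_tps


def per_call_budget_s(task_budget_s: float, model_calls: int) -> float:
    """Average time each call may take if the whole task must fit the budget."""
    if model_calls <= 0:
        raise ValueError("model_calls must be positive")
    return task_budget_s / model_calls


if __name__ == "__main__":
    # Twenty-second task, twenty model calls.
    print("average budget per call (s):", per_call_budget_s(20.0, 20))
    # Planning call: 0.3 s to first token, 150 output tokens at 80 tokens/s.
    print("planning step (s):", step_latency_s(0.3, 150, 80.0))
    # Decode-only view of a 500 token diff at 80 tokens/s.
    print("500 token diff (s):", step_latency_s(0.0, 500, 80.0))
```

The model ignores queueing, network round trips, and tool execution time, so it is a lower bound on what a user actually waits for.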
## The latency budget you actually have A developer using a code agent inside their editor expects the agent to keep pace with their attention. Twenty seconds of wall time is the upper edge of acceptable for a non-trivial task. Forty seconds and the developer alt-tabs. Two minutes and they go look at Slack and probably never come back to that tab in the same flow state. If you accept a twenty second budget for a task that involves twenty model calls in some combination of planning, tool selection, code generation, and review, you have one second per call on average. That is the headline number. It does not say every call must finish in a second, because some calls are cheap and some are expensive, but it says the average has to land there, and the slowest call cannot eat the whole budget. A second per call sounds generous. It is not. Most of that second is spoken for before you write any code, because of how prefill and decode interact with structured output. Let us go through the steps. ## Step 1: Reading the user request and planning The first step usually involves a model call where the agent reads the user's instruction along with whatever context it has about the project, and either produces a plan or decides which tool to call first. This call has a moderately large input (system prompt, file tree summary, conversation history) and a small to moderate output (a plan or a tool call). The latency profile is dominated by time to first token. With an 8,000 token input and a fast prefill backend, you can get to first token in 200 to 400 milliseconds. Decoding 150 tokens of plan at 80 tokens per second adds another 1.9 seconds. So a "fast" planning step is already 2.1 to 2.3 seconds. If your inference backend has a slower prefill or a slower decode (50 tokens per second is common on contended endpoints), the same step takes 3.5 to 4 seconds. You are now one step in and either at 10 to 12 percent of your budget or at 20 percent of it. 
The rest of the steps still need to fit. ## Step 2: Tool selection and tool calls Code agents call tools constantly: read a file, search the codebase, run a shell command, list a directory. Each of these tool calls is preceded by a model inference that decides which tool to use and produces the structured arguments for it. Then the tool runs (typically fast, under 100 milliseconds for file IO). Then the model reads the tool result and decides what to do next. A tool selection call has a small output: usually under 100 tokens of structured JSON for the tool name and arguments. The trap is that structured output generation is slower than freeform decoding on most inference stacks because of constrained decoding, schema validation, and the lower batchability of small structured outputs. A realistic profile for a tool call inference is 300 milliseconds of time to first token plus 700 milliseconds to produce the JSON. That is one second per tool call decision, on a fast backend. The agent's loop typically performs three to eight tool calls before it has enough context to produce a code edit. That is three to eight seconds spent just on tool decisions, before any code has been written. ## Step 3: Reading tool results and reasoning Once tools return data, the agent has to read it. If the tool was a file read, the model now has a 2,000 to 10,000 token addition to its context. The prefill for the next call has to process those new tokens. Without prefix caching, you pay full prefill cost for the entire context every time. With prefix caching, you pay only for the new portion, but the savings depend on whether the cache is warm and whether the serving stack actually streams the cached prefix at the speed it claims. The reasoning step itself may or may not be visible to the user. Some agent frameworks separate "think" turns from "act" turns. Others fold the reasoning into the same call as the next tool selection. Either way, the model is generating tokens. 
A thinking step that produces 300 tokens of internal reasoning takes 3.75 seconds at 80 tokens per second, or six seconds at 50 tokens per second.

This is where the latency starts to feel oppressive. The user sees the agent "thinking" with nothing visible to them, and the longer the think turn, the more it looks like the system is stuck. Streaming the thinking helps, but only if the front end is built to display it as it arrives.

## Step 4: Generating code

The code generation step is the one users tolerate the longest, because they can see the code appearing and they understand intuitively that more code takes more time. A 500 token diff at 80 tokens per second is 6.25 seconds. A 1,500 token rewrite of a file is nearly 19 seconds. These are real numbers from real workloads.

The interesting thing about code generation is that it is the only step where the decoder throughput dominates the latency, because the input is already in context from earlier steps and the output is the bulk of the work. This is the step where speculative decoding pays off the most, because correct guesses can multiply effective throughput by two or three times. If your inference stack supports speculative decoding for the size of model you are running, this step gets cheaper without changing the model.

It is also the step where chunked prefill and continuous batching matter, because long generations interact with other requests in the same batch in ways that can starve them. We have written about both elsewhere on this site.

## Step 5: Running tests and reading the results

The agent runs tests as a subprocess. The test run is not inference time, but it still counts against the wall clock. A fast test suite returns in three to ten seconds. A slow one takes minutes. The agent then has to read the test output, which is usually a few hundred to a few thousand tokens of failure output, and decide whether to iterate.
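A minimal sketch of this step, assuming the agent shells out to a test command and wants both the wall-clock cost and a bounded slice of output to feed back to the model. The function name, the timeout, and the choice to keep only the tail of the log are illustrative assumptions rather than any framework's API.

```python
import subprocess
import sys
import time


def run_tests(command, timeout_s=120.0, max_output_chars=8000):
    """Run a test command, time it, and keep only the tail of its output."""
    start = time.perf_counter()
    try:
        proc = subprocess.run(
            command,
            capture_output=True,  # collect stdout and stderr for the model
            text=True,
            timeout=timeout_s,
        )
        passed = proc.returncode == 0
        output = proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        passed = False
        output = f"test run exceeded {timeout_s} seconds"
    elapsed_s = time.perf_counter() - start
    # Assumption: the most useful failure details sit near the end of the log,
    # so truncating from the front keeps the prompt small.
    return passed, elapsed_s, output[-max_output_chars:]


if __name__ == "__main__":
    # Stand-in for a real test command such as ["pytest", "-q"].
    ok, seconds, tail = run_tests([sys.executable, "-c", "print('1 passed')"])
    print(ok, round(seconds, 3), tail.strip())
```

Capping the output matters for the next call: every extra thousand tokens of test log is extra prefill on the decision step that follows.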
The decision step here is another tool selection or planning call, with the same latency profile as step 2: roughly one second on a fast backend. If the tests passed, the agent emits a completion message. If they failed, the loop restarts from step 3 or step 4 with the new error context.

## What this adds up to

A simple task that runs the loop once:

| Step | Time on fast backend | Time on slow backend |
|------|---------------------|---------------------|
| Plan | 2.2s | 4.0s |
| Tool calls (5) | 5.0s | 9.0s |
| Read results and reason | 2.0s | 4.0s |
| Generate code | 6.0s | 12.0s |
| Run tests | 5.0s | 5.0s |
| Verify | 1.5s | 3.0s |
| Total | 21.7s | 37.0s |

A task that needs two iterations of the loop (which is more common than one-shot success in real workloads) doubles most of those rows. The fast backend lands at around 35 seconds, the slow one at 65 seconds or more. The fast backend feels usable. The slow one does not.

The "sub-second per step" framing is a useful target because most of the steps above were budgeted at one to two seconds each on the fast column. The total breaks when individual steps slip toward two or three seconds, because the slips compound across the loop.

## What gets you to sub-second steps

A few things move the needle, in roughly the order of how much they matter for an agent workload.

First, raw decode throughput. A model that decodes at 200 tokens per second instead of 80 makes the code generation step two and a half times faster, and most of the other steps proportionally faster too. The model size is not the whole story here. A 70B model on hardware tuned for low-latency decoding can outperform a 13B model on a stack tuned for batch throughput, in the regimes that matter for agents.

Second, time to first token. Many agent steps have short outputs but moderate to large inputs. TTFT dominates the latency for these.
Backends that parallelize prefill aggressively, or that use chunked prefill to keep decode running during prefill, win here. Third, structured output performance. Tool calls dominate the inference count in most agent workloads. If your stack is twice as slow at structured output as at freeform decoding, you have effectively doubled the latency of half your calls. Some inference providers handle this well, some do not. Test it explicitly. Fourth, prefix caching that actually works. Code agents accumulate context as the loop progresses, and the prefix grows monotonically until something compacts it. If the serving stack reuses the KV cache from prior calls in the same conversation, every step after the first one gets a faster prefill. This is one of the few places where infrastructure can give you a five to ten times speedup on prefill without changing anything in the model. Fifth, parallelism where it is safe. Some agent steps can run in parallel: multiple file reads, multiple search queries, multiple lints. If the agent framework can dispatch these concurrently and the inference backend can serve them without queueing, the loop tightens without losing correctness. This requires both an agent design that supports it and an inference endpoint with the headroom to actually run requests in parallel. ## Where General Compute fits We built our inference stack on custom silicon because the standard GPU serving path is not optimized for the access pattern that agents have. Short structured outputs, frequent prefill on growing contexts, and long tail latency on a small percentage of calls all hurt agents disproportionately. Our hardware path is shorter on the small calls and steadier on the large ones, which is the shape that matters when you are summing twenty calls into a single user-perceived wall clock. If you are building a code agent and the per-step latency budget is what is breaking your product, our API is OpenAI-compatible and tuned for exactly this workload. 
Try it on your hardest test cases, the ones where current providers feel too slow, and see what the loop time looks like. The math in this post is the math we optimize for every day. --- # ReAct, Reflexion, and Chain-of-Thought: The Inference Cost of Reasoning Patterns > Popular agent reasoning patterns are described as prompt techniques, but they are inference cost multipliers. Here is how ReAct, Reflexion, and Chain-of-Thought actually shape the bill and the latency. - Author: General Compute - Published: 2026-05-10 - Tags: agents, reasoning, react, reflexion, chain-of-thought, inference, latency - Canonical: https://www.generalcompute.com/blog/react-reflexion-and-chain-of-thought-the-inference-cost-of-reasoning-patterns --- If you read enough agent papers, you start to notice a pattern. Each one introduces a "method" with a name (ReAct, Reflexion, Tree-of-Thought, Self-Refine, Chain-of-Verification), a clever prompt template, and a benchmark table showing the new method beats the old method by a few points. The methods are described as prompting techniques. The benchmarks rarely report wall-clock latency, and almost never report total tokens generated per task. In production, those numbers are the whole story. A reasoning pattern is not really a prompt template. It is a recipe for how many model calls a single user request will fan out into, how long the chain of dependencies between those calls runs, and how many tokens get generated on the way to the answer the user actually sees. The differences between Chain-of-Thought, ReAct, and Reflexion are not subtle from an infrastructure perspective. They are 2x, 5x, and 20x. This post walks through what each of these patterns actually does at the inference layer, where the cost goes, and why the choice of reasoning pattern is one of the highest-leverage architectural decisions in any agent system. ## Chain-of-Thought: a single call with extra tokens Chain-of-Thought is the simplest of the three, and the cheapest. 
The prompt asks the model to "think step by step" before answering, and the model produces a stretch of intermediate reasoning followed by a final answer. There is one inference call. The user sees one response. The only cost over a non-CoT call is the extra output tokens for the reasoning trace. That extra cost is real but bounded. A typical CoT trace adds 100 to 500 output tokens to a response that might otherwise have been 50 tokens. On a model that decodes at 100 tokens per second, that translates to one to five extra seconds of latency. On a model that decodes at 1,000 tokens per second, the user barely notices. Two things make CoT work well in production. First, the entire reasoning happens in a single decode pass, which means the KV cache is reused across the whole trace and the model never has to re-read its own intermediate output. Second, you can stream the output to the user, so even though the total response is longer, time-to-first-token is unchanged. If your UI is set up to show "thinking" indicators while the model produces its trace, the perceived latency can actually be better than a terse non-CoT response, because the user sees progress. The trap with CoT is that it is so cheap teams stop noticing the extra tokens. A 200 token reasoning trace on every customer-facing call adds up to real money at scale, especially if the trace is mostly boilerplate ("Let me think about this carefully. The user is asking about..."). Periodic audits of how much of your output token spend is reasoning versus answer are worthwhile. ## ReAct: interleaving thought, action, and observation ReAct (Yao et al., 2022) extends Chain-of-Thought by interleaving reasoning with tool use. The model alternates between Thought, Action, and Observation steps. It produces a thought, decides on an action (a tool call), the tool runs, the observation gets fed back into the prompt, and the model produces the next thought. 
This continues until the model decides it has enough information to answer. At the inference layer, this pattern is fundamentally different from CoT. Each Thought-Action cycle is a separate model call. The model emits a Thought and an Action, generation halts at the Action token boundary, the tool runs, and then a new prompt is constructed that includes the original prompt plus all prior Thoughts, Actions, and Observations. That new prompt gets fed back into the model for the next step. The cost structure has three components. First, the prompt grows linearly with the number of steps. Each round adds the prior Thought, the Action call, and the Observation (which is often the largest piece, especially if the tool returns search results or document chunks). After five steps, the prompt might be 10,000 tokens longer than it started. By step ten, the input cost dominates the output cost by a wide margin. Second, every step pays a fresh time-to-first-token. The model has to prefill the entire growing prompt on each round. Prefix caching helps if the inference stack supports it well and the framework constructs prompts in a stable order, but a non-trivial fraction of agent frameworks build prompts in ways that defeat caching (timestamps in system prompts, randomized example orderings, dynamically reordered tool descriptions). When caching fails, each step pays full prefill latency on a longer and longer prompt. Third, the steps are sequential by construction. ReAct does not parallelize. The Action in step N depends on the Observation in step N-1, which depends on the Action in step N-1. The total wall-clock time is the sum of every TTFT and every decode in the chain. A five-step ReAct loop on a model with 800ms TTFT and 100 tokens per second of decode, where each step generates about 80 tokens, takes roughly 5 * (0.8 + 0.8) = 8 seconds of pure model time, plus whatever the tools themselves take. 
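The arithmetic above can be sketched in a few lines. This is a simplified model under stated assumptions: TTFT is held constant across steps (in practice it rises as the prompt grows), tool time is excluded, and the 4,000 token starting prompt is a hypothetical figure chosen for illustration. The 2,000 tokens of growth per step corresponds to the "10,000 tokens longer after five steps" example.

```python
def react_model_time_s(steps, ttft_s, decode_tps, tokens_per_step):
    """Sequential model time for a ReAct chain, excluding tool execution."""
    per_step = ttft_s + tokens_per_step / decode_tps
    return steps * per_step


def cumulative_input_tokens(base_prompt_tokens, growth_per_step, steps):
    """Total input tokens processed across all steps when nothing is cached.

    Step i re-reads the base prompt plus everything appended by earlier steps,
    so the per-step prompt grows linearly and the running total grows
    quadratically.
    """
    return sum(base_prompt_tokens + i * growth_per_step for i in range(steps))


if __name__ == "__main__":
    # Five steps, 800 ms TTFT, 100 tokens/s decode, about 80 tokens per step.
    print("model time (s):", react_model_time_s(5, 0.8, 100.0, 80))
    # Hypothetical 4,000 token starting prompt, 2,000 tokens added per step.
    print("input tokens, 5 steps:", cumulative_input_tokens(4000, 2000, 5))
    print("input tokens, 10 steps:", cumulative_input_tokens(4000, 2000, 10))
```

The second function is why prefix caching matters so much for this pattern: without it, doubling the number of steps more than triples the input tokens the backend has to prefill.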
That is the gap between a paper's "ReAct improves on CoT by 4 points" and the production experience of "ReAct makes our agent feel sluggish." The benchmarks measure accuracy. The user feels the latency stack. ## Reflexion: ReAct with retries Reflexion (Shinn et al., 2023) adds a self-improvement loop on top of ReAct. After an attempt fails (or scores poorly on some self-evaluation), the agent reflects on what went wrong, produces a written critique of its own behavior, and tries again with the critique loaded into context. Some variants run several attempts and pick the best. This is where the cost numbers stop being polite. A Reflexion agent that runs three attempts of a five-step ReAct chain, with a self-critique inference call between each attempt, is doing roughly 3 * 5 + 2 = 17 model calls for a single user task, and each ReAct chain inside the loop is paying the linear prompt growth described above. The critique step itself is often expensive because it has to read the entire failed trajectory, which by attempt three can be 20,000 tokens or more. The latency is even worse than the token count suggests, because the attempts are sequential. You cannot critique a trajectory until it has finished. You cannot start the next attempt until the critique is done. A Reflexion run with three attempts and a five-step ReAct inner loop, on the same hardware as the example above, takes around 30 seconds of pure model time for the ReAct portions and another 5 to 10 seconds for the critique steps. Forty seconds before the user sees a final answer. Reflexion was developed for benchmark settings where you can afford to run many attempts and pick the best. It is genuinely useful in domains like code generation where you can run a unit test between attempts and bail out as soon as one passes. It is brutal in domains where you have to run all attempts to evaluate them, or where the self-critique is itself unreliable and the agent talks itself out of correct answers. 
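The call count and the sequential latency for a Reflexion run can be sketched as follows. The function names are hypothetical, and the time model assumes every ReAct chain takes the same 8 seconds as the earlier example; real runs are slower on later attempts because the critique and prior trajectory are loaded into context.

```python
def reflexion_model_calls(attempts, react_steps):
    """ReAct calls for every attempt plus one critique between attempts."""
    return attempts * react_steps + (attempts - 1)


def reflexion_model_time_s(attempts, react_chain_s, critique_s):
    """Sequential model time: attempts and critiques cannot overlap."""
    return attempts * react_chain_s + (attempts - 1) * critique_s


def calls_until_first_pass(first_passing_attempt, react_steps):
    """Calls spent when an external check (e.g. a unit test) stops the loop early."""
    return reflexion_model_calls(first_passing_attempt, react_steps)


if __name__ == "__main__":
    print("calls, 3 attempts x 5 steps:", reflexion_model_calls(3, 5))
    # Constant 8 s chains with critiques of 2.5 s and 5 s each.
    print("time range (s):",
          reflexion_model_time_s(3, 8.0, 2.5),
          reflexion_model_time_s(3, 8.0, 5.0))
    print("calls if attempt 1 passes:", calls_until_first_pass(1, 5))
```

With constant-cost chains the ReAct portion alone is 24 seconds, which is a floor rather than an estimate; growing prompts push it toward the figures quoted above. The early-exit function shows why gating on a passing test is so valuable: a first-attempt pass costs 5 calls instead of 17.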
## Tree-of-Thought and friends: branching the cost Tree-of-Thought (Yao et al., 2023) generalizes Chain-of-Thought into a search over reasoning steps. At each level, the model produces several candidate thoughts, an evaluator scores them, and the search expands the most promising branches. Variants like Graph-of-Thought, Algorithm-of-Thought, and Self-Consistency CoT are similar in shape. The inference cost for ToT is the product of the branching factor, the depth, and the per-step cost. With a branching factor of three, a depth of four, and an evaluator call at each level, you are looking at 3^4 = 81 leaf nodes plus 40 internal evaluator calls in the worst case, even before pruning. Real implementations prune aggressively, but even a moderately aggressive search produces 20 to 30 model calls per task. The good news for ToT is that the branches are independent within a level. With sufficient inference capacity, you can run all three children of a node in parallel. The bad news is that the evaluator step is a synchronization barrier: every branch at level N has to finish before the evaluator picks which to expand at level N+1. So even with parallelism, the wall-clock latency is the depth times the per-step latency, not just the per-step latency. In practice, ToT tends to live in research papers rather than production agent stacks, because the cost structure is hard to justify outside of problems where the search tree genuinely matters (theorem proving, certain planning tasks). For most agent workloads, the gain from search-based reasoning is smaller than the gain from making any one inference call faster. ## Self-Consistency: the cheap parallel cousin Self-Consistency CoT (Wang et al., 2022) takes a different approach to using extra compute. Instead of adding sequential steps, it runs the same Chain-of-Thought prompt N times in parallel with sampling, and picks the answer that appears most often (majority vote). Five samples is typical. 
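A minimal sketch of the sample-and-vote loop, using a thread pool to issue the samples concurrently. `sample_answer` is a hypothetical callable standing in for a model call with sampling enabled that returns only the extracted final answer; it is not part of any specific SDK.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


def self_consistency(sample_answer, prompt, n_samples=5):
    """Draw n_samples answers concurrently for the same prompt, then vote.

    `sample_answer(prompt)` is an assumed interface: it should call the model
    with a non-zero temperature and return the final answer as a string.
    """
    with ThreadPoolExecutor(max_workers=n_samples) as pool:
        answers = list(pool.map(sample_answer, [prompt] * n_samples))
    return majority_vote(answers)


if __name__ == "__main__":
    # Stand-in answers, as if five sampled reasoning traces had been parsed.
    print(majority_vote(["42", "41", "42", "42", "7"]))
```

The vote only works when answers can be compared directly, so in practice the final answer has to be extracted from each reasoning trace and normalized before counting.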
The interesting property of Self-Consistency from an inference perspective is that all five calls are independent. They share the same input prompt (so prefix caching is effective if the serving stack supports it), they run fully in parallel, and the only synchronization is the final vote. Wall-clock latency is roughly the same as a single CoT call, plus a small overhead for the vote. The cost is 5x the tokens, but the user-visible latency is barely affected.

This makes Self-Consistency one of the few reasoning patterns that scales cleanly with inference capacity. If your serving infrastructure can handle the parallel calls without queueing, you get the accuracy bump without the latency penalty. If your provider's headroom is tight, the parallel calls serialize and you get full sequential cost.

## Choosing a pattern is choosing an inference profile

The choice between these patterns is usually framed as a quality decision: which one gives the best accuracy on your benchmark? The infrastructure framing is at least as important.

Chain-of-Thought is essentially free in latency terms (especially with streaming) and adds bounded token cost. It should be the default for any task where you do not need tools.

ReAct is the standard for tool-using agents but it pays linear prompt growth and sequential latency. If your tools are fast, they are not where the latency comes from. The model calls themselves are. Faster inference, better prefix caching, and shorter system prompts all directly reduce ReAct latency.

Reflexion and other retry-based loops should be used selectively, in domains where the retry actually helps and the self-critique is reliable. The cost is high enough that you want it gated by an external signal (a failing test, a low confidence score, a user-visible error) rather than run by default.

Tree-of-Thought and other search-based patterns are expensive enough that they only make sense for problems where search structure is intrinsic to the task.
Self-Consistency is the cheapest way to spend more compute for more accuracy if you have the parallel inference capacity to support it. The common thread across all of these is that the reasoning pattern is doing the same thing as the underlying inference layer is doing: trading compute for quality. The pattern decides how the compute is shaped (sequential versus parallel, long context versus many calls, structured output versus free-form generation), and the inference layer decides how fast each unit of compute actually runs. When teams are unhappy with how their agent feels in production, the conversation tends to start at the prompt level, then move to the framework level, and only later get to the model and the inference layer. In our experience, the inference layer is usually where the largest single improvement is available, especially for ReAct-style patterns where sequential model calls dominate the latency budget. If you want to see how that plays out on your own agent traces, the OpenAI-compatible API is at [generalcompute.com](https://generalcompute.com). --- # Multi-Agent Architectures and the Inference Cost Explosion > Orchestrator and worker patterns make multi-agent systems easy to design and expensive to run. Here is where the inference cost actually goes, and what it means for the infrastructure underneath. - Author: General Compute - Published: 2026-05-09 - Tags: agents, multi-agent, inference, latency, cost - Canonical: https://www.generalcompute.com/blog/multi-agent-architectures-and-the-inference-cost-explosion --- Multi-agent architectures look elegant on a whiteboard. You have an orchestrator that breaks a task into subtasks, a pool of workers that handle the subtasks, and a critic or aggregator that reviews and merges the results. Each agent has a clean role. Each prompt is small and focused. The mental model maps onto how a team of humans would split up the same job. 
Then you put it into production and the inference bill is four to twenty times what a single-agent system would cost on the same workload. Latency on the user-visible path is worse, not better, because the orchestrator is now serially gated on its workers. The system is harder to debug because failures show up two or three hops away from where they originated. The promise of decomposition is real, but the cost structure is not what most teams expect when they start. This post is about why multi-agent systems are so much more inference-hungry than they look, where the cost actually accumulates, and which architectural choices change the math. ## What "multi-agent" actually means at the inference layer When practitioners say multi-agent, they usually mean one of three patterns. The first is the orchestrator-worker pattern. A planner LLM reads the task, produces a plan, and delegates pieces of it to worker LLMs. The workers run in parallel or in sequence, return results, and the orchestrator decides what to do next. Most agent frameworks ship some version of this as their default abstraction. The second is the role-playing pattern. A handful of agents each carry a persona: a researcher, a coder, a reviewer, a summarizer. They take turns producing output, often with a shared scratchpad or message bus between them. CrewAI and AutoGen popularized this style. The third is the debate or critique pattern. Multiple agents independently produce candidate answers, then one or more critic agents compare them, and a final agent picks or merges. This shows up in research papers more than production systems, but the inference profile is similar to the others. All three patterns share one property: they replace a single inference call with many inference calls, each of which carries its own prompt overhead, its own time-to-first-token, and its own decode budget. The interesting question is not whether this costs more (it does), but where the multiplier actually comes from. 
## The prompt overhead multiplies, not the useful work

Consider a single-agent system that handles a customer support task. The model gets a system prompt explaining the company, the tools available, and the policy guardrails. Maybe 4,000 tokens. It gets the conversation history, maybe 2,000 more tokens. It produces a 200 token response. Total: 6,000 input tokens, 200 output tokens, one model call.

Now decompose the same task into a three-agent system: a triage agent classifies the request, a specialist agent handles the resolution, and a quality reviewer agent checks the response before sending. Each of these agents needs its own system prompt, because the triage agent should not have the specialist's tools and the reviewer should not have either of theirs. Each agent also needs context from the prior steps, because none of them have access to the original conversation by default.

The triage agent reads 4,000 tokens of its own system prompt plus 2,000 tokens of conversation. The specialist reads 4,000 tokens of system prompt, 2,000 tokens of conversation, and the triage output. The reviewer reads its own 3,000 token system prompt, the conversation, the triage output, and the specialist's draft. You have gone from 6,000 input tokens to roughly 17,500 input tokens, almost three times as many, and the actual user-visible output is the same 200 tokens.

This is the part that surprises teams when they see the bill. Output tokens are usually a small fraction of cost. Input tokens are where multi-agent systems balloon, because every agent in the chain needs context, and that context overlaps significantly between agents. You are paying to re-feed the same conversation history through three different prompts, each with its own framing.

Prefix caching helps if your serving stack supports it well, and if your agent framework happens to construct prompts in a way that produces stable prefixes.
But the typical orchestrator-worker setup actively defeats caching, because the worker prompts include the orchestrator's task description, which changes on every call. ## Latency stacks badly in the orchestrator pattern The naive expectation is that multi-agent systems can run faster than single-agent systems, because workers can execute in parallel. In practice, parallelism only helps when the workers are genuinely independent and the orchestrator can dispatch them all at once. Most orchestrator-worker setups are not like this. A common pattern is sequential delegation: the orchestrator decides what to do, dispatches a worker, reads the result, decides what to do next, dispatches another worker. Each step has its own time-to-first-token and decode time. If your model has a 600ms TTFT and produces 100 tokens per second, a single orchestrator step that emits a 50 token plan takes about 1.1 seconds. A worker step that produces a 200 token result takes about 2.6 seconds. A four-step plan with one worker per step is 4 * (1.1 + 2.6) = 14.8 seconds before the orchestrator even produces its final answer. A single-agent system handling the same task would have one prefill, one decode, and would stream tokens to the user as they were produced. The user would see output starting at around 600ms and finishing around 8 to 10 seconds, with the appearance of progress the entire time. The multi-agent version finishes nominally faster on total compute, perhaps, but the user sees nothing for 14 seconds and then a sudden burst at the end. When workers can run in parallel, the math improves, but only if the serving stack can actually execute them in parallel without queueing. On a shared inference endpoint, parallel worker calls compete for the same batch slots as everyone else's traffic. If your provider does not have headroom, your "parallel" workers serialize behind queue contention, and you are back to sequential latency with extra steps. 
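The latency figures in this section can be reproduced with a short sketch. It uses the same inputs as the example (600 ms TTFT, 100 tokens per second, a 50 token plan, a 200 token worker result); the fan-out function and its `concurrency` parameter are hypothetical additions that model how many worker calls the endpoint actually runs at once.

```python
import math


def call_latency_s(ttft_s, output_tokens, decode_tps):
    """One model call: time to first token plus decode time."""
    return ttft_s + output_tokens / decode_tps


def sequential_delegation_s(steps, ttft_s, decode_tps, plan_tokens, worker_tokens):
    """Orchestrator plans, one worker runs, repeat for every step."""
    orchestrator = call_latency_s(ttft_s, plan_tokens, decode_tps)
    worker = call_latency_s(ttft_s, worker_tokens, decode_tps)
    return steps * (orchestrator + worker)


def fan_out_s(workers, concurrency, ttft_s, decode_tps, plan_tokens, worker_tokens):
    """One orchestrator plan, then workers dispatched in waves.

    `concurrency` is an assumed knob for how many worker calls are served in
    parallel; a value of 1 models full serialization behind queue contention.
    """
    orchestrator = call_latency_s(ttft_s, plan_tokens, decode_tps)
    worker = call_latency_s(ttft_s, worker_tokens, decode_tps)
    waves = math.ceil(workers / concurrency)
    return orchestrator + waves * worker


if __name__ == "__main__":
    print("sequential, 4 steps (s):", sequential_delegation_s(4, 0.6, 100.0, 50, 200))
    print("fan-out, 4 in parallel (s):", fan_out_s(4, 4, 0.6, 100.0, 50, 200))
    print("fan-out, serialized (s):", fan_out_s(4, 1, 0.6, 100.0, 50, 200))
```

Under these assumptions, true parallel dispatch finishes in about 3.7 seconds, while the same four workers serialized behind a queue take about 11.5 seconds. The gap between those two numbers is set entirely by the endpoint's headroom, not by the agent design.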
## Tool calls compound the multiplier Most multi-agent systems are also tool-using. The orchestrator calls workers, the workers call tools, the tools return data, the workers process it, the orchestrator aggregates. Each tool call is itself an inference step that produces a structured output, and structured output generation is one of the slowest regimes for most serving stacks (we covered this in the post on tool calling latency). If a worker needs three tool calls to complete its subtask, and the orchestrator delegates to four workers, you have twelve tool calls plus four worker completions plus one orchestrator decision step. Seventeen inference calls to handle a single user request, each with its own prefill, its own decode, and its own structured output overhead. That is the cost explosion. The same task done by a single agent with the same tools would also produce three tool calls per logical subtask, but the agent would amortize the prefill across them through KV cache reuse, and the orchestrator overhead would not exist at all. The decomposition into multiple agents does not reduce the tool calls. It adds inference steps on top of them. ## The infrastructure assumptions break down Most inference infrastructure is built around a request-response model with one prompt in, one stream out. Multi-agent systems violate this assumption in ways that interact poorly with how serving stacks are tuned. Continuous batching, which is how modern inference servers extract throughput from GPUs, works best when individual requests are long enough to fill a batch slot for a meaningful number of decode steps. Multi-agent systems produce many short generations: a 30 token plan, a 50 token tool call, a 100 token critique. Each of these is a request that joins the batch, decodes briefly, and leaves. The throughput hit from request churn is real and shows up as lower tokens-per-second per GPU than the same hardware achieves on chat workloads. 
Prefix caching, as mentioned, is degraded by the way orchestrator-worker prompts are constructed. KV cache reuse across agent boundaries is essentially impossible unless the framework is explicitly designed for it, because each agent has its own system prompt and the cache keys do not match. Speculative decoding still helps within a single agent step, but the gains do not compound across agents the way they would within one long generation. A spec decoding speedup of 2x on one agent step still leaves you paying full latency for the gap between agents: the network round trip from worker back to orchestrator, the orchestrator's own prefill, the dispatch to the next worker. Speeding up generation does not help when generation is not the bottleneck. ## What this means for system design The right response to all of this is not to abandon multi-agent architectures. There are problems they handle well, especially ones with genuine subtask independence (large document processing, multi-source research, parallel code review across files). The right response is to design with the inference profile in mind from the start. A few practical implications. Decompose only when the subtasks are genuinely parallelizable or when the role separation provides real safety guarantees. If you are decomposing for organizational clarity but the agents end up running sequentially anyway, you are paying multi-agent cost for single-agent behavior. Share context aggressively between agents instead of re-feeding it through fresh prompts. Some frameworks have started supporting shared message buses or pinned context windows that survive across agent calls. These reduce the prompt overhead multiplier substantially when used well. Run on infrastructure that holds up under the request profile multi-agent systems produce: many short generations with structured output, tight latency tails, and bursty parallelism. 
The throughput numbers that providers publish on long chat workloads are not the numbers you will see on this traffic, and the gap between providers widens as request size shrinks. Measure end-to-end latency from the user's perspective, not per-agent latency. A multi-agent system that looks fast on a per-step dashboard can still feel slow because the user sees the time from request to first visible output, not the time of the fastest individual step. Multi-agent designs trade compute for clarity. That trade is sometimes worth it. But the compute side of the trade is larger than most teams budget for when they pick up an agent framework, and the latency side is what kills the user experience when the underlying serving stack is not built for the request shape. If you are running multi-agent workloads and finding that latency or cost is the limit, the inference layer is usually where the leverage is. Faster generation, lower TTFT, and serving infrastructure that handles short structured outputs well are what make these patterns viable in production. That is the shape of the problem we are working on at General Compute. If you want to see how it changes the math on your own workload, the API is OpenAI-compatible and the docs are at [generalcompute.com](https://generalcompute.com). --- # Tool Calling Latency: The Bottleneck No One Talks About > Function calling looks simple on paper, but the latency budget of a tool-using LLM is dominated by short structured generations that most serving stacks are not optimized for. This is what actually makes tool calls feel slow. - Author: General Compute - Published: 2026-05-08 - Tags: agents, tool-calling, inference, latency - Canonical: https://www.generalcompute.com/blog/tool-calling-latency-the-bottleneck-no-one-talks-about --- Tool calling is supposed to be the easy part of building with LLMs. The model emits a JSON object, you route it to a function, the function returns a result, you feed that back. 
Every major serving stack supports it. Every model card claims it. And yet, when you actually wire up a tool-using agent and put it in front of a user, the experience is often noticeably worse than the plain chat experience with the same model. The agent feels heavy. Each turn takes longer than it should. The dead air between "user asked something" and "the agent did something" is uncomfortable.

Most of that comes from a part of the system nobody benchmarks: the latency of producing a tool call. Not the latency of the tool itself, but the inference time it takes the model to decide what tool to call and emit the structured arguments. This is short generation in a regime where most inference engines are slow, and where most published numbers do not apply.

This post is about why tool calling is harder on the serving stack than it looks, where the time actually goes, and how that shapes which agent designs are practical.

## Why tool calls are short generations

A typical tool call looks like this: the model is given a system prompt with tool definitions, the user message, and any prior tool results. It produces a small structured output, usually well under 100 tokens. Something like:

```json
{
  "tool": "search_orders",
  "arguments": {
    "customer_id": "C-2398",
    "status": "pending"
  }
}
```

That is maybe 30 to 50 tokens depending on how the schema is encoded. With a model that produces 100 tokens per second in steady state, the decode time for that call is around 300ms to 500ms. Add a TTFT of 400ms to 800ms and you are at roughly one second to produce a single tool call. Then the tool runs, the result comes back, and the model has to read it and produce the next thing. Maybe another tool call, maybe a final response.

If you have ever wondered why agents that use tools feel sluggish on the same model that streams chat at a comfortable pace, this is the reason. Chat hides latency by streaming output the user reads while it generates.
Tool calls cannot hide latency, because nothing is shown to the user until the call has completed and the result is back. The user is staring at a spinner during exactly the part of inference that serving stacks are worst at. ## The prefill is doing more work than you think The prompt for a tool-using model is not small. Tool definitions are verbose. A schema for a single moderately complex tool, with a description, parameter types, descriptions of each parameter, and example usage, is often 300 to 600 tokens. Real agents have multiple tools. A coding agent might have 8 to 15 tools defined, which puts the tool block alone at 3,000 to 8,000 tokens. On top of that, you have the system prompt that explains how the agent should behave, how to format calls, when to stop, what error states mean. Add the conversation history, prior tool calls, and prior tool results, and the prompt sent to the model on each step is regularly 5,000 to 20,000 tokens long. The output is 50 tokens. The ratio is wildly skewed toward prefill. This matters because prefill is where TTFT comes from. If your serving stack does not aggressively cache the prefix, every tool call pays the full prefill cost from scratch. On a 70B-class model with a 10k token prompt, that can be 600ms to 2 seconds just to start producing the first output token. The actual generation of the JSON object is the cheap part. Prefix caching is the obvious fix and it works, but it has to be set up correctly. The cache key is the exact token sequence of the prompt prefix. If your agent framework reorders messages, rewrites tool definitions on every call, or injects timestamps into the system prompt, your cache hit rate drops to zero without any obvious symptom. The model still works. It is just slow in a way that looks like the model is slow, when really the cache is being defeated by formatting churn upstream. 
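One way to guard against the formatting churn described above is to serialize the cacheable part of the prompt deterministically and log a fingerprint of it, so a changing prefix becomes visible. The sketch below is illustrative only: `build_prefix` and `prefix_fingerprint` are hypothetical helpers, the tool definitions are made up, and it assumes the serving stack keys its cache on the exact token sequence, so byte-identical text is a necessary but not sufficient condition for a hit.

```python
import hashlib
import json


def build_prefix(system_prompt: str, tools: list[dict]) -> str:
    """Serialize the cacheable part of the prompt deterministically."""
    # Sort tools by name and keys alphabetically so ordering never changes
    # between calls; fixed separators avoid whitespace differences.
    ordered = sorted(tools, key=lambda t: t["name"])
    tool_block = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return f"{system_prompt}\n{tool_block}"


def prefix_fingerprint(prefix: str) -> str:
    """Short hash to log per call; a changing value signals a defeated cache."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]


tools_a = [
    {"name": "search_orders", "parameters": {"status": "string", "customer_id": "string"}},
    {"name": "cancel_order", "parameters": {"order_id": "string"}},
]
# Same tools, but with a different list order and key order.
tools_b = [
    {"parameters": {"order_id": "string"}, "name": "cancel_order"},
    {"parameters": {"customer_id": "string", "status": "string"}, "name": "search_orders"},
]

system = "You are a support agent."
baseline = prefix_fingerprint(build_prefix(system, tools_a))
reordered = prefix_fingerprint(build_prefix(system, tools_b))
# Injecting a timestamp into the system prompt changes the prefix on every call.
churned = prefix_fingerprint(build_prefix(system + " Now: 2026-05-08T12:00:01Z", tools_a))

print(baseline == reordered, baseline == churned)  # prints True False
```

Volatile values such as timestamps are better placed after the stable prefix, for example in the latest user message, so they do not invalidate everything that precedes them.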
## Constrained decoding has a real cost Most production tool-calling setups use some form of structured output enforcement. JSON schema validation, grammar-constrained decoding, regex masks. These exist because models do not reliably produce well-formed JSON without them, especially smaller models, and especially under load when the sampling temperature drifts. Constrained decoding is not free. The standard approach is to compute a mask over the vocabulary at each step that allows only tokens consistent with the grammar, and then sample within that mask. Computing the mask requires walking the grammar state machine for every candidate token. Naive implementations do this on the CPU after each forward pass, which adds tens of milliseconds per token. On a 50 token output, that is one to three seconds of overhead on top of the model's actual generation time. Better implementations precompute mask tables, batch the grammar computation, and run it on the GPU alongside sampling. The state of the art adds maybe 1 to 5ms per token, which is acceptable. But many open source serving stacks are still in the naive regime, especially when used with custom schemas. If you are using JSON mode and feeling like the output rate is lower than the model's documented decode speed, this is probably why. There is a related problem with how schemas are translated into grammars. A loose schema that says "object with at least these fields" is faster to constrain than a tight schema that pins every field type and requires specific enums. People often write tighter schemas to push correctness onto the decoder, which is the right instinct, but it also makes decoding slower. The performance tax is usually not worth optimizing away unless you measure it. ## Why the model latency floor matters more for tools Chat applications have a graceful failure mode for slow inference. The user reads as the model writes. 
If TTFT is 800ms, the user might not even notice if the rest of the response streams smoothly. Tool calling does not have this property. The user sees nothing until the structured call is fully decoded, the tool runs, and either another call or a user-visible response is produced. This means the floor on perceived latency is much higher for tool-using interactions. If a chat response feels good at 800ms TTFT, a tool call that needs the same model to reach completion before any user feedback is at 800ms TTFT plus 500ms of decode plus tool execution plus another inference call to produce the final response, all before the user sees anything new. You are at three to five seconds of perceived wait time on a single tool turn, on a model that feels fast in chat. The product implications are concrete. UI patterns that work for chat do not work for tools without modification. You cannot stream a JSON tool call to the user, because partial JSON is not meaningful. The common workarounds are: - Show a "thinking" or "calling tool X" indicator the moment the model decides on a tool, even before arguments are complete. This requires the serving stack to surface the partial decode, which most do not by default. - Pre-decide which tool the model will call by using a smaller, faster model as a router, and only invoke the main model to fill in arguments. This adds complexity but cuts the perceived TTFT roughly in half for the common case. - Cache previous tool results aggressively so that re-asking is fast. This works for read-heavy workloads, less so for agents that do new things on every turn. None of these are exotic ideas, but they only become necessary when you have actually felt the latency. Teams that have only built chat tend to be surprised by how much of the agent UX problem is actually a serving problem. ## Concurrency and the long tail Tool-using workloads have a different concurrency profile than chat. 
A single user interacting with an agent generates a burst of inference calls within a few seconds, then a quiet period while the user reads the result and types a new message. Multiple users hitting the same endpoint produce overlapping bursts. If your serving stack is optimized for steady state throughput on long generations, it tends to handle these bursts poorly.

Continuous batching helps, but only if the requests fit cleanly into the batch shape. Short generations have a lot of variance in how many decode steps they need, which causes head-of-line blocking when one request in a batch needs 200 tokens and the others only need 30. The fast requests sit idle waiting for the slow one to finish a step before the batch advances.

This shows up as a long tail in the latency distribution. The p50 of tool call latency might be 800ms while the p99 is 4 seconds. For a single tool call this is annoying. For an agent that does 10 sequential tool calls, the chance of hitting at least one p99 event is already about 10% (1 - 0.99^10), and it passes 60% at 100 calls. A single 4 second step can outweigh several typical ones, so the agent's overall latency is shaped by the worst step in its sequence far more than the p50 of a single call suggests.

The right metric to track for tool calling is not p50 of single calls. It is p99 of single calls, or even better, p50 of full task completion across a representative agent workload. Most serving teams do not measure this because it requires running an actual agent, not a synthetic load generator.

## The benchmark gap

Public LLM serving benchmarks rarely measure any of this. The standard format is: 1k input tokens, 256 output tokens, single request, report tokens per second. This is a reasonable measurement for batch inference economics. It tells you almost nothing about how a model will perform inside a tool-using agent, where input is 8k tokens, output is 60 tokens, requests come in correlated bursts, and the prompt is mostly cacheable but only if you are careful.
A more honest benchmark for tool calling would specify:

- Input length distribution that matches real agent prompts (large system prompt with tool definitions, growing conversation history).
- Output length distribution skewed toward short structured outputs.
- A cache hit rate target, since that drastically changes the numbers.
- Concurrent request bursts rather than steady throughput.
- Constrained decoding overhead measured separately so it can be attributed.

This is more work than the standard benchmark, but it produces numbers that actually predict whether a serving stack will work for agents. Without it, you end up choosing inference providers based on long generation throughput and being surprised when your tool-using agent feels slow on a model that benchmarks well.

## What this means for building agents

The practical takeaway is that tool calling is a different inference workload from chat, and serving stacks that are good at one are not automatically good at the other. If your agent feels slow:

- Measure cached TTFT, not steady state throughput. That is the number that controls per-step latency.
- Verify that prefix caching is actually hitting. Stable serialization of the prompt across calls is the single highest leverage thing you can do.
- Profile constrained decoding overhead separately. If you see a gap between documented decode speed and observed speed during structured output, this is probably it.
- Look at p99 of single calls, not p50. Tail latency is what dominates multi-step task time.

None of this is glamorous work. It is plumbing. But it is what separates an agent that feels responsive from one that feels stuck, on the same model with the same prompts.
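To put a number on the tail-latency point in the list above, the probability that a multi-step task hits at least one slow call can be computed directly. This sketch assumes each call independently lands in the tail with a fixed probability (1% for a p99 event), which correlated bursty traffic may well violate; the function name is illustrative.

```python
def p_at_least_one_tail(n_calls: int, tail_prob: float = 0.01) -> float:
    """Probability that at least one of n independent calls is a tail event."""
    return 1.0 - (1.0 - tail_prob) ** n_calls


for n in (1, 10, 50, 100):
    print(n, f"{p_at_least_one_tail(n):.1%}")
```

Under these assumptions a 10-call task hits a p99 event roughly one time in ten, and a 100-call task hits one more often than not, which is why per-call p50 understates what users of long agent tasks experience.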
If you are building tool-using agents and the inference latency is what is making the experience worse than your chat product, [General Compute's API](https://generalcompute.com) is set up for the workload: short structured generations, high prefix cache hit rates, low TTFT, and predictable tail latency under bursty load. It is OpenAI compatible, so pointing an existing agent framework at our endpoint is usually a config change. The numbers that move when you do it are the ones the user actually feels. --- # The Agentic Inference Tax: Why Agents Need 10x Faster Models > Agents make many sequential LLM calls per task, and each one pays the full latency of decoding. This post walks through how that compounds and why fast inference changes which agents are even viable. - Author: General Compute - Published: 2026-05-07 - Tags: agents, inference, latency - Canonical: https://www.generalcompute.com/blog/the-agentic-inference-tax --- A chat application is a single LLM call followed by a stream of tokens to a user who is reading them. An agent is something else. It is a loop: think, call a tool, read the result, think again, maybe call another tool, eventually stop. Each pass through that loop is a separate forward pass through the model. The user sees one task. The system sees ten or twenty inference calls. This is the agentic inference tax. The model that felt fast enough for chat suddenly feels broken when you put it inside an agent loop, because every weakness in latency gets multiplied by the number of steps. A 2 second response time is fine when a person is reading the answer. It is a 30 second wait when the agent has to do 15 steps to finish a task. This post is about where that multiplier comes from, why the standard chat benchmarks miss it, and what changes when the underlying model gets meaningfully faster. ## A chat call versus an agent task In a chat, the cost structure is simple. You send a prompt, the model generates some output, you stream it. 
The user perceives two numbers: time to first token (TTFT) and tokens per second after that. If both are good, the experience is good. If TTFT is 400ms and the model puts out 80 tokens per second, a 200 token answer arrives in about three seconds, and the user starts reading well before generation finishes. An agent task does not look like that. The agent receives a goal, plans a step, generates a tool call, sends it to a tool, waits for the tool result, and feeds that result back into the next forward pass. In a typical ReAct loop the model often emits a short reasoning trace and then a structured call. None of those individual generations are long. Most are a hundred tokens or fewer. But each one pays the full cost of TTFT plus a small decode tail. And because the next step depends on the previous one, none of it parallelizes. If you have a model with 500ms TTFT and you do 10 sequential steps, you have just spent at least 5 seconds on TTFT alone, before counting decode time, tool execution, or any retries. In practice, a real agent task pays a lot more than that, because steps are not uniform. Some steps generate longer plans. Some require the model to read a large tool result and respond. Some get retried because the structured output failed validation. ## How the multiplier shows up in real workloads The cleanest way to see the tax is to instrument an agent and look at where the wall clock time goes. The general shape, across the agents I have worked with, looks like this: - 60% to 80% of total time is sequential LLM inference. - 10% to 30% is tool execution (HTTP calls, database queries, code execution). - The rest is overhead: serialization, retries, scheduler waits. In other words, the dominant cost is the LLM, not the tools, even when the tools themselves are not trivial. People often expect the opposite, because they think about the agent in terms of what it is doing in the world. 
But the agent spends most of its time generating the next sentence about what to do, not actually doing it.

A 10 step coding agent that uses a model with 60 tokens per second decode and 600ms TTFT might have this profile:

- 10 calls of TTFT: 6 seconds.
- 10 calls of decode at roughly 80 tokens per call: about 13 seconds.
- Tool execution averaged across calls: 4 seconds.
- Retry overhead and structured output reparsing: 2 seconds.

Total: around 25 seconds. The model itself accounts for 19 of those.

If you swap in a model with 200ms TTFT and 200 tokens per second decode, the same 10 steps cost 2 seconds of TTFT and roughly 4 seconds of decode. Now total task time is closer to 12 seconds. Same agent, same prompts, same tools. Half the wall clock.

That is the multiplier in action. A 3x improvement in the model's per-call latency turns into a 2x improvement in end-to-end task time, which is the number that actually matters to whoever is waiting.

## Why the existing inference benchmarks miss this

Most published benchmarks measure throughput on long generations. A common setup is to send a 1k token prompt and ask the model to produce 256 or 512 output tokens, and report tokens per second across batch sizes. This is fine for measuring batch serving economics. It is not fine for measuring agent feasibility.

Agent calls are short. A tool call is often 30 to 80 output tokens. A planning step is usually under 200. The model spends a much larger fraction of its time inside the prefill and the first few decoded tokens, where most engines are underutilized and where TTFT dominates. A model that does 300 tokens per second in steady state but takes 800ms to start producing the first token will look great on long-generation benchmarks and feel terrible inside an agent loop.

This is also where prefix caching matters more than people realize.
If your agent reuses a long system prompt across every step, and your serving stack rebuilds the KV cache from scratch each time, you are paying the prefill cost on every loop iteration. The right number to measure is "cached TTFT," the time to first token when the system prompt is already in cache. For agentic workloads, the gap between cached and uncached TTFT can be the difference between a 10 second task and a 60 second task. ## The retry problem Agents retry. This is not a bug, it is a property of how they work. The model sometimes generates malformed JSON. It sometimes calls a tool with the wrong arguments. It sometimes proposes a plan that fails its own self-check. The agent framework catches these and asks the model to try again. In a slow inference setting, retries are catastrophic. If your base case is 25 seconds and you have to retry one step, you are now at 30 seconds. Retry two steps and you are at 35. The agent that worked in evals starts feeling unusable in production, because production has a wider distribution of inputs and the tail of retries shows up. Faster inference does not eliminate retries. It changes the cost of retrying. With a fast enough model, the agent can afford to be more aggressive: generate two candidate plans and pick the better one, validate every tool call before executing it, run a self-critique step. Each of those is another LLM call, which means each one adds latency. If a single call costs 200ms instead of 2 seconds, those extra calls become affordable. There is a useful reframing here. Slow inference forces you to design agents that are minimal: as few steps as possible, no double-checking, no parallel exploration. Fast inference lets you design agents that are robust: more steps, more validation, more retries when something looks off. The set of feasible architectures changes with latency. ## Voice agents and the 500ms ceiling Voice agents are the clearest case where the inference tax becomes a hard constraint. 
Conversational turn taking expects a response within roughly 500ms to feel natural. That budget has to cover everything: ASR finalization, the LLM call, possibly a tool call, TTS synthesis, and audio playback startup. If your LLM TTFT is 600ms, you have already missed the budget before the model has produced anything. The voice agent will feel laggy no matter how good the rest of the stack is. This is why voice deployments often resort to small models, aggressive prompt caching, and parallel speculative paths: the latency budget cannot be met any other way. For multi-turn voice agents that do tool calls, the tax compounds again. A user asks for the weather, the agent has to plan, call the weather API, and respond. Even a simple two-step agent has to fit two LLM calls plus a tool call inside the user's perceived response time, or you start hearing dead air. With 200ms TTFT this is achievable. With 1 second TTFT it is not. ## Browser and code agents Browser agents and code agents have a different latency profile but the same structure. A browser agent loads a page, observes the DOM, decides what to click, clicks, waits for the page, observes again. A code agent reads files, decides what to edit, applies the edit, runs tests, reads output, decides the next step. In both cases, the user is willing to wait longer than they would for chat. A 30 second task is fine. A two minute task starts feeling slow. A five minute task often gets abandoned. The reason fast inference matters here is not that any single step has to be sub-second. It is that the number of steps the agent can afford grows with how fast each step is. A code agent that runs at 2 seconds per step is capped at maybe 30 steps before users give up, which limits the size of the task it can handle. A code agent that runs at 400ms per step can handle 100 steps in the same wall time, which is the difference between fixing a typo and refactoring a module. 
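The relationship between per-step latency and task size in the paragraph above is simple division. A minimal sketch, assuming a fixed wall-clock budget the user will tolerate and a uniform latency per step (both simplifications); `affordable_steps` is a hypothetical helper, and integer milliseconds are used to avoid floating point surprises in the division.

```python
def affordable_steps(budget_ms: int, step_latency_ms: int) -> int:
    """Whole number of sequential agent steps that fit in a wall-clock budget."""
    return budget_ms // step_latency_ms


# 30 steps at 2 seconds per step is the slow case from the text: a 60 s budget.
budget_ms = 30 * 2_000

print(affordable_steps(budget_ms, 2_000))  # prints 30
print(affordable_steps(budget_ms, 400))    # prints 150
```

At 400ms per step the same 60 second budget fits 150 steps, so the 100 steps mentioned above still leave headroom for tool execution and retries.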
This is the deeper version of the inference tax: it does not just make agents slower. It makes some agent designs impossible. The product surface that you can build is constrained by the latency of the underlying model, not by the model's quality. ## What to measure If you are building agents, the model benchmarks worth tracking are not the same as the chat benchmarks. The ones that matter: - **Cached TTFT.** Time to first token when the system prompt is already in the KV cache. - **Short generation latency.** Total time to produce 50, 100, and 200 tokens. This is what each agent step actually looks like. - **Structured output latency.** Time to produce a valid JSON tool call, including any decoding constraints. Some serving stacks pay a real cost here. - **Concurrent step latency.** What happens to TTFT when N agent loops are running against the same endpoint. Throughput per agent matters as much as throughput per second. The standard "tokens per second on a 512 token completion" number tells you almost nothing about whether a model will work inside an agent. ## What changes at 10x A 10x speedup in agent step latency does not mean agents become 10x faster end to end. Tool execution and other fixed overhead set a floor. But it changes which patterns are practical: - Self-verification on every step becomes cheap. - Parallel candidate generation, where the agent produces two or three plans and picks the best, fits inside the same wall clock budget as a single plan today. - Long-horizon agents that take 50 to 100 steps stop being a research curiosity and start being shippable. - Voice agents stop having to choose between fast and capable. Most of the agent architectures in the literature were designed under the assumption that LLM calls are slow and expensive. As that assumption changes, the design space opens up. The agents that ship in two years will not look like ReAct loops with three retries. 
They will be wider, deeper, and more redundant, because the cost of being wrong is no longer measured in seconds of dead air. If you are running agentic workloads and the inference latency is what is bottlenecking your design, [General Compute's API](https://generalcompute.com) is built for short, sequential calls with aggressive prefix caching and low TTFT. It is the workload we optimized for. Pointing your agent at our endpoint is usually a few lines of config, and the wall clock difference shows up immediately. --- # Compiler-Level Optimizations for Inference: TorchInductor, Triton, XLA > How modern ML compilers turn Python model code into fused, fast kernels. A practical look at TorchInductor, Triton, and XLA, and the tradeoffs each one makes for inference. - Author: General Compute - Published: 2026-05-06 - Tags: inference, compilers, deep-dive - Canonical: https://www.generalcompute.com/blog/compiler-level-optimizations-for-inference --- If you have ever profiled a transformer forward pass, you have probably noticed that the model spends a surprising amount of time doing nothing useful. A small reshape here, a kernel launch there, a memory copy because two operators disagree on layout. The math is fine. The problem is everything around the math. Compilers exist to fix this. TorchInductor, Triton, and XLA all sit between high level model code and the hardware, and all three try to remove the same kind of waste: unnecessary launches, unnecessary memory traffic, and unnecessary precision. They take different paths to get there. This post walks through what each one does, where they overlap, and what an inference engineer should actually expect when they enable them. ## What "compiler" means in this context There are two compilers in any deep learning stack. There is the one that ships with the GPU vendor (nvcc for CUDA, ROCm's compiler for AMD), which turns C++ kernel code into machine instructions. 
Then there is the higher level ML compiler, which turns a graph of operators into a sequence of those kernels. When people say "compile your model," they almost always mean the second one. The ML compiler has three jobs: 1. **Trace the graph.** Capture the operations the model is doing, including their shapes and dtypes, into a representation it can manipulate. 2. **Rewrite the graph.** Fuse operators together, eliminate dead code, pick layouts, and choose algorithms that match the hardware. 3. **Generate kernels.** Emit code that the GPU vendor's compiler can compile down to actual instructions. Where the three frameworks differ is in how aggressively they do each step, and how much escape hatch they give you when their defaults are wrong. ## TorchInductor: the default that ships with `torch.compile` TorchInductor is the backend behind `torch.compile` in PyTorch 2.x. When you write `model = torch.compile(model)`, Inductor is what runs. Its design choice is to lean on Triton for kernel generation rather than reinventing CUDA codegen. Inductor takes a TorchDynamo-traced FX graph, lowers it through its own intermediate representation, applies a long list of fusion and simplification passes, and then emits Triton kernels for GPU and C++ kernels for CPU. The Triton kernels handle the parts that benefit from autotuning. The C++ side handles glue code and reductions that are easier to express in scalar form. For inference, the wins come from a few places: - **Pointwise fusion.** Activations like SiLU after a matmul, residual adds, layernorm scaling: all collapse into a single kernel. A typical decoder block might go from 15 launches to 4 or 5. - **Reduction fusion.** Softmax, layernorm, and RMSNorm fuse with whatever pointwise operations sit on either side, which means the intermediate tensors never leave registers. - **Buffer reuse.** The IR tracks which tensors are still needed, and reuses memory aggressively. 
For long-context inference where activations are huge, this matters.
- **Autotuning.** For matmul-shaped operations on supported configs, Inductor will benchmark a handful of Triton configurations at compile time and pick the best one for your shape.

The catch is the compile time itself. The first call into a compiled model can take 30 seconds to several minutes, especially with autotuning enabled. For batch inference servers with stable shapes this is fine, since it amortizes. For agent loops with variable input lengths it can be painful, because shape changes trigger recompilation. The cure is `dynamic=True`, which tells the compiler to trace with symbolic shapes and generate kernels that work across a range of sizes instead of specializing on every concrete shape, but you give up some peak throughput in exchange.

The other catch is that Inductor still relies on the operator library underneath it. If you are calling FlashAttention through `torch.nn.functional.scaled_dot_product_attention`, Inductor does not generate the attention kernel itself. It dispatches to the FlashAttention implementation that PyTorch ships, and your speedup from compilation comes from everything around the attention call, not the call itself.

## Triton: the kernel language that everyone is building on

Triton is the layer below Inductor, but it is also a language people use directly. It was designed to be the middle ground between writing CUDA by hand and waiting for a compiler to generate something good.

You write Python that looks like NumPy but operates on blocks of values. Triton's compiler turns those blocks into the warp-level scheduling, shared memory layouts, and load patterns that a CUDA programmer would otherwise tune by hand.
A simple Triton kernel looks like this:

```python
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    # Each program instance handles one block of BLOCK contiguous elements
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    # Mask off the tail so the last block does not read or write past n
    mask = offsets < n
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)
```

That looks trivial, and for elementwise ops it is. The reason Triton matters is that the same style scales up to attention kernels, fused MoE routing, quantized matmuls, and custom paged-KV operations. FlashAttention-style fused attention kernels (including the one in Triton's own tutorials), vLLM's fused MoE kernels, and a large fraction of the custom kernels in modern inference servers are written in Triton. The reference FlashAttention 2 implementation and vLLM's original paged attention kernel are still CUDA, but Triton ports of both kinds of kernel exist.

For inference work, Triton hits a particular sweet spot. You can prototype a fused kernel in a day, autotune it across a handful of block sizes, and ship something that gets within 10 to 20% of a hand-tuned CUDA kernel. That gap matters at frontier scale, but for most teams the speed of iteration is more valuable than the last 15% of throughput.

Triton has limits. It does not support the full range of warp-specialized features that very recent NVIDIA hardware exposes (TMA, async copies in their newest forms, certain Hopper features), and that gap reopens whenever a new GPU generation lands. The Triton team usually catches up within a few months, but if you need day-zero performance on a brand new accelerator, you are still writing CUDA.

## XLA: the graph-first approach

XLA started inside TensorFlow and is now the compiler underneath JAX, the TPU stack, and several other projects. It takes a different philosophy from Inductor. Instead of fusing operators opportunistically based on local patterns, XLA wants the entire computation handed to it as a static graph, and then it does whole-program optimization.

The XLA pipeline goes:

1. **HLO (High Level Operations).** A small set of well-defined operations like `dot`, `reduce`, `gather`, `dynamic-slice`. The frontend lowers your model into HLO.
2. **Optimization passes.** Algebraic simplification, layout assignment, loop fusion, memory scheduling, sharding propagation.
3. **Code generation.** For TPU, an optimized backend that knows the chip's matrix unit, scratchpad layout, and async DMA patterns. For GPU, a backend that emits LLVM IR which is then compiled to PTX.

What you get from XLA on the right workload is striking. On TPUs, where there is no real alternative, XLA is the only path to performance. On GPU, JAX-with-XLA can match or beat eager PyTorch by a wide margin for workloads where the graph is fully traceable. We have seen 2x to 4x improvements on dense decoder models when the input shapes are static and batched.

The trade-off is rigidity. XLA assumes the graph is known up front. Variable shapes turn into recompilations. Control flow has to be expressed through `lax.cond` or `lax.scan`, not Python `if` statements. Dynamic KV caches, which are everywhere in inference, force you into either padding to a maximum length or using `dynamic-update-slice` carefully to avoid blowing up the compiled program size. JAX's `jit` machinery handles a lot of this for you, but the rough edges show up the first time you try to serve a model with variable batch and variable sequence length.

For pure inference servers that handle one model with predictable batch sizes, XLA on GPU is competitive. For interactive workloads where requests show up with arbitrary lengths and the server has to handle them efficiently, the dynamic compilation cost usually pushes teams toward Inductor or a custom Triton-based stack.

## Where these compilers actually overlap

In practice, modern inference stacks are not built around one compiler. They use whichever one is best for each layer:

- **Operator-level kernels.** Triton for most new kernels, hand-written CUDA where it still wins.
FlashAttention, paged attention, fused MoE, quantized matmuls.
- **Graph-level fusion and scheduling.** Inductor for PyTorch deployments, XLA for JAX or TPU deployments.
- **Vendor primitives.** cuBLAS, cuDNN, and CUTLASS still handle the heavy matmuls when their kernels beat what the compiler generates. Inductor and XLA both know how to call into them.

The interesting part is that the stack between your model code and the metal is no longer a single tool. It is a chain. PyTorch traces the graph with Dynamo, Inductor lowers it and decides which parts to fuse, Triton generates the fused parts, and the GPU vendor compiles Triton's output. Each link in the chain has its own performance characteristics, and a regression in any of them shows up at the top.

## What this means for an inference engineer

A few practical takeaways from running these compilers in production:

- **Always measure.** `torch.compile` does not always make things faster. For very small models or models where attention dominates and is already calling FlashAttention, the speedup can be small or negative. Profile before and after.
- **Watch for recompilations.** Both Inductor and XLA recompile on shape changes by default. A serving loop that sees ten different sequence lengths can spend more time compiling than running. Use shape buckets or dynamic shapes deliberately.
- **Triton is a power tool.** When the compiler does not fuse the way you want, writing the kernel yourself in Triton is no longer exotic. The barrier to entry is much lower than CUDA, and the resulting kernels integrate cleanly with Inductor and PyTorch.
- **XLA shines for static workloads.** If you are running batch inference at fixed shapes, JAX plus XLA is genuinely fast and worth evaluating. For online serving, the dynamic shape story is harder.
- **Compiler choice is not free.** Each stack has its own debugging story, its own version churn, and its own failure modes.
Picking one means investing in tooling and people who can read its IR when things go wrong.

The headline result is that compilers have closed a real gap between naive Python model code and hand-tuned CUDA. They are not magic. Stacks that lean on them still spend a lot of engineering time profiling, writing custom kernels, and tracking compiler updates. The difference is that the floor has moved up. A team that runs `torch.compile` on a well-structured model today gets performance that took a kernel specialist a quarter of work to achieve three years ago.

If you are building inference infrastructure and want to compare your stack against something that has spent a lot of time on this exact problem, [General Compute's API](https://generalcompute.com) runs every step of the chain we just described, on hardware tuned for it. Try it in a few lines of code and see how the numbers compare.

---

# Draft Model Selection for Speculative Decoding

> Picking a draft model is the most consequential decision when deploying speculative decoding. A practical guide to acceptance rates, sizing, and the tradeoffs that decide whether you actually get a speedup.

- Author: General Compute
- Published: 2026-05-05
- Tags: inference, speculative-decoding, deep-dive
- Canonical: https://www.generalcompute.com/blog/draft-model-selection-for-speculative-decoding

---

We have written about [vanilla speculative decoding](/blog/speculative-decoding-3x-speedups-without-changing-the-model) and the [next generation of speculative decoding methods](/blog/medusa-eagle-sequoia-next-gen-speculative-decoding). The papers describe the algorithms, but they tend to gloss over the part that actually decides whether your deployment gets a 3x speedup or a 1.1x speedup: which draft model you pick.

Choosing a draft model looks simple on paper. Pick something smaller than your target. Run it. Verify in bulk.
In practice, the choice involves at least four trade-offs, and getting any of them wrong wastes most of the potential gain. This post is the practical guide we wish someone had handed us when we first put speculative decoding into production.

## The basic math: speedup is acceptance rate times pass count

The speedup from speculative decoding is roughly:

```
expected_speedup ≈ (1 + α + α^2 + ... + α^k) / (1 + k·c)
```

Where `α` is the per-token acceptance rate, `k` is the number of draft tokens proposed per round, and `c` is the cost of one draft forward pass relative to one target forward pass. The numerator is how many tokens you get per target forward pass on average. The denominator accounts for the fact that drafting itself is not free: each round pays for `k` draft passes on top of the one target pass.

This formula is worth staring at for a minute, because it makes every selection trade-off concrete. A draft model with 90% acceptance and 5% draft cost crushes a draft model with 70% acceptance and 1% draft cost, even though the second one is much smaller. A draft model with 95% acceptance that costs 30% of a target pass loses to a 75% acceptance draft model that costs 4% of a target pass.

There is no single correct answer. The correct answer depends on what acceptance rate you can actually achieve and how cheap your draft is compared to the target.

## Trade-off 1: size

The most common rule of thumb is "draft should be 10x to 30x smaller than the target." That is roughly correct, but it hides what is really going on.

What you want is a draft that is cheap enough to run that even modest acceptance rates are profitable. On a Llama 3.1 70B target, a Llama 3.2 1B draft typically runs in around 5% of the target's forward pass time, so even an acceptance rate of 60% gives a meaningful speedup. A 7B draft might hit 80% acceptance, but it costs 12 to 15% of the target pass, and the trade-off often comes out worse.

There is also a hard floor.
Below about 500M parameters, draft quality on real prompts (especially code, chain of thought, structured output) drops off a cliff. The acceptance rate falls into the 30 to 50% range, the speedup collapses, and you would have been better off without speculation at all. TinyLlama 1.1B is roughly the smallest model worth using as a general-purpose draft for production traffic.

The practical sizing window for general-purpose draft models in 2026 is 1B to 3B parameters when the target is 30B or larger. Below 30B, the draft cost becomes a much bigger fraction of the target pass and you need to be more careful.

## Trade-off 2: vocabulary and family alignment

This one bites people who try to mix and match models. Speculative decoding requires the draft and target to share a tokenizer. If they tokenize differently, you have to translate proposed tokens between vocabularies, and the verification step gets messy. Most production deployments avoid this by sticking to drafts and targets from the same model family.

Beyond tokenizer alignment, family alignment also matters for acceptance rate. A Qwen 2.5 1.5B draft for a Qwen 2.5 72B target hits 75 to 85% acceptance on most prompts, because both models were trained on overlapping data with similar objectives. A Llama 3.2 1B draft for a Qwen 2.5 72B target, even after retokenization tricks, tops out around 50 to 60% because the two models disagree about token distributions in subtle but consistent ways.

The general guideline:

1. Same family, same generation: best acceptance rate. Use this when available.
2. Same family, different generation (e.g. Llama 3.2 draft with Llama 3.1 target): usually fine, expect a few percentage points lower acceptance.
3. Different families: only when forced. The drop in acceptance is rarely worth it.

## Trade-off 3: distillation

For a long time, the conventional wisdom was that you should distill your draft model from your target. The intuition makes sense.
A distilled draft has been trained to mimic the target's exact output distribution, so the acceptance rate should be higher than an off-the-shelf small model.

In practice, the gain from distillation is real but smaller than people expect, usually 5 to 10 percentage points of acceptance rate. That is enough to be worth doing if you serve at scale, but not enough to bother with for most deployments. The cost is that you now have a custom draft model that needs to be retrained every time your target model changes, which in 2026 is a meaningful operational burden.

The exception is domain-specialized serving. If you serve mostly code, or mostly customer support chats, or mostly structured tool calls, distilling a draft model on traffic from your domain pushes acceptance rates into the 90% range. At that point the trade-off shifts. We have seen production code-completion deployments where a 1B distilled draft hits 92 to 94% acceptance against a 32B target, which is hard to beat with any off-the-shelf model.

## Trade-off 4: quantization of the draft

Most people quantize the target model and forget about the draft. This is a mistake. The draft model's forward pass cost shows up directly in the speedup formula, and quantizing the draft (FP8 or INT4) cuts that cost roughly in half with minimal acceptance rate loss.

The reason quantization is safer on the draft than on the target is that you do not actually need the draft to be accurate. You need it to propose tokens that the target will accept. Even if INT4 quantization shaves a few points off the draft's standalone perplexity, the verification step catches any divergent tokens, so the only cost is a slightly lower acceptance rate. In our experience that cost is usually 2 to 4 percentage points, while the latency savings are 30 to 50%.

If your target is FP8 or BF16 and your draft is also full precision, you are leaving easy speedup on the table. Quantize the draft.
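To put rough numbers on the quantization trade-off, the sketch below evaluates the speedup estimate for a full-precision draft and a quantized one. It assumes `c` is the cost of one draft pass relative to one target pass, so a round of `k` draft tokens costs `1 + k * c` target-pass equivalents. The function name `expected_speedup` and the inputs (acceptance dropping from 80% to 77%, draft cost dropping from 10% to 6%) are illustrative picks inside the ranges quoted above, not measurements.

```python
def expected_speedup(alpha: float, k: int, c: float) -> float:
    """Estimated speedup of speculative decoding over plain decoding.

    alpha: per-token acceptance rate (0 <= alpha < 1)
    k:     draft tokens proposed per round
    c:     cost of one draft pass relative to one target pass (assumption)
    """
    # Expected tokens committed per target pass: 1 + alpha + ... + alpha^k
    tokens_per_round = sum(alpha ** i for i in range(k + 1))
    # One round costs k draft passes plus one target pass
    round_cost = 1.0 + k * c
    return tokens_per_round / round_cost


# Illustrative numbers only: quantizing the draft costs 3 points of
# acceptance and removes 40% of the draft's latency.
full_precision = expected_speedup(alpha=0.80, k=4, c=0.10)
quantized = expected_speedup(alpha=0.77, k=4, c=0.06)

print(f"full-precision draft: {full_precision:.2f}x")
print(f"quantized draft:      {quantized:.2f}x")
```

With these inputs the quantized draft comes out ahead, and the gap grows as the draft gets more expensive relative to the target. Rerun it with your own measured `alpha` and `c` before drawing conclusions.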
## How to actually measure your setup

Two numbers tell you almost everything:

1. **Acceptance rate (α)**: the fraction of draft tokens that survive verification. Measure on real production traffic, not on benchmark prompts. Acceptance rate on MMLU-style multiple choice can be 20 percentage points higher than acceptance rate on free-form chat. Use what you actually serve.
2. **Mean accepted length per round**: how many tokens you commit per target forward pass on average. This is the metric your latency depends on. With `k` draft tokens proposed, mean accepted length is `(1 - α^(k+1)) / (1 - α)`. The marginal benefit of more draft tokens decreases fast as α drops.

If your acceptance rate is below 65%, you have probably picked the wrong draft model. Go look at where rejections happen. Are they early in the response (which suggests a tokenizer or prompting mismatch) or late (which suggests the draft is fine for short patterns but loses coherence on longer continuations)?

If your acceptance rate is above 90% and you are still not seeing the speedup you expected, your draft pass is too expensive. Quantize it, shrink it, or look at whether you have set k too high (proposing 8 tokens when you only ever accept 3 wastes draft compute).

## Picking k, the number of proposed tokens

The right value of k depends on α. As a rough guide:

- α around 60%: k = 3 or 4
- α around 75%: k = 4 or 6
- α around 85%: k = 6 or 8
- α above 90%: k = 8 or higher, sometimes a tree structure helps more than a linear chain

If you are using EAGLE-2 or Sequoia, the algorithm picks the tree structure for you based on confidence, so you mostly stop worrying about k as a fixed parameter. For vanilla speculative decoding with a draft model, picking k is still a manual tuning step.

## When to skip speculative decoding entirely

Speculative decoding helps in latency-bound serving (low concurrency, single-user requests, voice agents, autocomplete).
It helps less, and sometimes hurts, in throughput-bound serving (large batch sizes, offline inference, batch jobs).

The reason is that speculative decoding fundamentally trades extra compute for fewer sequential dependencies. When you are batching 64 requests, the GPU is already saturated on compute for every forward pass. Adding speculation does not buy you parallelism you did not already have, and the verification overhead can actually slow things down.

Rule of thumb: if per-token decode latency matters more to you than tokens-per-dollar, speculative decoding is probably worth it. (It speeds up the decode phase, not prefill, so it does little for time to first token.) If you are running offline summarization on millions of documents and tokens-per-dollar is the only metric, it usually is not.

## What this looks like at General Compute

The reason draft model selection matters so much on GPU is that the target forward pass is slow. When a single decode step takes 70 milliseconds, every additional token you can squeeze out of that pass is worth real money. The whole speculative decoding ecosystem exists because GPUs are bandwidth-bound on autoregressive workloads.

General Compute serves on inference-optimized ASICs. The target forward pass is already fast, which changes the math on speculation. The savings per accepted token are smaller in absolute terms, but the latency floor is lower to start with, and techniques like speculative decoding still compound on top. In practice we see customers run smaller drafts (often 1B class) and lean harder on prefix caching and disaggregated prefill, because once the target is fast, the marginal value of speculation is bounded by how much draft cost you can amortize.

If you are picking a draft model right now, the short version is: same family, 1B to 3B parameters, FP8 quantized, and measure acceptance rate on real traffic before you tune anything else. Get those four right and you will capture most of the available speedup.
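The measurement step can be sketched in a few lines. This assumes you log, for every speculation round, how many of the `k` proposed draft tokens the target accepted; the `accepted_per_round` list and the function name `summarize_rounds` are hypothetical, and your serving stack may expose these counters differently. Note that the raw fraction below counts draft tokens discarded after the first rejection as rejected, so it reads lower than a per-token acceptance probability would.

```python
from statistics import mean


def summarize_rounds(accepted_per_round: list[int], k: int) -> dict[str, float]:
    """Summarize speculative decoding logs (hypothetical log format).

    accepted_per_round: draft tokens accepted in each round, each in 0..k
    k: draft tokens proposed per round
    """
    if not accepted_per_round:
        raise ValueError("need at least one round")
    if any(a < 0 or a > k for a in accepted_per_round):
        raise ValueError("accepted counts must be between 0 and k")

    proposed = k * len(accepted_per_round)
    return {
        # Fraction of proposed draft tokens that survived verification
        "acceptance_rate": sum(accepted_per_round) / proposed,
        # Each target pass also yields one token of its own (the correction
        # after a rejection, or a bonus token when everything is accepted)
        "tokens_per_target_pass": mean(accepted_per_round) + 1,
    }


# Example: five rounds with k = 4 draft tokens each
stats = summarize_rounds([4, 2, 3, 0, 4], k=4)
print(stats)
```

Splitting the same summary by traffic type (chat, code, tool calls) is a cheap way to see where a draft model is weak.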
[Sign up at generalcompute.com](https://generalcompute.com) and get $200 in free credit to try inference where speculative decoding stops being the only thing keeping latency bearable.

---

# The Attention Sink Phenomenon: Why the First Token Matters

> How attention concentrates on the first few tokens of every sequence, why naive sliding-window caching breaks long-context generation, and how StreamingLLM uses sink tokens to serve effectively unbounded streams.

- Author: General Compute
- Published: 2026-05-04
- Tags: attention sinks, streamingllm, long context, kv cache, inference, transformers
- Canonical: https://www.generalcompute.com/blog/the-attention-sink-phenomenon-why-the-first-token-matters

---

If you visualize the attention weights of a decoder-only transformer halfway through a long generation, you see a strange pattern. Most of the probability mass goes where you would expect, onto the recent tokens and a handful of semantically relevant earlier tokens. But a surprisingly large share, often 30 to 50 percent, lands on the very first tokens of the sequence. The model is paying intense attention to a "BOS" token and the opening words of the prompt, even when those tokens have nothing to do with what is being generated right now.

This is the attention sink phenomenon. It was named and characterized by Xiao et al. in the StreamingLLM paper (2023), but anyone who has poked at attention maps on a long-running model has probably seen it. The pattern is consistent across models, across layers, and across input distributions, which makes it more than a curiosity. It is a structural property of softmax attention, and it has direct consequences for how you serve LLMs in long-context and streaming settings.

This post walks through what attention sinks are, why they exist, what breaks when you ignore them in a serving system, and how StreamingLLM uses them to enable effectively unbounded generation without retraining the model.
## The pattern in the attention maps

Take a Llama-style model and feed it a long passage. Then, during decoding, look at the attention weights from any given layer to all previous tokens. You will see three rough bands:

1. A spike on the first one to four tokens of the sequence, regardless of what those tokens contain.
2. A more diffuse band of moderate weights on the most recent tokens, the ones in the local context window.
3. Lower, scattered weights on tokens in between, with a few peaks corresponding to semantically related words.

The first band is the surprising one. The model is spending a real fraction of its attention budget on tokens that are not semantically related to the current generation step. If the prompt starts with "The following is a transcript of a customer support call," and the model is now 30,000 tokens deep into the call, those opening words still get heavy attention weight. The model is not retrieving information from them. It is using them as a sink.

The behavior is most pronounced in middle layers. Early layers attend more locally. The deepest layers also attend somewhat locally. But somewhere in the middle of the stack, you see this strong pull toward the first tokens, layer after layer.

## Why this happens

The mechanical reason is softmax. Self-attention computes attention weights as `softmax(Q K^T / sqrt(d))`, and softmax forces the weights to sum to one over the keys. The model cannot choose to attend to nothing. If there is no semantically relevant content elsewhere in the sequence, the attention head still has to put its weight somewhere.

Tokens at the very start of the sequence end up serving as the default destination for "I do not need to attend anywhere specific." The first token is visible to every position in the sequence because of causal masking, so every query can see it.
During training, the model learns that putting excess attention there is harmless, since those tokens already encode generic information about the start of the input. Over time, this becomes a stable equilibrium. Heads that do not need to retrieve information at a given step learn to dump their attention onto the initial tokens.

You can think of it as a pressure-relief valve. The softmax weights must sum to one, but heads do not always have meaningful work to do. The first tokens absorb the leftover probability mass.

This is why removing the first tokens is so destructive. The model's attention budget is calibrated around the assumption that those tokens are present and absorbing extra mass. If you remove them, the softmax has to redistribute that mass onto other tokens. Now the heads that were silently sinking attention into the BOS token are loudly attending to whatever else is in the window, and that injects noise into the residual stream. Quality collapses fast.

## Why sliding-window caching does not just work

The motivation for caring about attention sinks is practical. KV caches grow linearly with sequence length, and at long contexts, the cache eats most of your GPU memory. A single Llama 3 70B request at 128K tokens uses tens of gigabytes of KV cache. If you want to serve indefinitely long streaming sessions (voice agents, persistent assistants, very long documents), you eventually have to evict tokens from the cache.

The simplest eviction policy is a sliding window. Keep the last N tokens, drop everything older. This is what classical RNNs and many older transformer variants approximate. For a transformer, it would seem natural: keep a window of size 4096, and as new tokens arrive, drop the oldest one to maintain the window.

If you actually do this on a pretrained transformer at inference time, the model breaks. Perplexity climbs from a healthy single-digit number into the dozens or hundreds as soon as the window starts evicting the initial tokens.
Generation degrades into incoherent text within a few hundred steps after the first eviction.

This is the attention sink at work. The moment you drop those first few tokens, every middle-layer head that was sinking attention into them has nowhere to put its excess mass. The redistribution corrupts the hidden states, and the model loses coherence.

## What StreamingLLM actually does

The StreamingLLM fix is small and almost embarrassingly simple. Keep the first few tokens, always. Then maintain a sliding window of recent tokens after that. The KV cache contains:

```
[sink_tokens (e.g., 4 tokens)] + [recent_window (e.g., 4092 tokens)]
```

The sink tokens are never evicted. The recent window slides as generation continues, dropping the oldest non-sink tokens to make room for new ones. Total cache size stays bounded.

That is the entire algorithm. The reported results are striking: with as few as four sink tokens preserved, models like Llama 2 and Pythia maintain stable perplexity over generations of more than four million tokens. Without the sink tokens, the same models collapse within thousands of steps.

A few details matter for getting this right in a real system:

**Position encoding.** The model was trained with absolute or relative positions that grow linearly with sequence length. If you naively keep the original positions, the recent window's positions can exceed what the model saw during training, and rotary embeddings (RoPE) start producing out-of-distribution values. StreamingLLM re-encodes positions within the cache: with k sink tokens and a window of W tokens, the sink tokens stay at positions 0..k-1, and the recent window is mapped to positions k..k+W-1, regardless of how far into the stream you are. The model only ever sees positions inside the trained range.

**Number of sink tokens.** Four is a common choice and works well in practice. One sink token works on some models but not all. The marginal benefit drops off quickly past four.
The exact right number depends on how concentrated the attention sink behavior is in the model you are using.

**What to use as sink tokens.** The original tokens of the prompt work. A small set of dummy tokens prepended at training time works better but requires retraining. For most deployments, just keeping the literal first few tokens of whatever the model saw is fine.

## Implications for serving systems

Attention sinks change a few things about how you architect a long-context inference stack.

For batched serving with paged KV caches (the vLLM / SGLang style of system), you can implement StreamingLLM as an eviction policy on top of the page table. Instead of evicting the least-recently-used pages, you mark the first few pages as pinned and evict only from the rest. This composes naturally with continuous batching.

For streaming voice and chat, the practical effect is huge. You no longer need to truncate or summarize the conversation history to keep the cache bounded. You keep the first few tokens of the system prompt, slide a window over recent turns, and let the conversation run for hours without re-ingesting context or paying for an unboundedly large KV cache. Latency stays flat instead of growing with conversation length.

For document processing, the calculus shifts a bit. If the document is a single coherent piece and you need to attend to its middle, sliding-window approaches throw away information that may matter. Sinks help with stability, not with global recall. For tasks where the model legitimately needs to retrieve information from the middle of a 200K-token document, you still want full attention over the whole context, with techniques like Ring Attention or chunked prefill carrying the load.

The clean use case for streaming-with-sinks is sequential dialogue. The model only needs the recent context plus the framing tokens at the start. That is exactly what a long voice conversation or persistent agent session looks like.
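A minimal sketch of the sink-plus-window policy, under simplifying assumptions: it tracks token indices rather than real key/value tensors, evicts one token at a time rather than whole pages, and the class name `SinkWindowCache` is made up for illustration. It shows the two properties that matter: the sink slots are never evicted, and positions are re-indexed inside the cache.

```python
from collections import deque


class SinkWindowCache:
    """Toy KV-cache index: pinned sink slots plus a sliding recent window.

    Stores token indices instead of key/value tensors so the eviction
    policy is easy to inspect. Names and sizes are illustrative.
    """

    def __init__(self, num_sinks=4, window=8):
        self.num_sinks = num_sinks
        self.sinks = []                      # never evicted
        self.recent = deque(maxlen=window)   # oldest entry drops first

    def append(self, token_index):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(token_index)
        else:
            # A full deque silently evicts its oldest element on append
            self.recent.append(token_index)

    def kept(self):
        """Original stream indices of the tokens still in the cache."""
        return self.sinks + list(self.recent)

    def positions(self):
        """Positions handed to the position encoding: always 0..len-1."""
        return list(range(len(self.kept())))


cache = SinkWindowCache(num_sinks=2, window=3)
for t in range(10):
    cache.append(t)

print(cache.kept())       # stream indices retained
print(cache.positions())  # re-indexed positions inside the cache
```

In a paged system the same idea applies at page granularity: pin the pages that hold the first tokens and evict only from the rest.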
## How this interacts with other long-context techniques

StreamingLLM is not a replacement for long-context training. Models trained with longer contexts (RoPE scaling, YaRN, position interpolation) handle genuinely long single-shot inputs better than a sliding-window model can. What StreamingLLM offers is a way to keep generation stable beyond the trained context length, by ensuring the active attention pattern stays inside the distribution the model was trained on.

It also pairs naturally with prefix caching. The sink tokens are usually inside the system prompt, which is shared across requests. If you are already caching the system prompt's KV across users, you are already keeping the sink tokens warm. The streaming policy just says "and never evict that prefix from the per-request cache during long sessions."

Speculative decoding interacts cleanly too. The draft model and target model can both use sliding windows with sinks; the speculation logic does not care about cache management.

The piece this does not solve is multi-turn retrieval over very long histories. If you need to recall a fact from 100K tokens ago in a streaming session, sliding-window attention has lost that information. The usual fix is external memory: store older turns in a vector database, retrieve relevant chunks as needed, and inject them into the recent window. The KV cache stays bounded, the relevant history stays retrievable, and the attention sink keeps the model coherent.

## Why this is worth understanding

The attention sink is a good example of behavior that emerges from architectural details (softmax weights summing to one, causal masking exposing the first tokens to everyone) rather than from anything explicit in the training objective. Understanding why it exists is what lets you design serving systems that work at long context lengths instead of collapsing.
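A toy calculation shows the redistribution effect. The logits below are invented for illustration and do not come from any real model; the point is only that softmax weights must sum to one, so removing a high-weight sink scales every remaining weight up by the same factor.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of attention logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for one head: index 0 is the sink, the rest are
# ordinary tokens the head has no strong reason to attend to.
logits = [4.0, 0.0, 0.5, -0.5]

with_sink = softmax(logits)
without_sink = softmax(logits[1:])  # the sink has been evicted

print("sink weight:        ", round(with_sink[0], 3))
print("others with sink:   ", [round(w, 3) for w in with_sink[1:]])
print("others without sink:", [round(w, 3) for w in without_sink])
```

Each remaining weight grows by a factor of `1 / (1 - sink weight)`, which is more than an order of magnitude with these numbers.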
If you have ever wondered why your long-running chat session went off the rails after a certain point, or why a "just keep the last N tokens" cache eviction strategy ruined generation quality, the attention sink is a likely culprit. The fix is mechanical, costs almost nothing in compute, and makes streaming inference behave the way you would naively expect it to.

At General Compute, fast inference is not just about FLOPs per token. It is about keeping the system stable across the kinds of long-running, high-throughput workloads that real applications produce: voice agents that stay alive through hour-long calls, coding assistants that hold a session open across many interactions, and customer-facing chat that does not get worse the longer it runs. Sink-aware cache management is one of the small architectural choices that lets that happen.

If you are building long-running agents or streaming applications and want low-latency, sink-aware inference out of the box, take a look at the [General Compute API](https://generalcompute.com). The same OpenAI-compatible interface, with the cache policies that keep your sessions stable.

---

# Mixture of Experts at Inference Time

> How MoE routing actually works during serving, why sparse activation makes large models cheaper to run per token, and what changes for the inference stack.

- Author: General Compute
- Published: 2026-05-03
- Tags: mixture of experts, moe, inference, routing, sparse models, serving
- Canonical: https://www.generalcompute.com/blog/mixture-of-experts-at-inference-time

---

A 671B-parameter model that runs at the speed of a 37B-parameter model. That is roughly the pitch of DeepSeek V3, and Mixtral 8x22B, and Llama 4 Maverick, and most of the other large models that have shown up in the last year. They are all Mixture of Experts (MoE) architectures, and the trick they share is that only a small fraction of the parameters fire on any given token. The rest sit in memory unused for that step.
This makes MoE attractive for inference: you get the quality of a much larger model without paying the per-token compute cost. The trade-offs show up in different places, mostly in memory bandwidth, in routing overhead, and in how you shard the model across GPUs. This post walks through what MoE actually does at inference time, how the routing decision works, and what the shape of an MoE serving deployment looks like compared to a dense model of similar quality. ## The basic shape of an MoE layer A standard transformer block has self-attention followed by a feed-forward network (FFN, usually called the MLP). In a dense model, every token goes through the same FFN, which is a pair of large linear projections with an activation in between. An MoE block replaces that single FFN with a set of N FFNs, called experts, plus a small router network that decides which experts each token uses. For each token, the router picks the top-k experts (commonly k=1 or k=2), runs the token through only those experts, and combines the outputs. The other N-k experts are not touched for that token. So if you have 8 experts and pick the top 2, you activate 2/8 = 25% of the FFN parameters per token. If you have 256 experts and pick the top 8 (DeepSeek V3's setup), you activate roughly 3% of the FFN parameters per token. The attention layers remain dense, so the savings only apply to the FFN portion of the model, but in modern LLMs the FFN is the bulk of the parameter count. This is why DeepSeek V3 has 671B total parameters but only 37B activated per token. The 37B is what you actually compute on; the 671B is what has to be in memory. ## How the router actually decides The router is a small network, usually a single linear layer that maps the token's hidden state to a logit per expert. Take the top-k logits, apply softmax to them, and you have a set of routing weights for that token's chosen experts. 
Mathematically, for a token with hidden state x:

```python
import torch  # PyTorch sketch; `router` is the gating linear layer

gate_logits = router(x)  # shape: [num_experts]
top_k_logits, top_k_indices = torch.topk(gate_logits, k=2)
top_k_weights = torch.softmax(top_k_logits, dim=-1)
```

Then the output is a weighted sum of the chosen experts' outputs:

```python
# `experts` is the list of expert FFNs; each chosen expert's output is scaled by its routing weight
output = sum(w * experts[i](x) for w, i in zip(top_k_weights, top_k_indices.tolist()))
```

Each token in a batch can route to a different combination of experts. Token 0 might go to experts 3 and 7, token 1 to experts 1 and 4, token 2 back to expert 3 paired with expert 0. There is no shared routing across the batch.

This per-token routing is what makes MoE serving more complicated than dense serving. The work is no longer a uniform matmul over the batch. It is a scatter-gather: send each token to its chosen experts, run the experts, gather the outputs back in the original order.

## Why MoE is faster per token

For the FFN computation, an MoE model with k=2 active experts out of 8 does roughly 1/4 of the FLOPs of a dense model with the same total FFN parameter count. The compute savings are linear in the activation ratio.

This matters a lot at decode time. Decode is mostly a memory-bandwidth-bound operation for large models, but the FFN matmuls still take real wall-clock time. Cutting them by 4x or 30x (depending on the activation ratio) is a meaningful speedup.

For prefill, the savings are similar but the picture changes. Prefill processes many tokens at once, so the matmuls are larger and more compute-bound. The router has to dispatch each prefill token to its experts, which gives you very irregular work per expert. Some experts get many tokens, some get few. This load imbalance is where most of the implementation difficulty in MoE serving lives.

## Memory does not get smaller

Here is the catch that surprises people. The compute is sparse. The memory footprint is not.

To serve an MoE model, all the experts have to be in GPU memory, ready to be called on.
You do not know in advance which experts a token will route to, so you cannot leave any of them on disk or in CPU memory without paying a load latency penalty. A 671B-parameter MoE model in FP8 takes about 671 GB of GPU memory for weights, the same as a 671B dense model would.

This means MoE models are large in memory but small in per-token compute. The arithmetic intensity (FLOPs per byte read) goes down, because a batch whose tokens route to different experts reads many experts' weights while each token computes on only a few of them. For decode, where you are already memory-bandwidth-bound, this can hurt: you might be reading the routed experts' weights at full bandwidth and not getting any speedup from the sparsity, because the bottleneck moved.

In practice, MoE models still serve faster than dense models of comparable quality, because the dense equivalent would need many more parameters to match performance. A 37B-active MoE often matches or beats a 70B dense model. So you compare 671B memory at 37B compute against 70B memory at 70B compute, and the MoE wins on per-token speed even though it loses on total memory.

## Expert parallelism

When the model is too large for one node, you have to shard the experts across GPUs. The natural way to do this is expert parallelism (EP): each GPU holds a subset of the experts. With 64 experts across 8 GPUs, each GPU holds 8 experts.

Now routing becomes a network operation. Each token has to be sent to whichever GPU holds its chosen expert, run through the expert there, and the result has to come back. This is an all-to-all communication: every GPU has tokens going to every other GPU's experts.

The all-to-all is the dominant cost in expert-parallel MoE serving. On NVLink within a node, it is fast. Across nodes over InfiniBand, it is much slower, and tuning the all-to-all becomes one of the main things separating a fast MoE serving stack from a slow one. Libraries like DeepEP and the all-to-all kernels in Megablocks exist specifically to make this efficient.
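To make the all-to-all pattern concrete, the sketch below counts how many (token, expert) pairs each GPU has to send to each other GPU for one MoE layer. It is an illustration under stated assumptions, not how DeepEP or Megablocks work internally: `all_to_all_counts` is a hypothetical helper, and it assumes experts are placed contiguously (experts 0 to 7 on GPU 0, 8 to 15 on GPU 1, and so on).

```python
from typing import List


def all_to_all_counts(
    token_gpu: List[int],
    token_experts: List[List[int]],
    num_experts: int,
    num_gpus: int,
) -> List[List[int]]:
    """Return counts[src][dst]: (token, expert) pairs sent from GPU src to GPU dst.

    token_gpu[t] is the GPU that currently holds token t's hidden state, and
    token_experts[t] is the list of expert ids the router chose for token t.
    """
    if num_experts % num_gpus != 0:
        raise ValueError("this sketch assumes experts divide evenly across GPUs")
    experts_per_gpu = num_experts // num_gpus
    counts = [[0] * num_gpus for _ in range(num_gpus)]
    for src, experts in zip(token_gpu, token_experts):
        for expert_id in experts:
            dst = expert_id // experts_per_gpu  # contiguous expert placement
            counts[src][dst] += 1
    return counts


# 64 experts on 8 GPUs, as in the example above; two tokens with top-2 routing.
counts = all_to_all_counts(
    token_gpu=[0, 1],
    token_experts=[[3, 17], [3, 63]],
    num_experts=64,
    num_gpus=8,
)
# Column sums give the number of token copies each GPU receives.
received = [sum(row[dst] for row in counts) for dst in range(8)]
```

Here both tokens picked expert 3, so GPU 0 receives two token copies while most GPUs receive none. That skew in the column sums is the load imbalance that shows up as uneven all-to-all traffic.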
EP combines with tensor parallelism (TP) and pipeline parallelism (PP) for very large models. A typical shape for a 671B MoE on 16 H100s might be EP=8, TP=2, PP=1: each pair of GPUs runs TP across the dense parts (attention, router), and each group of 8 holds the experts split across them.

## Load balancing and the imbalance problem

The router is trained to spread tokens roughly evenly across experts, but at inference time the distribution is whatever the router picks for the current input. If most tokens in a batch route to expert 3, then GPU 0 (which holds expert 3) is doing most of the work and the other GPUs are idle. The all-to-all bandwidth is also imbalanced, because all the tokens are flowing toward one GPU.

Two common mitigations:

- **Capacity factor**: cap the number of tokens per expert at some multiple of the average. If too many tokens want expert 3, the lowest-priority ones get dropped to their second-choice expert. This caps the worst-case latency at the cost of some quality.
- **Drop and reroute**: similar idea, but the dropped tokens skip the expert layer entirely (a no-op replaces their FFN computation). Easier to implement, slightly worse for quality.

For inference, neither is great. Both add complexity and slightly degrade output quality. The current best practice is to make the all-to-all kernel fast enough that imbalance does not matter much, and to use a router with explicit balance-aware logic at training time.

## Shared experts and fine-grained MoE

DeepSeek's architecture introduced a wrinkle that has been adopted by several follow-up models: shared experts that are always active, plus routed experts that are picked per token. So instead of the FFN being entirely a routing decision, a portion of it is always computed (the shared expert handles common patterns) and the routed experts add specialization on top. This stabilizes training and makes the routing decisions less load-bearing.
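The shared-plus-routed structure can be sketched as a small function: the shared expert always runs, and the top-k routed experts are added on top with their routing weights. This is a toy in pure Python that follows the top-k softmax routing described earlier in this post, not any specific model's gating function. `moe_with_shared_expert` and its parameters are hypothetical names, and experts are modeled as plain functions on lists of floats.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def moe_with_shared_expert(
    x: Vector,
    gate_logits: Sequence[float],
    shared_expert: Expert,
    routed_experts: Sequence[Expert],
    k: int,
) -> Vector:
    """Shared expert output plus a weighted sum of the top-k routed experts."""
    # Indices of the k largest router logits.
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    weights = softmax([gate_logits[i] for i in top])

    out = list(shared_expert(x))  # always computed, no routing decision involved
    for weight, index in zip(weights, top):
        routed_out = routed_experts[index](x)
        out = [o + weight * r for o, r in zip(out, routed_out)]
    return out


def scale_by(factor: float) -> Expert:
    """Toy expert that multiplies every element by a constant."""
    return lambda v: [factor * a for a in v]


result = moe_with_shared_expert(
    x=[1.0, 2.0],
    gate_logits=[0.0, 5.0, 5.0],
    shared_expert=scale_by(1.0),
    routed_experts=[scale_by(10.0), scale_by(20.0), scale_by(30.0)],
    k=2,
)
```

The only structural difference from plain top-k routing is the `shared_expert(x)` term, which is why the shared expert costs a fixed amount of extra compute per token regardless of what the router decides.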
From an inference perspective, the shared expert is a normal dense FFN computation, and the routed experts add the MoE machinery on top. Total compute per token goes up slightly compared to pure top-k routing, but the quality-per-FLOP improves.

DeepSeek V3 also uses fine-grained MoE: instead of 8 large experts, it has 256 small experts and routes to 8 of them. Each individual expert is smaller, so the routing decision is more granular, and the activation ratio drops. The network is doing more bookkeeping per token but each piece of bookkeeping touches less compute.

Fine-grained routing puts more pressure on the all-to-all. With 256 experts, each token's eight chosen experts are spread across more GPUs in a typical EP layout, so the communication pattern is denser. The DeepSeek paper spent significant effort on the kernel implementations to make this work at production speeds.

## What changes for the inference stack

Compared to serving a dense model, an MoE serving stack has to handle:

- A routing decision per token, per layer. The router is small, but it runs on every token and has to be efficient.
- Token dispatch and gather kernels. The fused permutation kernels (like the ones in vLLM, SGLang, and TensorRT-LLM) are critical, because the naive scatter-gather is slow.
- All-to-all communication when expert parallelism is used. This needs to overlap with compute as much as possible.
- Variable per-expert workloads. Some experts get more tokens than others within a batch, and the kernel has to handle that without serializing.
- Memory layout choices for the expert weights. Some implementations store them as one big tensor with strided access, others as separate per-expert tensors, with different cache and bandwidth implications.

For most users, this is hidden behind the serving framework. You ask vLLM or TensorRT-LLM to serve Mixtral or DeepSeek V3, and the framework handles the routing.
But if you are debugging slow MoE serving, the usual suspects are the all-to-all (when EP is used), the dispatch kernel (when batch sizes are awkward), and load imbalance (when the router routes badly for your workload). ## Where MoE fits For serving, MoE is increasingly the default for very large models. The compute savings make it possible to run 100B-active-parameter-equivalent quality at 30B-active speeds. The memory cost is real but mostly affordable on the multi-GPU nodes that are needed for these models anyway. For smaller models (under 30B total), dense is usually still the right answer. The routing overhead and the implementation complexity are not worth it when the dense model already fits in one or two GPUs and runs quickly. The interesting middle ground is models in the 50B to 200B parameter range, where MoE versus dense is a genuine architecture choice. Here the trade-off depends on your workload: latency-sensitive serving with small batches favors dense (no routing overhead, no all-to-all); throughput-oriented serving with large batches and many concurrent requests favors MoE (the all-to-all amortizes well over batch size, and the per-token compute savings stack). ## Closing The fundamentals are straightforward. MoE replaces a single FFN with a routed set of FFNs, only a few of which fire per token. You save compute, you spend memory, and you take on some new infrastructure complexity around routing and all-to-all communication. For models large enough that compute would otherwise be the binding constraint, the trade is worth it. For smaller models, dense is simpler and just as fast. If you want to serve MoE models without setting up the EP topology, the dispatch kernels, and the all-to-all tuning yourself, General Compute runs models like DeepSeek V3 and Llama 4 Maverick on inference hardware where the routing infrastructure is already in place. Same OpenAI-compatible API as any other model. Try it at [generalcompute.com](https://generalcompute.com). 
--- # Tensor Parallelism vs Pipeline Parallelism for Model Serving > How tensor and pipeline parallelism actually differ in production inference, when to use each, and why most serving stacks end up combining them. - Author: General Compute - Published: 2026-05-02 - Tags: tensor parallelism, pipeline parallelism, inference, distributed, gpu, serving - Canonical: https://www.generalcompute.com/blog/tensor-parallelism-vs-pipeline-parallelism-for-model-serving --- Once a model stops fitting on a single GPU, you have to split it across several. There are a handful of ways to do that, but for inference, the two that matter are tensor parallelism and pipeline parallelism. They look superficially similar (both shard a large model across multiple devices) but they have very different performance profiles, and picking the wrong one for your workload can cost you a factor of two or more on either latency or throughput. This post walks through what each one does mechanically, how communication patterns change the bandwidth requirements, and how to decide between them (or, more often, how to combine them) for a real serving deployment. ## Why a single GPU sometimes is not enough A 70B-parameter model in FP16 is 140 GB of weights. An H100 has 80 GB of HBM. The arithmetic does not work, even before you account for the KV cache, activations, and the workspace memory the kernels need. You either quantize aggressively, or you split the model. Splitting also helps when you have memory headroom but not enough compute. A 13B model fits on one GPU, but if you want to serve it with a 1,000 ms time-to-first-token budget on 32K-token prompts, a single GPU might not have the FLOPs. Spreading the work across several GPUs can pull latency down even when memory is not the binding constraint. Tensor parallelism and pipeline parallelism are the two main answers. They are not exclusive. Most large-model deployments use both at once. 
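The weight-memory arithmetic from the start of this section is easy to check in code. The sketch below covers weights only and ignores KV cache, activations, and kernel workspace, so the GPU counts it returns are lower bounds. The function names are hypothetical, and 1 GB is taken as 10^9 bytes.

```python
import math


def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Weight footprint in GB: 1e9 params times bytes per param, over 1e9 bytes per GB."""
    return params_billions * bytes_per_param


def min_gpus_for_weights(
    params_billions: float, bytes_per_param: float, gpu_memory_gb: float
) -> int:
    """Smallest GPU count whose combined memory holds the weights alone."""
    return math.ceil(weight_memory_gb(params_billions, bytes_per_param) / gpu_memory_gb)


fp16_gb = weight_memory_gb(70, 2)               # 70B parameters at 2 bytes each
fp16_gpus = min_gpus_for_weights(70, 2, 80)     # split across 80 GB GPUs
fp8_gpus = min_gpus_for_weights(70, 1, 80)      # quantize to 1 byte per parameter
```

The two calls at the end correspond to the two options named above: keep FP16 and split across at least two 80 GB GPUs, or quantize to one byte per parameter and fit the weights on one GPU with little room left for the KV cache.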
## What tensor parallelism actually does

Tensor parallelism (TP) splits the work inside each layer across GPUs. Take a linear projection that maps a hidden vector of size H to an output of size O. The weight matrix is H by O. With TP across N GPUs, you cut the matrix along the output dimension, so each GPU holds H by O/N weights and computes its own slice of the output. After the projection, you concatenate the slices (an all-gather) to recover the full result. If you instead split along the input dimension, each GPU produces a partial sum and you all-reduce.

Megatron-LM popularized a particular pattern for transformer blocks. The QKV projection is sharded along the head dimension, so each GPU owns a subset of attention heads. The attention computation runs locally on those heads. The output projection is sharded along the input dimension, which means each GPU produces a partial sum, and an all-reduce at the end collapses those partials into the final output. The MLP follows the same column-then-row pattern: the first linear is sharded column-wise (no communication needed before the activation), the second is sharded row-wise (one all-reduce after).

Two all-reduces per transformer block. That is the cost of tensor parallelism, and it is paid on every forward pass, every layer, every token.

Because the all-reduces happen inside the block, they are on the critical path for that token's compute. You cannot hide them behind other work the way pipeline parallelism can. The interconnect between GPUs has to be fast enough that the all-reduce does not stall the matmuls.

This is why TP almost always lives within a single node. NVLink between H100s in the same DGX gives each GPU about 900 GB/s of aggregate bidirectional bandwidth, roughly 450 GB/s in each direction. A PCIe Gen5 x16 link gives you about 64 GB/s per direction on a good day. If you try to do TP across PCIe or, worse, across an InfiniBand fabric between nodes, the all-reduce latency dominates and you lose more than you gain.

The practical limit on TP is the number of GPUs in one NVLink domain. On a standard 8-way H100 node, that is 8. On systems with NVLink switches and larger NVL domains, it can go higher.
Beyond that, you usually run out of useful interconnect. ## What pipeline parallelism actually does Pipeline parallelism (PP) splits the model across layers, not within them. Suppose you have an 80-layer model and 4 GPUs. Pipeline parallelism puts layers 1 to 20 on GPU 0, 21 to 40 on GPU 1, 41 to 60 on GPU 2, and 61 to 80 on GPU 3. A request flows through the GPUs in sequence: GPU 0 processes its 20 layers, sends the activation to GPU 1, GPU 1 processes its 20 layers, and so on. The communication between stages is small compared to TP. You only send the activation tensor for the boundary between stages, which is a single hidden state per token. On a 4K-token prompt with hidden size 8192 in FP16, that is 64 MB per stage boundary. That is a single point-to-point send, not a collective, and it can run over a slower interconnect without much penalty. The catch with pipeline parallelism is the bubble. If you only have one request in flight, GPU 0 is busy for the first chunk of time, then idles while GPU 1 works, then GPU 2, then GPU 3. Three out of four GPUs are doing nothing at any given moment. That is terrible utilization. The standard fix is microbatching. Split a batch of requests into microbatches and pipeline them. While GPU 1 processes microbatch 1, GPU 0 starts on microbatch 2. With enough microbatches in flight, all the GPUs stay busy most of the time. There is still a startup bubble at the front of the pipeline and a drain bubble at the end, but the steady state is high utilization. For training, this is well understood. For inference, it is more subtle, because requests come in at different times and have different lengths. ## Why pipeline parallelism is awkward for low-latency inference In a training step, you decide on a batch and run it through the pipeline. There is no real-time constraint. The bubble matters for throughput, but every microbatch eventually completes. In serving, two things complicate pipeline parallelism. 
First, time-to-first-token includes the full pipeline depth. A request has to traverse every stage before the first token comes out. If each stage takes 50 ms on its share of prefill, a 4-stage pipeline gives you a 200 ms TTFT just from pipeline traversal. You do not get the speedup that tensor parallelism gives, where every GPU contributes to the same prefill in parallel. Second, decode is sequential by nature. Each generated token depends on the previous one. So during decode, the pipeline runs one token at a time through the whole pipeline before the next token can start. A 4-stage pipeline is not 4 times faster at decode, it is roughly the same speed as a single GPU (assuming the same per-stage compute), because each token waits for the full pipeline traversal. The fix during decode is to have many concurrent requests, so the pipeline stays full of work even though each individual request only sees one token at a time. Continuous batching helps a lot here. As soon as a request finishes its decode at one stage, the next request starts at that stage. The pipeline is full of decoded tokens at different positions. The result: pipeline parallelism is good for throughput when you have many concurrent requests, and bad for latency when you do not. Tensor parallelism is the opposite: latency stays low even at small batch sizes, but it scales poorly past one node. ## Memory and weights, not just compute The split also affects how the model fits in memory. With TP across 8 GPUs, each GPU holds 1/8 of every weight matrix. The model is uniformly distributed. If you want to add another transformer layer, every GPU has to find space for its share. With PP across 4 stages, each stage holds a quarter of the layers. The split is by layer, not by tensor. Adding more layers means putting them on whichever stage has room. This is sometimes useful for unbalanced models or for fitting in heterogeneous hardware, where you have, say, one 80 GB GPU and three 40 GB GPUs. 
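For the heterogeneous case just mentioned, one simple starting point is to hand out layers in proportion to each stage's memory. The sketch below is only that: a hypothetical helper (`split_layers_by_memory`) that ignores KV cache, activations, and embedding layers, and treats every layer as the same size.

```python
from typing import List


def split_layers_by_memory(num_layers: int, gpu_memory_gb: List[float]) -> List[int]:
    """Assign a layer count to each pipeline stage, proportional to its memory.

    Uses largest-remainder rounding so the counts always sum to num_layers.
    """
    total_memory = sum(gpu_memory_gb)
    ideal = [num_layers * mem / total_memory for mem in gpu_memory_gb]
    counts = [int(share) for share in ideal]  # round each share down first

    # Give the leftover layers to the stages with the largest fractional part.
    leftover = num_layers - sum(counts)
    by_remainder = sorted(
        range(len(ideal)), key=lambda i: ideal[i] - counts[i], reverse=True
    )
    for stage in by_remainder[:leftover]:
        counts[stage] += 1
    return counts


# One 80 GB GPU and three 40 GB GPUs, with the 80-layer model used earlier.
layers_per_stage = split_layers_by_memory(80, [80, 40, 40, 40])
```

In a real deployment the split would also need to account for per-stage compute time, since an uneven split that balances memory can still leave the pipeline waiting on its slowest stage.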
KV cache memory works differently in the two regimes. With TP, the KV cache is also sharded across the heads, so each GPU stores 1/N of the per-token KV. With PP, each GPU stores the full KV for its layers. For long-context workloads, this matters. A 128K-context request with 80 layers needs to keep KV for all 80 layers somewhere. PP spreads that across stages naturally; TP keeps each layer's KV on the same set of GPUs. ## Combining them: 2D parallelism Most serving deployments above one node use both. A common shape is TP=8 within a node, PP=N across nodes. The 8 GPUs in a node share NVLink and run tensor parallelism over the high-bandwidth fabric. The pipeline stages run across the slower inter-node InfiniBand links, where the small point-to-point activation transfers do not stall. This 2D parallelism gives you the latency benefits of TP for the compute inside each pipeline stage, and the scalability benefits of PP for going beyond one node's worth of memory. The bubble cost of PP is manageable because you only have a few stages, and continuous batching keeps the pipeline full. For a 405B-parameter model, you might run TP=8, PP=2 on two 8-GPU nodes, totaling 16 GPUs. For a 1T-parameter MoE model, TP=8, PP=4 across four nodes is a common shape. The exact numbers depend on context length, batch size, and what you are optimizing for, but the pattern is consistent: TP within nodes, PP across them. ## Practical decision rules A few rough heuristics that hold up in production: - If your model fits on one GPU and you want lower latency, you do not need either. Just serve it on one GPU. Multi-GPU inference always has overhead. - If the model fits in one node but not one GPU, use TP across the GPUs in the node. Skip pipeline parallelism, it adds latency without benefit. - If the model is too big for one node, use TP within nodes and PP across them. Set TP equal to the number of GPUs per node, and pick PP based on the total weights and KV memory you need. 
- If your workload is heavily latency-sensitive at low batch sizes (voice agents, coding agents with short prompts), favor more TP and less PP. The bubble cost dominates at low batch. - If your workload is throughput-oriented at high batch sizes (offline batch jobs, bulk RAG), more PP is fine, sometimes preferable, because the pipeline stays full and the per-request latency can be amortized. The wrong answer is usually doing PP across PCIe within a node, or TP across InfiniBand between nodes. Both of those make the communication cost of the chosen scheme misalign with the available bandwidth, and you lose throughput, latency, or both. ## Sequence parallelism and other variants There are extensions to this two-axis picture. Sequence parallelism splits along the sequence dimension to reduce activation memory inside TP. Expert parallelism, used in MoE models, shards the experts across GPUs in a way that overlaps with TP and PP. Context parallelism (sometimes called Ulysses or Ring Attention) shards the attention computation across the sequence axis, which is critical for very long contexts. For most workloads under 128K context with dense models, you do not need to think about these. Plain TP within a node and PP across nodes is enough. When you start serving 1M-token contexts or trillion-parameter MoE models, the picture gets more complicated, and the trade-offs shift. ## Where this fits in a serving stack vLLM, TGI, TensorRT-LLM, and SGLang all support TP out of the box. PP support is more uneven, and the interaction with continuous batching is where implementations differ the most. If you are choosing a stack for multi-node inference, the quality of the pipeline scheduler matters more than the quality of the kernels, because the pipeline scheduler is what determines whether the bubble eats your throughput. For most users, this is invisible. You set TP=8 in a config, the runtime handles the rest, and you get the model served. 
But when something is slow, knowing which axis of parallelism is paying for what helps you debug it. A slow TTFT is often a TP problem. Low GPU utilization with high latency is often a PP scheduling problem. Memory pressure on one GPU but not others usually means an uneven pipeline split. ## Closing Tensor parallelism and pipeline parallelism solve overlapping problems with different trade-offs. TP gives you low latency and good utilization, but it needs fast interconnect and stops scaling past one node. PP scales as far as you have nodes and bandwidth, but it adds latency and only works well with enough concurrent requests to keep the pipeline full. The combination, TP inside nodes and PP between them, is the default for serving anything bigger than a single node can hold. The remaining work is tuning the exact shape to your workload, which mostly comes down to whether you are optimizing for tail latency on small batches or throughput on large ones. If you want to skip the parallelism tuning entirely, General Compute serves these models on custom inference hardware where the parallelism strategy is already chosen for you. Same OpenAI-compatible API, no config files to tune. Try it at [generalcompute.com](https://generalcompute.com). --- # Prefix Caching: Why Repeated Prompts Shouldn't Cost You Twice > How prefix caching works in modern LLM serving stacks, why it changes the economics of long system prompts and RAG, and what to watch out for in production. - Author: General Compute - Published: 2026-05-01 - Tags: prefix caching, kv cache, inference, vllm, sglang, production - Canonical: https://www.generalcompute.com/blog/prefix-caching-why-repeated-prompts-shouldnt-cost-you-twice --- Look at the prompts your application actually sends to an LLM and you will notice something embarrassing. Most of them share long stretches of identical text. The system prompt is the same on every call. The few-shot examples are the same. The retrieved documents repeat across users. 
The agent's tool definitions are the same. The conversation history is the same as it was a turn ago, plus a little bit at the end. Without prefix caching, the model recomputes the KV cache for all of that shared text on every single request. You pay for the prefill, the GPU sits there crunching attention over tokens it has already processed a thousand times today, and your users wait. Prefix caching is the fix, and on workloads that have any meaningful prompt overlap, it is one of the largest practical speedups available to a serving stack. This post walks through how prefix caching actually works, what the different implementations do differently, and the operational details that decide whether you get the full speedup or a fraction of it. ## Why prefill dominates so often To see why prefix caching matters, it helps to remember what prefill is doing. When a request comes in with N input tokens, the model runs a forward pass over all N tokens to produce the KV cache that decoding will use. The compute cost of that forward pass is roughly proportional to N for the feedforward layers and to N squared for attention. Most production prompts are heavy on input and light on output. A coding agent might send 8,000 tokens of context and expect 200 tokens back. A RAG application might send 4,000 tokens of retrieved chunks and ask for a one-paragraph answer. A voice agent sends a system prompt plus a few turns of history and gets a sentence in response. In all of these, prefill is doing more work than decode, often by an order of magnitude. If 90% of those input tokens are identical to a previous request, recomputing them is pure waste. The KV cache that the prefill pass produces is a deterministic function of the input tokens (and the model weights, and the position embeddings, but more on that later). If you computed it once already and you still have it sitting in GPU memory, you can reuse it. ## The basic idea Prefix caching is straightforward in principle. 
When a request arrives, hash its prompt prefix in token-aligned blocks. For each block, check whether you already have its KV cache stored. If you do, skip the prefill for that block and link the existing cache into the request. Run prefill only on the suffix that you have not seen before. The unit of caching is usually a block of fixed size, the same blocks that PagedAttention uses to manage KV memory. A block is typically 16 or 32 tokens. You hash the contents of the prefix up to and including each block boundary, and use that hash as a key into a cache of physical KV blocks. Two requests share a cached block if and only if they have the same tokens up to that block boundary. Once they diverge by even a single token, the rest of their KV is different, and the cache lookup stops. The savings depend entirely on how much of your traffic shares prefixes. A naive chatbot with no system prompt and no shared history will see almost no benefit. A coding agent where every request starts with 5,000 tokens of identical instructions, tool definitions, and project context will see prefill latency drop by something close to that 5,000-token fraction. ## Block-level caching in vLLM vLLM's prefix caching is the most widely deployed version of this. It builds on top of PagedAttention, which already chops the KV cache into fixed-size blocks for memory management. Each block has a hash computed from the tokens it contains plus the hash of the previous block. This makes the hash a function of the entire prefix up to that block, not just the local token contents. When a new request comes in, vLLM walks its prompt token by token, computing block hashes as it goes. For each block hash, it checks the cache. Hits get linked into the new request's block table; misses trigger prefill for that block onwards. Eviction is LRU on physical blocks that are not currently referenced by any active request. 
A block can be referenced by multiple requests at once if their prefixes match, so reference counts matter. When memory gets tight, vLLM frees the least recently used unreferenced blocks first. The implementation cost is small. The hash computation is cheap, the lookup is a hash map probe, and the existing PagedAttention machinery already handles non-contiguous KV blocks during attention. The end result is that you get prefix sharing without changing the attention kernel. ## RadixAttention and tree-structured caching SGLang's RadixAttention takes the idea further. Instead of a flat hash table of blocks, it stores cached prefixes in a radix tree. The tree's edges are token sequences and its nodes hold KV cache references. Inserting a prompt walks down the tree, extending paths and splitting nodes when prompts diverge. The tree structure makes some things cleaner. Looking up the longest matching prefix is a tree walk, which naturally finds the longest shared subpath. Eviction can be done at the granularity of tree nodes, with LRU on subtrees that are no longer referenced. And because the tree explicitly represents the relationships between prompts, you can reason about cache structure more easily when debugging. In practice the difference between block-hashed caches and radix-tree caches is mostly about implementation taste rather than raw performance. Both achieve the same asymptotic behavior on the same workload. The tree version tends to be a slightly better fit for workloads where prompts naturally form a tree, like agent traces that fan out from a common root. The block-hashed version is simpler and integrates more directly with paged KV memory. ## Position embeddings and why they matter Here is the detail that catches people. The KV cache is not just a function of the tokens. It is a function of the tokens at specific positions. 
If your model uses RoPE or any other position-dependent encoding, the K and V vectors at position 100 are not the same as the K and V vectors for the same token at position 200. This is fine when prefixes are reused at the start of the prompt, because the positions are identical. The first 1,000 tokens of the system prompt are always at positions 0 through 999. Cache hit, position match, you are done. It breaks when you try to reuse a cached chunk in the middle of a prompt. If you cache the KV for a paragraph at positions 500 through 999 in one request, you cannot just splice it into another request where the same paragraph appears at positions 1500 through 1999. The Q vectors at the new positions will not produce correct attention scores against K vectors computed at the old positions. The clean solutions are limited. Either you only cache prefixes (which is what prefix caching does, and why it is called that), or you use a model architecture that is robust to position shifts, or you do extra math to "shift" the cached KV to its new positions. CacheBlend and a few other research papers have explored the third option. In production, the first option is dominant because it is simple and correct. The practical consequence is that if you want the most caching, structure your prompts so that the variable content goes at the end. System prompt first, then few-shot examples, then retrieved documents (if they are stable across requests), then the user query last. This pushes the cache boundary as far to the right as possible. ## Where the wins actually come from In real deployments, the workloads where prefix caching pays off the most tend to follow a pattern. **Agents with long system prompts.** A coding agent might have 4,000 tokens of system prompt: tool definitions, formatting rules, examples of good and bad responses, a description of the codebase conventions. Every request the agent makes starts with that prompt. 
Prefix caching means the agent's per-request prefill cost is dominated by the user's actual query and the recent context, not the boilerplate. **Multi-turn chat.** Each turn's prompt is the previous turn's prompt plus a new user message and a new assistant response. The KV cache for everything except the new tokens already exists. Without prefix caching, every turn does a full prefill of the entire conversation. With it, each turn prefills only the deltas. **RAG with stable document sets.** If your system prompt and your retrieved documents are stable across many requests (think: a customer support bot grounded in a fixed knowledge base, or an agent operating on a single project), prefix caching keeps the KV for those documents warm. **Batch-style evaluation.** Running the same prompt template over many inputs is the canonical case. The shared template is cached once, every example pays only for its variable suffix. The wins are smaller or zero on workloads where every prompt is genuinely different. Open-ended chat with no system prompt, semantic search over user queries, or one-shot tasks with unique inputs do not benefit much. ## Eviction and capacity planning The size of the prefix cache is whatever GPU memory is left after the working set of active requests. On a busy server, this is often less than you would hope. Each active request has its own KV cache that is pinned and cannot be evicted, and the prefix cache competes with that for the same memory pool. Tuning here is mostly about throttling. If you let your batch size grow without bound, the prefix cache shrinks until it cannot hold even the system prompt, and your hit rate collapses. A reasonable default is to leave a meaningful fraction of KV memory, say 20 to 40 percent, available for prefix caching after subtracting expected concurrent request memory. Eviction policy matters less than people expect. LRU on unreferenced blocks does fine in almost all cases. 
The pathological workload would be one where you cycle through more distinct prefixes than fit in cache, but that is rare in practice. The much more common failure is undersizing the cache, not picking the wrong eviction policy. ## What to instrument If you are running an LLM serving stack with prefix caching enabled, the metrics worth watching are the hit rate (fraction of input tokens served from cache) and the prefill token reduction (raw count of tokens skipped). Hit rate by itself can be misleading because cache hits on short prefixes do not save much, while a single cache hit on a 4,000-token prefix can cut your prefill time in half. If your hit rate is lower than you expected, the usual suspects are: prompts not actually being identical (whitespace differences, ordering of fields in serialized JSON, stable but not literally equal templates), variable content being placed too early in the prompt, or the cache being too small. All three are fixable once you can see them. ## Closing Prefix caching is one of the few inference optimizations that is purely free. It does not change the model output, it does not require retraining, and it does not introduce new failure modes. On workloads with shared prefixes, which is most production workloads, it is the single highest-leverage optimization a serving stack can add. If you are running open-source inference servers, prefix caching is on by default in vLLM and SGLang at recent versions. If you are running on a managed inference provider, it is worth confirming whether they have it enabled and whether their cache survives across your requests. Speed work tends to involve heavy lifting; this one mostly just involves checking the box. General Compute serves models on infrastructure designed around exactly these inefficiencies. If you are tired of paying for the same tokens twice, [try our API](https://generalcompute.com) and see what your prompts look like when the prefix is already cached. 
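The two most common causes of a low hit rate named under "What to instrument" (prompts that are not literally identical, and variable content placed too early) can be checked offline before touching a server. The sketch below is a toy model, not vLLM's or SGLang's actual implementation: it chains hashes over fixed-size token blocks, in the spirit of the block-hashed cache described earlier, and reports the fraction of input tokens that would have been served from cache. `BLOCK_SIZE`, `block_hashes`, and `ToyPrefixCache` are hypothetical names, and whitespace-split words stand in for real tokens.

```python
import hashlib
import json

BLOCK_SIZE = 4  # tokens per block; an assumption chosen to keep the toy small


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block together with everything before it.

    Chaining the previous hash into the next one means a block only
    matches when the entire prefix up to and including it is identical.
    """
    hashes = []
    prev = b""
    full = len(tokens) - len(tokens) % block_size  # ignore the partial tail block
    for start in range(0, full, block_size):
        block = json.dumps(tokens[start:start + block_size]).encode()
        prev = hashlib.sha256(prev + block).digest()
        hashes.append(prev)
    return hashes


class ToyPrefixCache:
    def __init__(self):
        self.seen = set()
        self.cached_tokens = 0
        self.total_tokens = 0

    def request(self, tokens):
        """Return how many leading tokens hit the cache, then insert the prompt."""
        hashes = block_hashes(tokens)
        hit_blocks = 0
        for h in hashes:
            if h not in self.seen:
                break
            hit_blocks += 1
        self.seen.update(hashes)
        hit = hit_blocks * BLOCK_SIZE
        self.cached_tokens += hit
        self.total_tokens += len(tokens)
        return hit

    def hit_rate(self):
        """Fraction of input tokens served from cache so far."""
        return self.cached_tokens / self.total_tokens if self.total_tokens else 0.0


system = "You are a support bot . Answer briefly and cite the docs .".split()

# Variable content last: the shared system prompt is a reusable prefix.
cache = ToyPrefixCache()
cache.request(system + "How do I reset my password ?".split())
suffix_hit = cache.request(system + "Where is my invoice ?".split())

# Variable content first: the very first block differs, so nothing matches.
early = ToyPrefixCache()
early.request("How do I reset my password ?".split() + system)
early_hit = early.request("Where is my invoice ?".split() + system)

print(suffix_hit, early_hit, round(cache.hit_rate(), 2))
```

With these toy inputs, the suffix ordering reuses 12 of the 18 tokens in the second request, while the early-variable ordering reuses none even though the system prompt text is identical. Replaying a sample of real prompts through something like this is a cheap way to see where your cache boundary actually falls.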
--- # Distillation for Inference: How Smaller Models Learn From Larger Ones > A practical guide to knowledge distillation for production inference: what actually works, what to skip, and how to ship a smaller model without losing the behavior you cared about. - Author: General Compute - Published: 2026-04-30 - Tags: distillation, inference, model compression, training, production - Canonical: https://www.generalcompute.com/blog/distillation-for-inference-how-smaller-models-learn-from-larger-ones --- If you serve LLMs in production, you have probably stared at the same tradeoff for a while now. The big model is good. The big model is also slow and expensive. The small model is fast and cheap, but it makes the kind of mistakes your users notice immediately. Distillation is the standard answer to this gap, and it actually works, but the literature on it is a mess of techniques that sound similar and behave very differently in practice. This post is about what distillation looks like once you stop reading papers and start running training jobs. We will cover the different flavors of distillation, when each one is the right tool, and the operational details that decide whether the distilled model ends up in production or in a Slack thread titled "why we paused this project." ## What distillation actually is Knowledge distillation is the practice of training a smaller "student" model to imitate a larger "teacher" model. The teacher has already been trained, usually at considerable cost. The student gets to skip most of that work and instead learn from the teacher's outputs, which are richer than the raw labels in your dataset. The original framing from Hinton, Vinyals, and Dean in 2015 is still the cleanest way to think about it. A normal classifier sees a label like "this image is a cat" and learns to push probability toward "cat." A distilled student sees the teacher's full output distribution: 87% cat, 9% lynx, 3% dog, 1% everything else. 
Those soft targets carry information about how the teacher organizes the world. The student learns not just what the answer is, but how confident the teacher is and what the plausible alternatives are. For LLMs the story is the same, just over vocabularies of 100k tokens instead of 1k image classes. Every time the teacher predicts a next token, it produces a probability distribution over the entire vocabulary. The student tries to match that distribution. This is far more informative than just training the student on the argmax token, because it teaches the student where uncertainty lives. ## The three flavors that matter in practice Distillation comes in many variants in the literature. In production you will mostly run into three of them. ### Response-based distillation The student is trained to match the teacher's output token distributions. You run the teacher across a corpus of prompts, save the logits or top-k probabilities, then train the student with a KL divergence loss against those distributions. Sometimes called soft-target distillation or logit distillation. This is the workhorse approach for LLMs. It is cheap to set up, it composes with normal language modeling losses, and the data you generate (prompt plus teacher distribution) is reusable across multiple student training runs. ### Feature-based distillation The student is trained to match the teacher's intermediate activations, not just the output. You pick layers in the teacher and corresponding layers in the student, and add an MSE loss between their hidden states. The "FitNets" paper from 2014 introduced this, and variants have appeared regularly since. This works well when the student architecture is similar to the teacher's, just narrower or shallower. It struggles when the architectures diverge, because there is no natural correspondence between layers. For most LLM distillation projects you can ignore this until response-based distillation has stopped giving you gains. 
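As a rough illustration of the feature-matching loss described above, the sketch below adds an MSE term between selected student and teacher hidden states. It is a toy under stated assumptions, not the FitNets recipe verbatim: `FeatureDistillLoss`, the `layer_map` pairing, and the learned linear projection (needed because a narrower student has a smaller hidden size than the teacher) are illustrative choices, and the hidden states are assumed to arrive as lists of `[batch, seq, hidden]` tensors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureDistillLoss(nn.Module):
    """MSE between projected student hidden states and teacher hidden states."""

    def __init__(self, student_dim, teacher_dim, layer_map):
        super().__init__()
        # layer_map pairs a student layer index with a teacher layer index,
        # e.g. {1: 3} means "student layer 1 should imitate teacher layer 3".
        self.layer_map = dict(layer_map)
        # One projection per pair lifts the student's width to the teacher's.
        self.proj = nn.ModuleDict({
            str(s): nn.Linear(student_dim, teacher_dim) for s in self.layer_map
        })

    def forward(self, student_hidden, teacher_hidden):
        loss = 0.0
        for s_idx, t_idx in self.layer_map.items():
            projected = self.proj[str(s_idx)](student_hidden[s_idx])
            # The teacher is frozen, so no gradient flows into its activations.
            loss = loss + F.mse_loss(projected, teacher_hidden[t_idx].detach())
        return loss / len(self.layer_map)


# Toy shapes (assumed): a 2-layer, 64-wide student and a 4-layer, 256-wide teacher.
torch.manual_seed(0)
student_hidden = [torch.randn(2, 8, 64, requires_grad=True) for _ in range(2)]
teacher_hidden = [torch.randn(2, 8, 256) for _ in range(4)]

criterion = FeatureDistillLoss(64, 256, layer_map={0: 1, 1: 3})
feature_loss = criterion(student_hidden, teacher_hidden)
feature_loss.backward()
```

The explicit `layer_map` is where the "no natural correspondence between layers" problem shows up: when the architectures diverge, there is no principled way to fill it in. In a real run this term would typically be added to the response-based loss with a small weight rather than used on its own.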
### On-policy distillation The student generates its own outputs, and the teacher scores them. The student is trained to make outputs that the teacher rates highly. This is essentially RLHF with the teacher acting as the reward model. It is more expensive than response-based distillation because you need to run the student during training and then run the teacher on the student's outputs, but it directly optimizes for behaviors the student can actually produce. This matters more than it sounds. In response-based distillation, the student is trained on the teacher's continuations of prompts. But the student, once deployed, will be continuing its own previous tokens, not the teacher's. There is a distributional mismatch between training and inference. On-policy distillation closes that gap. ## What you actually distill The framing of "match the teacher's outputs" hides a real decision: which outputs? You can distill on: - **Your existing training data.** Run the teacher over your prompts, capture distributions, train the student. This is the simplest case. - **Synthetic data the teacher generates.** Have the teacher complete prompts (real or templated), and use those completions plus distributions as training data. Most modern small models that punch above their weight, including the Phi family and several Qwen sizes, lean heavily on this. The teacher both produces the inputs and provides the supervision. - **Targeted distributions.** If you care about specific behaviors (math, JSON output, refusals, tool calls), generate prompts that exercise those behaviors and distill on those. This is where distillation stops being a generic compression technique and starts being a behavior transfer technique. The "targeted" version is the one that gets the most bang for the buck in production. If you have a 70B model that handles your customer support queries well and you want to ship a 7B replacement, you should not distill on a generic web corpus. 
You should distill on the actual distribution of queries you serve, plus edge cases and adversarial inputs, with the teacher producing high-quality responses you can train against.

## A minimal distillation loop

Here is the shape of a basic response-based distillation training step. Real implementations have more bookkeeping, but this is the core.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, batch, temperature=2.0, alpha=0.5):
    input_ids = batch["input_ids"]
    # Labels are assumed to be aligned with the logits already (shifted by
    # one position), with -100 marking positions to ignore.
    labels = batch["labels"]

    # The teacher is frozen, so skip gradient tracking for its forward pass.
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits

    student_logits = student(input_ids).logits
    vocab_size = student_logits.size(-1)

    # Distillation loss: student matches teacher's softened distribution.
    # Flatten to [tokens, vocab] so "batchmean" averages per token, which
    # keeps this term on the same scale as the cross-entropy below.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1).reshape(-1, vocab_size)
    soft_log_preds = F.log_softmax(student_logits / temperature, dim=-1).reshape(-1, vocab_size)
    distill_loss = F.kl_div(
        soft_log_preds, soft_targets, reduction="batchmean"
    ) * (temperature ** 2)

    # Standard language modeling loss against the real labels.
    lm_loss = F.cross_entropy(
        student_logits.reshape(-1, vocab_size),
        labels.reshape(-1),
        ignore_index=-100,
    )

    return alpha * distill_loss + (1 - alpha) * lm_loss
```

A few things worth noting. The temperature parameter softens the teacher's distribution, which exposes more information about the relative ordering of low-probability tokens. The `temperature ** 2` factor keeps the gradient magnitude comparable to the unscaled cross-entropy term. The `alpha` parameter controls the mix between matching the teacher and matching the ground-truth labels, and you usually want both, because pure distillation can let the student inherit the teacher's mistakes.

In practice you will not run the teacher live during training unless your teacher is small enough that it fits on the same accelerators alongside the student. For real LLM distillation you precompute the teacher's outputs (top-k logits or full distributions) and store them. The training loop then reads those from disk.
This trades storage for compute, and at scale it is almost always the right tradeoff. ## How to choose the student architecture The student architecture decides almost everything about the final tradeoff. Some practical guidance. **Match the teacher's tokenizer.** If the student uses a different tokenizer, you cannot do logit distillation directly because the vocabulary spaces do not align. There are workarounds, but they are painful and lossy. Pick a student that shares the teacher's tokenizer and you avoid an entire class of problems. **Pick a student that is already good.** The student's pre-training matters more than people expect. Distilling onto a randomly initialized small model is much harder than distilling onto a well pre-trained small model. Start from a strong checkpoint of the size class you want, then distill on top. **Aim for 5x to 20x compression.** Below 5x compression, you might as well just quantize the teacher and call it done. Above 20x compression, the student's capacity is so much smaller than the teacher's that distillation alone usually cannot close the gap. The sweet spot is somewhere in between, and it is the regime where most production wins happen: distill a 70B teacher into a 7B student, or a 7B teacher into a 1B student. **Width vs depth matters.** A student that is shallower than the teacher loses reasoning depth. A student that is narrower loses representational capacity. For LLMs, narrower-but-similar-depth students tend to preserve behavior better than shallower-but-similar-width students. Reasoning seems to live in the depth. ## Where distillation actually wins Distillation has clear sweet spots in production. 
It works very well for **task specialization.** If you have one model that needs to handle one narrow domain (extracting fields from invoices, classifying support tickets, summarizing meeting transcripts), distillation from a frontier teacher onto a small student can preserve almost all the relevant quality at a fraction of the latency and cost. The student does not need to know how to write poetry or do calculus. It just needs to do the one thing. It works well for **format and style transfer.** If your teacher has been carefully tuned to produce JSON in a specific schema, or to refuse certain queries in a specific tone, distillation can transplant that behavior to a smaller model more reliably than re-doing the tuning from scratch on the small model. It works well as a **cost-cutting move on a deployed model.** When you already serve a large model and have collected real production traffic, you can use that traffic as the distillation dataset. The student is trained on exactly the distribution it will see in production. This is one of the highest-ROI uses of distillation that exists. ## Where distillation breaks down Distillation does not magically make a 1B model as smart as a 70B model on every task. Specifically, it tends to fall short in a few places. **Long-context reasoning.** Tasks that require holding a lot of state in working memory and chaining many steps of inference seem to need raw capacity. Distillation can transfer surface behavior but not always the underlying reasoning depth. **Out-of-distribution inputs.** The student inherits the teacher's strengths on the distillation distribution. On inputs that fall outside that distribution, the student often degrades faster than the teacher would. This is why your distillation dataset matters so much. **Calibration.** Distilled students often end up overconfident, especially when distilled from teachers that were themselves overconfident. 
If your application depends on calibrated probabilities (routing, abstention, threshold-based decisions), measure calibration on the student before you ship. ## Combining distillation with other techniques Distillation does not exist in isolation. In production you will usually stack it with other compression techniques. **Distillation then quantization.** Distill a 70B teacher to a 7B student, then quantize the student to 4-bit or FP8. The quality drop from quantization is usually small if the student was well-trained, and the combined speedup is much larger than either technique alone. **Distillation as a starting point for fine-tuning.** Distill a general student from a general teacher, then fine-tune the student on your domain data. The distilled checkpoint is a much better starting point than a random small model. **Distillation plus speculative decoding.** A distilled small model is an unusually good draft model for speculative decoding against its teacher. Because they share representations, the acceptance rate of the small model's drafts is higher than if you used an unrelated small model. ## Operational details that matter A few things that nobody warns you about until you have shipped a distilled model and watched it misbehave. **Distill on conversation traces, not just single turns.** If your model is used in multi-turn settings (chat, agents, tool use), distill on full traces rather than single prompt-response pairs. Single-turn distillation tends to produce models that handle the first turn well and degrade across turns. **Watch for capability collapse.** If you distill heavily on one task, the student can lose abilities the teacher had on other tasks. Mix in some general data even if you only care about one workload, or be prepared for surprise regressions when users do something off-script. 
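One simple way to act on the "mix in some general data" advice is to draw each training example from the task-specific pool or a general pool with a fixed probability. The sketch below is a minimal illustration; `mixed_stream`, the pool contents, and the 20 percent general fraction are assumptions chosen for the example, not a recommended setting.

```python
import random


def mixed_stream(task_examples, general_examples, general_fraction=0.2, seed=0):
    """Yield examples forever, drawing from the general pool with the given probability."""
    rng = random.Random(seed)
    while True:
        pool = general_examples if rng.random() < general_fraction else task_examples
        yield rng.choice(pool)


task_pool = [f"ticket-{i}" for i in range(100)]      # stand-in for targeted distillation data
general_pool = [f"general-{i}" for i in range(100)]  # stand-in for broad, general-purpose data

stream = mixed_stream(task_pool, general_pool, general_fraction=0.2)
sample = [next(stream) for _ in range(10_000)]
observed = sum(x.startswith("general") for x in sample) / len(sample)
print(f"general share: {observed:.3f}")
```

The right fraction is an empirical question: treat it as one more thing to sweep against your own evals, including evals for the off-script behavior you are trying to protect.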
**Evaluate on your real metrics, not benchmarks.** A distilled model can score the same as the teacher on MMLU and behave noticeably worse on the specific thing your users do. Build evals from your own production traffic and grade against those. **Iterate on the dataset, not the loss.** Most of the meaningful gains in distillation come from improving the data, not from tweaking the loss function. New temperature schedule? Probably won't move the needle. More targeted prompts that exercise the failure modes you care about? Will move it a lot. ## Putting it together If you are starting a distillation project today, the path that works most reliably is something like this. Pick a strong pre-trained student in the size class you want. Use the same tokenizer as the teacher. Build a distillation dataset from your real traffic plus targeted synthetic data for behaviors you want to preserve. Train with a mix of distillation loss and standard language modeling loss against any ground-truth labels you have. Evaluate on your own production metrics, not benchmarks. Quantize the result. The output is a model that is 5x to 10x faster and cheaper than the teacher, behaves the way the teacher behaves on the workloads you care about, and runs cleanly on infrastructure built for inference rather than training. That is a genuinely useful position to be in, and it is reachable with techniques that have been stable for a while. The hard part is not the algorithm, it is the dataset and the evaluation discipline around it. If you want to put a distilled model into production with serving infrastructure designed for low-latency inference, [General Compute's API](https://generalcompute.com) is built for exactly this kind of workload: small fast models running on hardware optimized for throughput and tail latency, with an OpenAI-compatible interface so you can drop in your distilled model behind your existing client code. The model is the artifact, the serving stack is what makes it useful. 
--- # FP8 Training and Inference: The Precision Sweet Spot > Why 8-bit floating point hits a different point on the accuracy/throughput curve than INT8, how E4M3 and E5M2 are used in practice, and what FP8 actually buys you in production serving. - Author: General Compute - Published: 2026-04-29 - Tags: fp8, quantization, inference, training, hopper, blackwell - Canonical: https://www.generalcompute.com/blog/fp8-training-and-inference-the-precision-sweet-spot --- For a long time, the default story for low-precision LLM serving was "train in BF16, quantize the weights to INT4 or INT8, hope nothing important breaks." That story is being replaced by something simpler. With Hopper and now Blackwell hardware, FP8 is a first-class numeric format both for training and for inference, and it sits in a useful spot on the precision/throughput curve. You get roughly half the memory of BF16, double the matmul throughput, and accuracy that is much closer to BF16 than INT8 typically gets you. This post is about what FP8 actually is, why two flavors exist, how training and inference use them differently, and where the format wins or loses against the alternatives. ## What FP8 actually is FP8 is an 8-bit floating point number. Like FP16 or BF16, it has a sign bit, an exponent field, and a mantissa field. Unlike INT8, it is not a uniform grid of values. The representable numbers cluster densely near zero and spread out exponentially as magnitude grows, which is exactly what you want for tensors whose values span several orders of magnitude. There are two common FP8 formats, both standardized in the OCP (Open Compute Project) FP8 spec and supported in NVIDIA Hopper and Blackwell tensor cores: - **E4M3**: 1 sign bit, 4 exponent bits, 3 mantissa bits. Range is roughly +/- 448, with finer resolution near zero. Used for weights and activations. - **E5M2**: 1 sign bit, 5 exponent bits, 2 mantissa bits. Range is roughly +/- 57344, with coarser resolution. 
Used for gradients and any tensor with a wide dynamic range.

E4M3 sacrifices range for precision; E5M2 sacrifices precision for range. The choice between them is not a knob you tune per layer in production. It is dictated by the role the tensor plays. Weights and forward activations are bounded enough that E4M3 fits. Gradients during the backward pass can have outliers many orders of magnitude away from the median, and they need E5M2 just to avoid overflowing.

E5M2 is effectively IEEE 754 binary16 (FP16) with the mantissa truncated to 2 bits: it keeps the same 5-bit exponent and the IEEE conventions for infinities and NaNs. Frameworks typically expose it under a name like `float8_e5m2`.

## Why FP8 and not INT8

INT8 quantization has been around for years. It works. It is also picky in ways that FP8 is not.

INT8 is uniform. Every value step is the same size. To represent a tensor whose values range from -50 to +50, you compute a scale, divide everything by the scale, and round to integers in the range -128 to +127. If the tensor has a few outliers at +200, you either clip them (losing information) or you stretch the scale to cover them (losing resolution everywhere else). Activation outliers in transformers are exactly this problem, and a large fraction of the quantization literature is dedicated to managing them: SmoothQuant, AWQ, GPTQ with grouped scales, per-channel quantization, mixed-precision rescue layers.

FP8 sidesteps a lot of that. The exponent field gives you orders-of-magnitude coverage natively. An activation tensor with a few channels in the hundreds and most channels near 0.1 fits inside E4M3 without per-channel surgery, because the format is already logarithmic. You still apply a tensor-wide scale, often called `amax/Fmax` scaling, but the format does most of the work that those outlier-management techniques have to do for INT8.

The tradeoff is that FP8 has fewer bits of mantissa than you might want. E4M3 with 3 mantissa bits gives you only 8 evenly spaced values between adjacent powers of 2, the resolution of a 3-bit uniform quantizer over that interval.
So FP8 is not strictly better than INT8 in every case. It is better at handling dynamic range, often comparable on accuracy, and on Hopper it gives up no peak tensor-core throughput relative to INT8.

## Hardware support

H100 Tensor Cores expose FP8 matmul at roughly 2x the throughput of BF16 matmul. The actual numbers depend on the SKU and the matmul shape, but for a square matmul on H100, BF16 runs near 990 TFLOPS, FP8 runs near 1980 TFLOPS, and FP8 with sparsity hits roughly 3960 TFLOPS. Memory bandwidth is unchanged, but every byte you move now carries a whole value instead of half of one, so for memory-bound kernels (decode, in particular) you also pick up close to 2x.

Blackwell pushes this further. B100 and B200 add support for FP6 and FP4 alongside FP8, with similar 2x scaling between adjacent formats. They also add per-block scaling formats (MXFP8, MXFP6, MXFP4) where the scaling factor is shared across a small block (often 32 elements) rather than across the whole tensor. The block-scaled formats give you most of the dynamic-range robustness of per-tensor scaling while letting individual blocks adapt, which matters for activations with concentrated outliers.

AMD MI300X supports FP8 matmul. Intel Gaudi 3 supports FP8. The hardware story is no longer NVIDIA-only. The kernel ecosystem still leans heavily on NVIDIA tooling (Transformer Engine, cuBLASLt, FlashAttention with FP8), and that gap is more pronounced than the raw silicon gap.

## FP8 in training

Training a large model in FP8 is not the same as inference. The gradients have a much wider dynamic range than the weights or activations, and a single misbehaving gradient can either underflow to zero or overflow past the largest representable value in E4M3. The standard recipe handles this with a few moving parts:

1. Forward activations and weights are stored in E4M3 with per-tensor scales.
2. Backward gradients use E5M2 with per-tensor scales.
3. Master weights stay in BF16 or FP32. Updates accumulate in higher precision.
4. Scaling factors are updated continuously based on the running maximum of recent tensors, often called `amax history`.

NVIDIA's Transformer Engine library is the reference implementation. It wraps `nn.Linear`, `nn.LayerNorm`, and the attention path so that the user-facing API stays in BF16 while the matmul kernels run in FP8 internally. The library tracks per-tensor `amax` over the last N steps, picks a scale that puts the maximum near the top of the FP8 range, and falls back to higher precision for any layer where the gradient distribution is too pathological.

The reported results from the original FP8-LM paper (Microsoft, 2023) and follow-ups are consistent: GPT-class models trained from scratch in FP8 reach a final validation perplexity within about 0.5% of the BF16 run, with roughly 35% wall-clock speedup and 30-40% memory savings. The numbers depend on the model size and the implementation, but the headline is that FP8 training is not an accuracy compromise. It is a hardware utilization win.

The catch: scale management is fiddly. If your `amax` history is too short, scales overshoot and you saturate. If it is too long, scales lag actual tensor magnitudes and you waste range. Most implementations use a 16- or 32-step rolling maximum with a small margin factor. Block-scaled formats on Blackwell remove most of this fiddliness because the scale adapts at the block level automatically.

## FP8 in inference

For inference, the FP8 setup is simpler because you do not have gradients to worry about. The two main forms in production are weight-only FP8 and full FP8.

**Weight-only FP8** stores the weights in E4M3 with a per-tensor or per-channel scale. Activations stay in BF16. At matmul time, weights are dequantized on the fly to BF16 inside the kernel, before they reach the tensor core. You get half the weight memory and half the weight bandwidth, but the matmul itself runs at BF16 throughput. This is mostly a memory and bandwidth optimization.
It helps decode (which is memory-bound) more than prefill.

**Full FP8** quantizes both weights and activations to E4M3, runs the matmul in FP8, and accumulates in FP32. This is where you get the 2x compute advantage. The accuracy cost is small but nonzero: typical reports for Llama 70B class models show 0.1 to 0.3 perplexity points on standard benchmarks, with most of that recovered by calibrating activation scales against a small dataset. For chat and coding tasks, the win-rate difference against BF16 is usually inside the noise floor.

KV cache in FP8 is its own thing. The KV cache for long-context serving is often the dominant memory consumer, and storing K and V in E4M3 cuts its size in half compared to BF16. This buys you longer context, larger batch size, or both. The accuracy cost is again small if you per-tensor-scale K and V at write time, and slightly larger if you do not. Most serving frameworks (vLLM, TensorRT-LLM, SGLang) support FP8 KV cache as a flag.

A detail worth knowing: FP8 KV cache and FP8 attention matmul are not the same toggle. You can store the cache in FP8 and run the attention scores in BF16 (with on-the-fly dequant), which captures the memory benefit without changing the attention numerics. Or you can run the entire attention path in FP8, which captures the compute benefit on Hopper but is more sensitive to scaling. The right choice depends on whether your bottleneck is HBM capacity or attention TFLOPS.

## How FP8 compares to the alternatives

Rough intuition for a 70B-class model on H100:

| Format | Weight size | Decode tokens/sec (relative) | Quality (vs BF16) |
|--------|-------------|------------------------------|-------------------|
| BF16 | 140 GB | 1.0x | baseline |
| FP8 | 70 GB | ~1.8x | within 0.3 PPL |
| INT8 (W8A8) | 70 GB | ~1.7x | 0.3 to 0.8 PPL |
| INT4 (AWQ) | 35 GB | ~2.5x | 0.5 to 1.5 PPL |

The FP8 vs INT8 comparison is the interesting one. Their memory footprint is identical and their throughput is similar.
FP8 wins on accuracy in practice because the format absorbs activation outliers without bespoke calibration. INT4 wins on memory and throughput at a noticeable accuracy cost, and is the right choice when you are bandwidth-bound on smaller GPUs. For training, FP8 is the only viable sub-BF16 format right now. INT8 training exists in research but is not production-ready. The training story is BF16 (the conservative default) or FP8 (the throughput win), and the gap between them keeps shrinking as Transformer Engine and the Blackwell formats mature. ## Where FP8 still hurts Two practical pain points. The first is calibration. Activation scales for FP8 inference are usually picked from a calibration set, similar to INT8. Pick a bad calibration set (too short, too narrow in domain, missing the long-tail distributions) and your scales are wrong, your activations clip or underflow, and accuracy drops more than it should. The fix is to use a calibration set that covers the actual production traffic distribution, not just a generic English corpus. The second is kernel coverage. FP8 matmul is well-supported. FP8 attention, especially with sliding-window or paged-attention layouts, is less mature. Most serving stacks fall back to BF16 attention even when the matmuls run in FP8, which limits the speedup to maybe 1.4x to 1.6x rather than the theoretical 2x. Closing that gap is mostly a kernel engineering problem, and it is being closed quickly, but it is not fully there yet for every attention variant in the wild. ## Why this matters for serving For latency-sensitive workloads (voice agents, real-time coding assistance, anything with a sub-second budget per turn), the FP8 throughput advantage is large enough to change the deployment shape. A model that needs two H100s in BF16 often fits on one H100 in FP8 with the same context length. A model that runs at 50 tokens/sec per request in BF16 hits 90 to 100 in FP8. KV cache memory is halved, which means more concurrent users per GPU. 
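To see what halving the KV cache means for concurrency, the arithmetic can be sketched directly. The model shape below (80 layers, 8 KV heads, head dimension 128) is an assumption resembling a 70B-class model with grouped-query attention, and the 20 GiB KV budget and 8,192-token context are purely illustrative. `kv_bytes_per_token` and `max_concurrent_requests` are hypothetical helpers, not framework APIs, and the small overhead of storing scale factors is ignored.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache per token: one K and one V vector per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_concurrent_requests(kv_budget_bytes, context_len, per_token_bytes):
    """How many full-length requests fit in the KV budget."""
    return kv_budget_bytes // (context_len * per_token_bytes)


LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128   # assumed 70B-class shape with GQA
KV_BUDGET = 20 * 1024**3                  # assumed 20 GiB left over for KV cache
CONTEXT = 8192                            # tokens per request

bf16 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_value=2)
fp8 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_value=1)

bf16_slots = max_concurrent_requests(KV_BUDGET, CONTEXT, bf16)
fp8_slots = max_concurrent_requests(KV_BUDGET, CONTEXT, fp8)

print(bf16, fp8)              # bytes of KV cache per token
print(bf16_slots, fp8_slots)  # full-context requests that fit in the budget
```

Under these assumptions the per-token KV footprint drops from 320 KiB to 160 KiB, and the same budget holds 16 full-context requests instead of 8. The real gain depends on how much memory is left after weights and activations, which is why FP8 weights and an FP8 KV cache compound.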
The net effect is that the cost-per-token at a given latency target drops by roughly 40 to 50% with full FP8 over BF16, with quality differences that fall inside benchmark noise on most chat and coding tasks. That is a much better deal than INT8 typically delivers, and it does not require the calibration headaches of INT4. If you are serving an open-weights model in production today, FP8 is the precision you should test against your actual workload before you reach for anything more aggressive. It is the boring answer that happens to be correct most of the time. If you want to see what well-tuned FP8 inference looks like end-to-end, [General Compute](https://generalcompute.com) runs models in FP8 by default on hardware tuned for low-latency serving. The OpenAI-compatible API gives you a fast path to compare against your current setup without rewriting your stack. --- # Activation-Aware Quantization (AWQ) Deep Dive > A close look at how AWQ picks salient weight channels, applies per-channel scaling, and why it consistently beats round-to-nearest 4-bit quantization for LLM inference. - Author: General Compute - Published: 2026-04-28 - Tags: quantization, awq, inference, llm, optimization - Canonical: https://www.generalcompute.com/blog/activation-aware-quantization-awq-deep-dive --- Most quantization writeups stop at "we round the weights to 4 bits and the model still works." That is fine as a marketing line, but it hides the part that actually matters: which weights you keep at higher precision, how you choose them, and why a small amount of per-channel scaling can recover almost all of the lost accuracy. AWQ, short for Activation-aware Weight Quantization, is built around that question. This post goes through the method in detail, including the math, the calibration step, the kernel implications, and the places where AWQ behaves better or worse than the alternatives. 
If you only know AWQ as "the format my GGUF or vLLM model uses," this should fill in the parts in between. ## The setup: why naive 4-bit quantization fails A linear layer in a transformer computes `Y = X W`, where `X` has shape `[batch * seq, in_features]` and `W` has shape `[in_features, out_features]`. Quantizing the weights means replacing `W` with a low-bit approximation `W_q` such that `W ≈ s * W_q` for some scaling factor `s`. With 4-bit integers, you have 16 possible values per weight, and the scale lets you cover a useful range. Round-to-nearest (RTN) is the simplest version. For each output channel (or group of channels), you find the maximum absolute value, divide by 7 (for a signed 4-bit range of -8 to 7), and round. It works well for small models. It falls apart for large ones, especially after roughly 7B parameters, because of how the activations look. The activations going into a transformer linear layer are not uniformly distributed. A small fraction of input channels carry outlier values that are 10 to 100 times larger than the rest. These outlier channels dominate the layer output. If you treat all weight channels the same way during quantization, you compress the salient ones and the unimportant ones with equal aggression, and the salient ones lose more in absolute terms because they were doing more work. The earlier fix for this was GPTQ, which uses a second-order error correction loop based on the Hessian of the layer's reconstruction loss. GPTQ is good. It is also slow to calibrate, hard to debug, and tightly coupled to the order in which you process columns. AWQ takes a different and simpler route. ## The AWQ insight The AWQ paper from MIT and SJTU starts with a small experiment. Take a quantized LLaMA model. Identify the top 1 percent of weight channels by activation magnitude. Keep those at FP16 and quantize the rest to 4 bits. The perplexity gap to the full FP16 model almost disappears. 
Keep the top 0.1 percent and you still recover most of the loss.

The implication: not all weight channels matter equally, and the ones that matter are exactly the ones whose corresponding input activations are large. That is the activation awareness in the name. The signal that tells you which weights to protect lives in the activations, not in the weight magnitudes themselves.

You could just keep those channels in FP16. That works, but mixed-precision storage is annoying. The kernels are messier, the memory layout is weird, and you lose some of the throughput advantage of pure INT4. AWQ avoids that by doing something cleaner: instead of keeping salient channels at higher precision, it scales them up before quantization and scales the corresponding input channels down at inference time. Mathematically the layer output is unchanged, but in the quantized representation those salient weights now carry a smaller relative rounding error, because they are larger compared with the quantization step.

## The math, more carefully

Consider a single input channel `i` going into a weight matrix `W`. Divide that input by a scale `s_i > 1`, and multiply the corresponding row of `W` by `s_i`. The product `X W` is unchanged:

```
Y = (X * diag(1/s)) * (diag(s) * W)
```

Now quantize `diag(s) * W` to INT4 instead of `W`. The salient rows of `W`, the ones aligned with the channels carrying large activations, have been multiplied by a value greater than 1. As long as the scaling does not change the maximum of a quantization group much, the per-group step size stays about the same, so the absolute rounding error on a salient weight is unchanged. Once the matching activation is divided by `s_i`, that error's contribution to the layer output shrinks by roughly a factor of `s_i`.

At inference time, `X * diag(1/s)` is just a per-channel multiplication on the input side, which is cheap and can be folded into the previous LayerNorm or absorbed into the previous projection. The quantized weights are stored as INT4 plus a per-group scale and zero point, exactly the same format you would use for plain RTN.
There is no mixed precision in storage, no special outlier matrix on the side, no second-order error solver. The kernel can be a vanilla INT4 matmul.

## Choosing the scales

The interesting part is picking `s`. Too aggressive and you blow up the dynamic range of unimportant channels and quantize them poorly. Too conservative and you do not protect the salient ones enough. AWQ frames this as a small grid search over a single scalar.

The procedure:

1. Run a calibration set of around 128 samples through the model and collect the average per-channel magnitude `a_i` of activations going into the layer.
2. Define a per-channel scale `s_i = a_i^alpha`, where `alpha` is a single hyperparameter shared across the layer.
3. For a grid of `alpha` values in `[0, 1]`, perform the equivalent transformation, quantize the weights, and measure reconstruction error against the FP16 layer output.
4. Pick the `alpha` that minimizes the reconstruction loss.

A typical search uses 20 values of alpha. The whole search runs in seconds per layer because each iteration is just a matmul and a quantization pass, no gradients. There is no loop over weight columns, no Hessian, and no per-tensor optimization. That is why AWQ calibrates in minutes for a 70B model where GPTQ takes hours.

The choice of `alpha` matters more than people sometimes realize. With alpha = 0, every channel gets a scale of 1, which is plain RTN. With alpha = 1, the scales follow the activation magnitudes directly, which over-protects outlier channels and crushes everyone else. The sweet spot is usually somewhere between 0.5 and 0.8 depending on the layer.

## Group sizes, zero points, and the practical layout

AWQ in practice uses group quantization. A group is a contiguous set of weights along the input dimension, typically 64 or 128 elements wide, that share a single scale and zero point. Group quantization is a compromise between per-channel (best accuracy, more metadata) and per-tensor (least metadata, worst accuracy).
At group size 128, a 7B model carries roughly 0.15 to 0.2 GB of metadata on top of the 3.5 GB of INT4 weights (one FP16 scale plus a zero point per 128 weights, with the exact figure depending on how the zero points are stored), which is fine.

The quantization is asymmetric. AWQ stores both a scale and an offset per group, which lets it represent distributions that are not centered around zero. This matters more than you would expect for FFN layers, where the weights of the up and gate projections often have a noticeably skewed distribution.

The bit-packing layout is interleaved to match the access pattern of common INT4 matmul kernels. Two 4-bit values are packed into one byte, but the order is shuffled so that a single 32-bit load can fetch eight values that get processed together. This is why you cannot just dump AWQ weights into an arbitrary INT4 kernel; you need a kernel that knows the packing convention. The original `llm-awq` repo ships kernels in CUDA, and vLLM, TGI, and TensorRT-LLM have all adopted compatible variants.

## How AWQ compares to GPTQ

The two methods solve the same problem and end up at similar accuracy on most benchmarks. The differences are mostly operational.

Calibration speed. AWQ is roughly 5 to 20 times faster to calibrate than GPTQ for the same model. On a single A100, GPTQ can take 4 to 6 hours for a 70B model. AWQ finishes in 20 to 40 minutes.

Memory during calibration. GPTQ needs to materialize the Hessian for each layer, which is a `[in_features, in_features]` matrix in FP32. For a 70B model with `in_features` around 8192, that is 256 MB per layer, plus working memory for the inverse. AWQ only needs activation statistics and FP16 layer outputs, which are much smaller.

Robustness. GPTQ is sensitive to the calibration set distribution. If your calibration data does not match the deployment distribution, GPTQ can over-correct on patterns that do not generalize. AWQ is less sensitive because the search space is one-dimensional per layer.

Accuracy ceiling.
On well-tuned 4-bit settings with group size 128, GPTQ and AWQ are within 0.1 perplexity points on most LLaMA-class models. AWQ tends to do better on instruction-tuned models with more skewed activations, GPTQ tends to do better on base models, but both are close. There is a third option, SmoothQuant, which uses a similar input-output rescaling trick but for INT8 activations and INT8 weights. SmoothQuant is what you want for compute-bound INT8 inference. AWQ is what you want for memory-bound INT4 inference, which is the regime almost all decoder workloads sit in. ## When AWQ disappoints A few cases to know about. Models with very long input sequences and unusual activation patterns sometimes break the calibration. If you calibrate on short prompts and serve long prompts, the activation statistics shift, and the chosen scales no longer reflect the deployment regime. Recalibrating on representative long-context samples fixes this. Mixture-of-Experts models are tricky. Each expert has its own activation distribution, and routing means that any individual sample only fires a few experts. Getting reliable per-expert calibration statistics needs a larger and more diverse calibration set. Most serving stacks default to 512 or 1024 samples for MoE models instead of the usual 128. Quantizing the attention projections is more error-prone than quantizing the MLP. The attention output projection in particular often shows higher quantization error because its activations are the result of a softmax-weighted sum and have less structure than MLP activations. Some implementations use a smaller group size, like 32, just for attention layers. Stacking AWQ on top of LoRA-merged weights is fine in principle but you have to do the merge first, then calibrate. Calibrating before merging gives you scales that reflect the base model's activation pattern, not the fine-tuned model's, and you lose accuracy. 
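The scale search described under "Choosing the scales" can be sketched in plain Python. This is a minimal illustration, not the `llm-awq` implementation: it follows the AWQ paper's convention of multiplying weight rows by `s_i` and dividing the matching input channels by `s_i`, uses a toy 16x8 layer with two artificially inflated input channels, and fake-quantizes (quantize then dequantize) instead of packing real INT4. All function names here are hypothetical.

```python
import random


def rtn_group(values, bits=4):
    """Asymmetric round-to-nearest on one group; returns dequantized floats."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1              # 15 steps for 4 bits
    scale = (hi - lo) / levels or 1.0     # guard against constant groups
    zero = round(-lo / scale)
    return [(min(levels, max(0, round(v / scale) + zero)) - zero) * scale
            for v in values]


def fake_quant(W, group):
    """Quantize W[in][out] per output column, in groups along the input dimension."""
    n_in, n_out = len(W), len(W[0])
    Wq = [[0.0] * n_out for _ in range(n_in)]
    for j in range(n_out):
        for start in range(0, n_in, group):
            rows = range(start, min(start + group, n_in))
            for i, v in zip(rows, rtn_group([W[r][j] for r in rows])):
                Wq[i][j] = v
    return Wq


def matmul(X, W):
    return [[sum(x[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]
            for x in X]


def mse(A, B):
    n = len(A) * len(A[0])
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb)) / n


def awq_search(X, W, group=8, n_grid=20):
    """Grid-search alpha; returns (best_error, best_alpha, best_scales)."""
    n_in, n_out = len(W), len(W[0])
    act = [sum(abs(x[i]) for x in X) / len(X) for i in range(n_in)]  # a_i
    Y = matmul(X, W)                                                 # FP reference
    best = None
    for step in range(n_grid):
        alpha = step / n_grid                       # alpha = 0 is plain RTN
        s = [max(a, 1e-8) ** alpha for a in act]    # s_i = a_i^alpha
        Wq = fake_quant([[W[i][j] * s[i] for j in range(n_out)]
                         for i in range(n_in)], group)
        Xs = [[x[i] / s[i] for i in range(n_in)] for x in X]
        err = mse(matmul(Xs, Wq), Y)
        if best is None or err < best[0]:
            best = (err, alpha, s)
    return best


random.seed(0)
n_in, n_out = 16, 8
W = [[random.gauss(0, 1) for _ in range(n_out)] for _ in range(n_in)]
# Channels 3 and 11 play the role of activation outliers (assumed 30x larger).
X = [[random.gauss(0, 1) * (30.0 if i in (3, 11) else 1.0) for i in range(n_in)]
     for _ in range(32)]

rtn_err = mse(matmul(X, fake_quant(W, 8)), matmul(X, W))
best_err, best_alpha, _ = awq_search(X, W)
print(f"RTN error: {rtn_err:.4f}  AWQ error: {best_err:.4f}  alpha: {best_alpha:.2f}")
```

Because `alpha = 0` is part of the grid, the search can never do worse than RTN on the calibration data. How much it improves depends on how strongly the outlier channels dominate the output. The reference implementation adds details this sketch leaves out, such as normalizing the scales and folding them into the preceding layer.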
## The kernel side

AWQ's value at inference time comes from being a clean INT4 format with no mixed precision. The kernel is the same shape as any other group-wise INT4 matmul: load packed weights, dequantize on the fly into shared memory, do the matmul against FP16 activations.

For decode, where the batch size is small and the workload is memory-bound, AWQ buys you roughly a 3x speedup over FP16 on the linear layers, a little under the roughly 3.8x bandwidth ratio between FP16 (16 bits per weight) and INT4 (about 4.25 bits per weight once scale metadata is included). For prefill, where the workload is compute-bound, the speedup is closer to 1.5x because the kernel still has to materialize FP16 multiplies on the activation side.

You also get the memory footprint reduction, which is often more important than the speedup. A 70B model in FP16 is 140 GB. In AWQ INT4 with group size 128, it is around 38 GB, which fits on a single 48 GB GPU with room for KV cache and a reasonable batch size. That changes the deployment story more than the kernel speedup does.

## Calibration data, briefly

People often ask what to calibrate on. The original AWQ paper uses 128 samples from Pile or C4. In practice, a small mix of representative deployment data works better. If your model serves chat, calibrate on chat. If it serves code, calibrate on code. The activation statistics shift between these regimes, and the scales follow.

The number of samples does not need to be large. The optimization is not learning anything; it is computing per-channel statistics. 128 samples of around 2048 tokens each is enough for stable statistics on a 70B model. Going to 1024 samples helps for MoE.

## Putting it together

AWQ is, in the end, a fairly simple idea wrapped in careful engineering. Find the input channels that carry large activations, scale the corresponding weight rows up before quantization while dividing the matching activations by the same factor, and store the per-group scales next to the INT4 weights. The kernel is plain. The calibration is fast.
The accuracy is competitive with GPTQ at a fraction of the work. For most production 4-bit deployments today, AWQ is the format that ends up in the model weights directory, and it is worth knowing why. If you want to try AWQ on your own workload, the `llm-awq` repo is the reference implementation, and most major inference stacks (vLLM, TGI, TensorRT-LLM) load AWQ checkpoints natively. Calibrate on data that resembles your deployment, pick a group size of 128 unless accuracy says otherwise, and verify perplexity on a held-out slice before shipping. If you are running fast 4-bit inference at scale and want to push throughput further, take a look at General Compute's API for what custom ASIC infrastructure looks like under the same OpenAI-compatible interface you already use. --- # Mamba and State Space Models: Inference Without Attention > How structured state space models like Mamba achieve constant-time per-token inference, and why the selective scan changes the trade-off space for long-context serving. - Author: General Compute - Published: 2026-04-27 - Tags: mamba, state-space-models, inference, architecture, long-context - Canonical: https://www.generalcompute.com/blog/mamba-and-state-space-models-inference-without-attention --- If you have spent any time profiling a transformer at long context, you already know where the time goes. The attention operator scales linearly with sequence length per token, the KV cache grows without bound, and HBM bandwidth becomes the wall you keep hitting. Most of the optimization work in inference systems over the last few years has been about chipping away at this: paged attention, KV compression, sliding windows, prefix caching. Each helps. None of them change the underlying scaling. State space models take a different route. Instead of carrying the full history of keys and values, they compress everything seen so far into a fixed-size hidden state and update it recurrently. 
Mamba is the most prominent example, and it has shown that an SSM-based architecture can match transformers on a wide range of language benchmarks while keeping per-token inference cost constant in sequence length. This post walks through what an SSM actually is, what Mamba changes, and why the math shakes out the way it does at serving time.

## What a State Space Model Is

A state space model is the standard form for any system whose evolution can be described by a hidden state that updates over time. In continuous time it looks like this:

```
h'(t) = A h(t) + B x(t)
y(t)  = C h(t) + D x(t)
```

Here `h(t)` is the hidden state, `x(t)` is the input signal, `y(t)` is the output, and `A`, `B`, `C`, `D` are matrices that parameterize the dynamics. Control engineers and signal processing people have been using this form for decades. The interesting thing for deep learning is that you can discretize it and treat it as a sequence model.

Discretization gives you the recurrent form:

```
h_t = A_bar h_{t-1} + B_bar x_t
y_t = C h_t
```

where `A_bar` and `B_bar` are the discretized versions of `A` and `B` (typically using a zero-order hold or bilinear transform with a step size parameter). At inference, this is just a recurrence: store `h`, get `x_t`, compute `h_t` and `y_t`, move on. The state is fixed size. The per-token compute does not depend on how many tokens came before.

The S4 paper from Albert Gu and collaborators showed that if you parameterize `A` carefully (using a structured form derived from HiPPO theory), an SSM can capture long-range dependencies on synthetic benchmarks better than transformers, while running with linear complexity in sequence length. S4 was the first credible signal that this whole line of work could actually compete on real tasks.

## Why S4 Was Not Enough

S4 and its successors (S4D, S5, GSS) demonstrated the asymptotic benefits but had a clear weakness on language.
The dynamics in those models are linear time-invariant: `A_bar` and `B_bar` do not depend on the input. The same recurrence runs regardless of what token shows up. This is fine for signals where the relevant structure is roughly stationary, but language is not. The model needs to be able to ignore some tokens and pay close attention to others, depending on context. A linear time-invariant SSM cannot do that.

You can see this concretely in tasks like selective copying, where the model has to remember a few specific tokens from a long sequence and ignore the rest. S4 struggles. A transformer with attention can just put weight on the relevant positions. The SSM, with its fixed dynamics, has no mechanism to selectively attend.

Linear attention has the same problem in a different form. The state update is content-independent in the sense that every token contributes the same way to the running summary. The model has no gate to say "remember this" or "forget that". This is part of why fixed-feature linear attention has historically lagged on language benchmarks.

## What Mamba Changes

Mamba (introduced by Gu and Dao in late 2023 and built around a selective SSM layer often called S6) makes the SSM parameters input-dependent. Specifically, `B`, `C`, and the discretization step `Delta` become functions of the current input token:

```
B_t     = Linear_B(x_t)
C_t     = Linear_C(x_t)
Delta_t = softplus(Linear_Delta(x_t))
```

The transition matrix `A` stays as a structured per-channel parameter, but the way each input gets written into the state and the way the state gets read out are now content-aware. This is the selectivity that S4 was missing. The model can effectively decide, per token and per channel, how much to update the state and how much of it to expose.

The trade-off is that input dependence breaks the convolutional view that S4 and friends used for fast parallel training. With time-invariant SSMs, you can express the entire sequence as a long convolution and use FFT-based kernels to train efficiently.
Once `B`, `C`, and `Delta` depend on `x`, the convolutional form is gone. You are back to a recurrence, which on GPUs is bad news unless you are very careful about how you implement it.

The Mamba paper's key engineering contribution is a hardware-aware parallel scan kernel. The recurrence

```
h_t = A_bar_t h_{t-1} + B_bar_t x_t
```

is a chain of affine updates, and composing affine updates is associative, so you can compute it with a parallel scan (a generalization of prefix sum) in `O(log L)` depth for a sequence of length `L` on a parallel machine. The kernel keeps the state in SRAM, fuses the discretization and the scan, and avoids materializing the full sequence of states in HBM. This is structurally similar to what FlashAttention does for attention: the win is not new math, it is hardware-aware execution. With this kernel, Mamba trains at throughput comparable to a similarly-sized transformer.

## Why Inference Is Fast

At inference, you do not need the parallel scan. Generation is one token at a time, so the recurrence runs sequentially: take the current state, compute `B_t`, `C_t`, `Delta_t` from the new token, do one update, emit one output. The cost per token is fixed.

Concretely, for a Mamba layer with hidden dimension `d` and SSM state dimension `n`, each step does roughly:

- Project `x_t` to get the input-dependent parameters: `O(d^2)` or so depending on the parameterization.
- Discretize: cheap, element-wise.
- Update the state: `h_t = A_bar_t * h_{t-1} + B_bar_t * x_t` is a per-channel operation of size `d * n`.
- Read out: `y_t = C_t * h_t` is another `d * n` operation.

Total per-token compute is `O(d * n)` for the SSM part, plus the standard projection costs that any layer has. Memory per layer is `d * n` floats for the state. None of this depends on sequence length.

Compare this to a transformer at position `t` in a long generation. Each layer reads `t` keys and `t` values out of HBM to compute attention. The compute is `O(d * t)` per layer per token, and it grows.
At long contexts, you are bandwidth-bound on the KV cache and the entire decoder stalls waiting for HBM. The Mamba layer just reads the fixed state and moves on. This is the inference shortcut, and it is qualitatively similar to what RWKV achieves with linear attention. The difference is that Mamba's selectivity gives it a way to compete on language benchmarks where pure linear attention has historically come up short. ## Memory and Throughput The constant-state-size property has practical consequences beyond just per-token cost. A 7B-parameter transformer running at 128K context can easily need 10s of gigabytes of KV cache per request. Serving multiple long-context requests in parallel becomes a memory packing problem. Paged attention, prefix sharing, and aggressive eviction strategies exist because the cache is the dominant resource. A Mamba model of similar size at the same context has a fixed state per layer per request, on the order of a few megabytes total. You can pack many more concurrent long-context requests into the same GPU. Throughput on long-context workloads ends up being substantially better, not just because per-token compute is lower, but because you stop being memory-pressured. This also matters for streaming. A voice agent or transcription system that runs for hours needs a way to keep up without context management heroics. Mamba's state simply is the history, compressed. There is no eviction policy to design, no chunking strategy, no summarization step. The model accumulates indefinitely, and per-token latency stays flat. ## Where SSMs Still Lag Mamba is competitive with transformers on perplexity at the scales it has been trained at, and it generally wins on long-range tasks where the asymptotic advantage shows up. It is not a strict superset of attention, though, and the gap is real on a few specific things. **Exact recall over long contexts is harder.** Compressing the entire history into a fixed-size state forces lossy summarization. 
Standard attention can pull any prior token verbatim, because the KV cache stores them explicitly. Mamba cannot, in general. Needle-in-a-haystack tests and tasks that require pinpoint retrieval from a long passage are where this shows up most clearly. Recent SSM variants and hybrid architectures have made progress here, but the underlying tension is structural. **In-context learning patterns can be different.** Some of the tricks that work well with attention (looking up exemplars, copying spans, doing precise multi-hop reasoning across a prompt) lean on the same lookup capability. Mamba can simulate these to a degree, but the inductive biases are different and prompts that were tuned for transformers do not always port cleanly. **Tooling is less mature.** The transformer ecosystem has years of optimized serving stacks, quantization recipes, fine-tuning libraries, and adapter frameworks. SSMs are catching up, but if you want to run a Mamba model in production today, expect to do more work than you would with a comparable Llama checkpoint. ## Hybrids Are Probably the Right Answer A growing line of work interleaves a small number of attention layers with many SSM or linear-attention layers. Jamba, Zamba, Samba, and the various Mamba+attention designs all share this idea. The intuition is that attention is good at exact recall and selective lookup, SSMs are good at cheap long-range mixing, and you want a small dose of the first inside a stack that is otherwise the second. Empirically, these hybrids tend to keep most of the inference speed advantage of pure SSMs while closing the recall gap that pure SSMs leave. They also fit well with existing serving infrastructure, since the attention layers can use standard KV cache machinery while the SSM layers ride alongside with their fixed states. For production workloads at long context, the ratio of attention to SSM layers becomes a real tuning knob, and the right ratio depends on what your application actually needs. 
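For the SSM layers in such a stack, the per-token recurrence described under "Why Inference Is Fast" can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the fused kernel from the Mamba codebase: `mamba_step` and the weight names are hypothetical, the `Delta` projection is a dense `d x d` matrix, `A` is discretized with an exponential (zero-order hold), and `B` uses the simpler `Delta * B` approximation.

```python
import math
import random


def softplus(z):
    # Numerically stable log(1 + exp(z)); keeps the step size positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def mamba_step(h, x, A, W_B, W_C, W_dt):
    """One decode step of a selective SSM.

    h: [d][n] state (updated in place), x: [d] input for this token,
    A: [d][n] negative per-channel dynamics, W_*: projection weights.
    """
    d, n = len(h), len(h[0])
    B = [sum(W_B[k][c] * x[c] for c in range(d)) for k in range(n)]    # B_t
    C = [sum(W_C[k][c] * x[c] for c in range(d)) for k in range(n)]    # C_t
    dt = [softplus(sum(W_dt[c][j] * x[j] for j in range(d)))           # Delta_t
          for c in range(d)]
    y = []
    for c in range(d):
        for k in range(n):
            a_bar = math.exp(dt[c] * A[c][k])   # discretized A, in (0, 1)
            b_bar = dt[c] * B[k]                # simplified discretization of B
            h[c][k] = a_bar * h[c][k] + b_bar * x[c]
        y.append(sum(C[k] * h[c][k] for k in range(n)))                # readout
    return y


random.seed(0)
d, n = 8, 4  # toy sizes, chosen only for illustration
A = [[-(k + 1.0) for k in range(n)] for _ in range(d)]
W_B = [[random.gauss(0, 0.3) for _ in range(d)] for _ in range(n)]
W_C = [[random.gauss(0, 0.3) for _ in range(d)] for _ in range(n)]
W_dt = [[random.gauss(0, 0.3) for _ in range(d)] for _ in range(d)]
h = [[0.0] * n for _ in range(d)]

for _ in range(1000):  # 1000 tokens, same cost and same memory each step
    x = [random.gauss(0, 1) for _ in range(d)]
    y = mamba_step(h, x, A, W_B, W_C, W_dt)

state_floats = sum(len(row) for row in h)
print(f"state size after 1000 tokens: {state_floats} floats")
```

The state holds `d * n` floats before the first token and after the thousandth, which is the property that lets SSM layers sit next to attention layers without adding to KV cache pressure.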
## When to Reach For an SSM The case for a state space backbone is clearest when sequence length matters more than peak benchmark accuracy on retrieval-heavy tasks. Voice agents, real-time transcription, document processing pipelines, and on-device assistants are all natural fits. The constant memory and constant per-token compute change what is feasible: workloads that are economically painful with full attention become routine. For chat applications with bounded context windows, the math is less compelling. A 32K-token coding session is not where the asymptotic advantage shows up, and the transformer ecosystem is more mature. The interesting decisions are for new products where context length is in the design space, or for serving infrastructure that needs to handle very long requests at scale. If you want to benchmark SSM models against transformer baselines on your own workloads, or test fast inference for either architecture, the General Compute API supports a range of open models and is built for the latency-bound applications where the architecture choice actually matters. Documentation and a sandbox are at generalcompute.com. --- # RWKV and Linear Attention: Recurrent Models as an Inference Shortcut > How RWKV and linear attention architectures collapse the per-token cost of generation to O(1), and what that means for serving long-context workloads. - Author: General Compute - Published: 2026-04-26 - Tags: rwkv, linear-attention, inference, architecture, long-context - Canonical: https://www.generalcompute.com/blog/rwkv-and-linear-attention-recurrent-models-as-an-inference-shortcut --- The dominant cost of running a transformer at inference time is not the matrix multiplications inside each layer. It is the attention mechanism itself, which has to look back at every previous token before producing the next one. Generate the 10,000th token and the model touches 10,000 keys and 10,000 values. Generate the 100,000th and it touches 100,000. 
The KV cache grows linearly with sequence length, and the per-token compute does the same. Long contexts get expensive in both memory and time, and the cost is not amortized: every new token pays the full price. Linear attention and recurrent architectures like RWKV try to flip that. Instead of carrying around the full history of keys and values, they compress everything seen so far into a fixed-size state. Generation becomes a constant-time update of that state. No matter how long the context, producing the next token costs the same. That is the inference shortcut. This post walks through why standard attention scales the way it does, what linear attention changes mathematically, how RWKV adapts the idea into something that trains and runs well in practice, and where the trade-offs land when you actually deploy these models. ## Why Standard Attention Is O(n) Per Token A transformer decoder layer does roughly the following at each generation step. The new token's query vector is compared against the keys of every prior token, the resulting scores are softmaxed, and those weights are applied to the corresponding values. Mathematically, for a query q at position t and keys K and values V from positions 1 through t: ``` output_t = softmax(q_t @ K^T / sqrt(d)) @ V ``` The softmax over q_t @ K^T is what makes attention non-linear. It also forces the model to keep all of K and V around, because softmax depends on the maximum and sum of the scores across all positions. You cannot incrementally update a softmax without seeing the new query. The KV cache is the standard optimization: store K and V for the prompt and all generated tokens so you do not recompute them, then append a new row each step. Memory grows with sequence length, and each decode step still does an O(t) dot product across the cache. For a 128K-token context, every new token reads 128,000 key vectors and 128,000 value vectors out of HBM. 
That bandwidth is the bottleneck on most modern accelerators, not the floating-point math. This is fine when contexts are short. When you start running agents that maintain long histories, voice systems that keep transcripts, or document workflows on full books, the cache pressure becomes the dominant concern. ## The Linear Attention Reformulation Linear attention starts from a small algebraic trick. The softmax in standard attention is the only thing that prevents you from rearranging the computation. If you replace it with something that factors, you can rewrite attention as a recurrence. Specifically, write attention as a sum of similarities: ``` output_t = sum_{i<=t} sim(q_t, k_i) * v_i / sum_{i<=t} sim(q_t, k_i) ``` In standard attention, sim is `exp(q . k / sqrt(d))`. The exponential does not factor across q and k, so you cannot pull q out of the sum. But if you pick sim(q, k) = phi(q) . phi(k) for some feature map phi (for instance, the elu+1 function from the original linear attention paper), then by associativity: ``` sum_i phi(q_t) . phi(k_i) * v_i = phi(q_t) . (sum_i phi(k_i) * v_i^T) ``` The right-hand side is a vector-matrix product where the matrix only depends on the history, not on q_t. Call that matrix S_t. Now S_t can be updated incrementally: ``` S_t = S_{t-1} + phi(k_t) * v_t^T ``` And generation becomes: ``` output_t = phi(q_t) . S_t / (phi(q_t) . z_t) ``` where z_t is a similar running sum used for the normalizer. The state S_t has shape (d_key x d_value), constant in t. Each token does O(d^2) work to update the state and produce the output, independent of how many tokens came before. That is the shortcut. Generation is now O(1) per token in sequence length, and the memory footprint is one fixed-size matrix per layer per head, not a growing cache. The catch is quality. Linear attention with a fixed feature map underperforms full softmax attention on most language tasks. 
The feature maps studied in the original work are simple, and they cannot represent the sharp, content-dependent attention patterns that softmax produces. You get speed, you lose expressiveness.

## RWKV: A Practical Recurrent Hybrid

RWKV (Receptance Weighted Key Value) is the most prominent attempt to take this idea and make it work at scale. The architecture lineage now spans several versions (RWKV-4, 5, 6, 7), and each release has narrowed the quality gap with transformers while keeping the constant-time inference property.

The core idea in RWKV is to combine linear-attention-style state updates with a learned time-mixing mechanism. Instead of a pure exponential decay or a fixed feature map, RWKV uses time-decay weights that the model learns per channel. The state update looks roughly like:

```
state_t  = exp(-w) * state_{t-1} + k_t * v_t
output_t = receptance_t * (state_t / norm_t)
```

where w is a learnable channel-wise decay and receptance is a sigmoid gate that decides how much of the state to expose at each step. Different channels can decay at different rates, so some attend long-range and others act more locally. The receptance gate gives the model a way to suppress or amplify the state contribution token by token.

RWKV-5 added matrix-valued states (similar to multi-head linear attention), and RWKV-6 added data-dependent decays, where the decay weights are themselves a function of the input rather than a fixed learned parameter. RWKV-7 went further with delta-rule-style updates that allow the state to overwrite as well as accumulate. Each step pulls the architecture closer to what attention can express, while keeping the recurrent form.

The training story is the part that makes RWKV interesting beyond the pure linear attention papers. RWKV is mathematically equivalent to a recurrent network at inference, but the time-decay structure also allows it to be trained in parallel like a transformer.
You unroll the recurrence into a parallel form, run it through a CUDA kernel that exploits the structure, and get something close to transformer training throughput. That dual representation, recurrent at inference and parallel at training, is what lets the architecture compete on both axes. ## What Constant-Time Inference Actually Buys The headline benefit is obvious: long contexts are cheap. A 1M-token context generation with a transformer would require an enormous KV cache and prohibitive bandwidth per token. With an RWKV model, the per-token cost at position 1,000,000 is the same as the cost at position 100. Memory per layer is fixed, so VRAM usage does not blow up. That changes a few things in practice. **Streaming workloads become natural.** A voice agent or transcription system that runs for hours can keep accumulating state without a cache management strategy. There is no need to evict old tokens, summarize history, or chunk the context. The state is the history, compressed. **Edge and on-device inference gets easier.** Constant memory means you can ship a small RWKV model to a device and let it run indefinitely without worrying about OOMs from a growing cache. This is part of why RWKV has shown up in mobile and embedded AI projects. **Batching is more predictable.** With transformers, mixing requests of different lengths in a batch creates ragged compute and complicated scheduling. With RWKV, every request does the same fixed amount of work per step regardless of how long it has been running, which makes scheduling and capacity planning simpler. **Cache management goes away.** Prefix caching, paged attention, sliding windows, and similar techniques exist because KV caches are awkward shared resources. None of them are needed for a recurrent model. The state is just per-stream local memory. ## The Trade-offs Linear and recurrent models do not match transformers on every benchmark, and the gap is real if subtle. A few things to keep in mind. 
**Recall over very long contexts is harder.** Compressing all of history into a fixed-size state means information has to be aggressively summarized as it passes through. Standard attention can pull any token from the past with full fidelity. Recurrent models cannot. This shows up most clearly in needle-in-a-haystack tests and exact-recall tasks, where transformer architectures still tend to win on raw accuracy. Recent RWKV versions and other state-space models have closed a lot of this gap, but it is still a real consideration for tasks that require pinpoint retrieval from long histories. **Training a competitive recurrent model requires care.** The parallel training kernels for RWKV are non-trivial, and getting the time-decay parameterization right has taken multiple architecture revisions. This is a less mature ecosystem than the standard transformer one, which means fewer pretrained checkpoints, fewer mature serving stacks, and more rough edges in tooling. **Determinism in long-context behavior is different.** Because the state is a learned compression, two slightly different histories can converge to similar states, and small changes upstream can have larger downstream effects than in attention-based models. This is mostly a curiosity, but it matters for some applications where you want strict reproducibility of long sessions. **Hybrid architectures are gaining ground.** A growing class of models (Jamba, Zamba, the various Mamba+attention hybrids) interleave a small number of full-attention layers with many state-space or linear-attention layers. These hybrids try to keep the cheap recurrent compute for most of the model while preserving exact-recall capability where it matters. For many production workloads, this is probably where things end up: not pure RWKV, not pure transformer, but a careful mix. 
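To make the fixed-size state behind these trade-offs concrete, here is a minimal sketch of the per-channel recurrence written earlier (`state_t = exp(-w) * state_{t-1} + k_t * v_t`). It is a simplified toy, not the actual RWKV kernel: the `exp(k)` weighting, the running normalizer, the scalar-per-channel state, and every name in the code are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class RecurrentMixer:
    """Toy RWKV-style time mixing with a constant-size state (illustrative only)."""

    def __init__(self, decay):
        # decay[c] is the non-negative per-channel w; exp(-w) scales the old state.
        self.keep = [math.exp(-w) for w in decay]
        self.num = [0.0] * len(decay)  # decayed sum of weight * value
        self.den = [0.0] * len(decay)  # decayed sum of weights (plays the role of norm_t)

    def step(self, r, k, v):
        """Consume one token's per-channel (r, k, v) and return the gated output."""
        out = []
        for c, keep in enumerate(self.keep):
            weight = math.exp(k[c])  # assumption: positive weight via exp(k)
            self.num[c] = keep * self.num[c] + weight * v[c]
            self.den[c] = keep * self.den[c] + weight
            out.append(sigmoid(r[c]) * self.num[c] / self.den[c])
        return out

    def state_size(self):
        # Number of floats carried between steps; independent of tokens seen.
        return len(self.num) + len(self.den)


# Channel 0 never decays (w = 0); channel 1 forgets quickly (w = 5).
mixer = RecurrentMixer(decay=[0.0, 5.0])
size_before = mixer.state_size()
for t in range(1000):
    out = mixer.step(r=[0.0, 0.0], k=[0.0, 0.0], v=[float(t), float(t)])
size_after = mixer.state_size()
```

After 1,000 tokens the mixer still stores four floats. Channel 0 reports a gated running mean of everything it has seen, while channel 1 tracks almost only the latest value, which is the long-range versus local split between channels. Neither channel can return an arbitrary earlier value exactly, which is the recall limitation in miniature.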
## When to Reach For a Recurrent Model

If your workload involves long sequences, streaming inference, or strict memory constraints, a linear-attention or RWKV-style model is worth a hard look. Voice agents, document workers that scan large corpora, on-device assistants, and any application where the per-step cost matters more than per-token quality on retrieval-heavy tasks are good candidates.

If you are running a chat application with bounded context or a coding agent where the working set fits comfortably in a 32K window, a standard transformer is probably still the right choice. The quality margin matters more than the asymptotic compute savings at those scales, and the tooling is more mature.

The interesting case is when you have flexibility in the architecture choice for a new product. The constant-time property changes what is feasible: workloads that were uneconomical with full attention become routine with a recurrent backbone. That is worth thinking about before you commit to scaling out a transformer-only stack.

If you want to test fast inference for these architectures or compare them against transformer baselines on your own workloads, the General Compute API supports a range of open models and is built specifically for the kinds of latency-bound applications where the choice of architecture starts to matter. Documentation and a sandbox are at generalcompute.com.

---

# Dynamic Batching Strategies: From Naive to Continuous to Iteration-Level

> Batching is the lever that turns idle GPU silicon into served tokens. This post walks through the evolution of batching for LLM serving, from one-at-a-time to static batches to request-level dynamic batching to iteration-level continuous batching, and shows where each strategy still leaves throughput on the floor.
- Author: General Compute
- Published: 2026-04-25
- Tags: inference, batching, serving, scheduling, throughput
- Canonical: https://www.generalcompute.com/blog/dynamic-batching-strategies-from-naive-to-continuous-to-iteration-level

---

The arithmetic of GPU inference is unforgiving. A single decode step on a 70B model uses a tiny fraction of the device's tensor cores, because the bottleneck is loading weights from HBM, not multiplying matrices. If you only serve one request at a time, the GPU spends most of its life waiting on memory. Batching is how you amortize that memory cost over many requests, and the way you batch determines whether your serving stack pushes 100 tokens per second or 5,000.

LLM serving has been through several batching regimes over the last five years. Each one fixed a specific failure of the previous one, and each one left a different residue of inefficiency for the next paper to clean up. This post walks through that evolution, from the simplest case (no batching at all) through static batching, dynamic request-level batching, and finally iteration-level continuous batching. Along the way I will note where memory management, prefill scheduling, and prefix sharing intersect with the batching question, because in practice you cannot reason about one without the others.

## Why Batching Matters for LLMs Specifically

Most ML inference systems before LLMs were compute-bound. A vision model running on a single image saturates the tensor cores within a few milliseconds. Batching gives you better throughput, but the marginal gain per added request is bounded because compute eventually saturates.

Decode in an autoregressive LLM is different. Each forward pass loads every parameter (or every active expert in an MoE) from HBM to do a matrix-vector multiply. The arithmetic intensity is something like one floating point operation per byte loaded, far below the roofline crossover for any modern accelerator.
This means that for a long stretch of batch sizes, adding requests is nearly free in latency terms while throughput grows almost linearly with batch size. On an H100 serving Llama 3 70B, going from batch size 1 to batch size 32 typically increases per-token latency by less than 30 percent while delivering roughly 25x more tokens per second.

Prefill is closer to compute-bound, since the attention and projection passes operate on the entire prompt at once. But even there, batching still helps until you saturate the device, and the question of how to co-schedule prefill (compute-bound) and decode (memory-bound) on the same accelerator is one of the hardest scheduling problems in the system.

So batching is not optional for LLM serving. The only questions are how you form the batches, when you mutate them, and how you handle the variable-length nature of generation.

## Stage 0: No Batching

The simplest possible inference server processes requests one at a time. A request comes in, the server runs prefill on the prompt, runs decode until the model emits an end-of-sequence token or hits the max length, and returns the completion. The next request starts only when the previous one finishes.

This is the right design for some uses (single-tenant local inference, latency-sensitive demos with no concurrency), and it is the wrong design for almost any production serving workload. On a 70B model, a single-request server will hit somewhere around 50 to 80 tokens per second on an H100, and the device utilization sits in the low single digits. You are paying for the whole GPU and using a sliver of its capacity.

The natural response is to start grouping requests together.

## Stage 1: Static Batching

The first improvement is to wait until you have a batch of requests, run them through the model together, and return them all at once. This is how most pre-LLM serving systems handled batching.
TensorFlow Serving, TorchServe, and Triton all support a server-side batcher that collects N requests within a time window and processes them as one tensor. For fixed-length classification or ranking, this works well. The batch shape is regular: every request has the same input size (or you pad to the longest), every request takes the same number of forward passes (one), and every request returns at the same time. Throughput scales close to linearly until you hit the compute roofline. For LLM generation, it falls apart fast. The two big problems are: **Variable output length.** Static batching forces every request in the batch to run for the same number of decode steps. If one request finishes after 10 tokens and another needs 500, the 10-token request is "done" but cannot leave the batch. The server keeps generating padding tokens for it (or just keeps it pinned in GPU memory and ignores its outputs) until the longest request completes. That is hundreds of wasted forward passes for the short request, and the user sees their reply latency stretched to match the slowest sibling in the batch. **Variable prompt length.** Static batching also wants uniform input shapes. The standard fix is right-padding to the longest prompt and masking out the pad tokens during attention. The compute is not actually saved, since the model still runs on the padded sequence. For mixed workloads where some prompts are 50 tokens and some are 5,000, padding wastes a large fraction of the prefill budget. Static batching is dead for LLM serving. No production system uses it as the primary batching strategy today. The reason is not that batching is a bad idea, but that the granularity of "the whole batch finishes together" is wrong for autoregressive generation. ## Stage 2: Dynamic Batching at the Request Level Dynamic batching, in the classic Triton or TF Serving sense, refers to a server-side batcher that forms batches opportunistically. 
Requests arrive at irregular intervals, and the batcher waits up to some max delay (often 1 to 10 ms) to collect a batch of up to some max size. When either the size or the time threshold is hit, it dispatches the batch.

This solves the request-arrival pattern problem. You no longer need a queue depth of N before the server does any work; you just wait briefly for additional requests to show up. For workloads with steady traffic, dynamic batching keeps the GPU near its target batch size most of the time. For bursty workloads, it bounds the latency cost of forming a batch.

The catch, for LLMs, is that dynamic batching as classically defined still suffers from the static-batching tail problem once the batch is dispatched. The decision of which requests are in the batch is made once, at dispatch time, and then those requests run together until the slowest one finishes. So you can think of dynamic batching as "static batching with smarter batch formation." It improves the average batch size, and it does nothing about head-of-line blocking inside the batch.

Some systems extended dynamic batching to handle variable output length by terminating the batch after K decode steps and re-batching the survivors with new arrivals. This helps, but the resync points are expensive: every K steps, the server pauses, runs scheduling logic, and rebuilds the batch tensors. If K is small you pay scheduling overhead constantly, and if K is large you reintroduce most of the original blocking problem.

## Stage 3: Continuous Batching (Iteration-Level Scheduling)

The Orca paper, published at OSDI 2022, proposed iteration-level scheduling. The scheduler operates on the granularity of a single decode step, not a whole request. At every iteration, the server:

1. Runs one forward pass on whatever set of requests is currently active.
2. Removes any request that emitted EOS or hit max length.
3. Adds any waiting request from the queue, provided the KV cache has room.
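The three-step loop above can be simulated in a few lines. The sketch below is a toy model, not Orca's or vLLM's scheduler: it assumes every request is already queued at time zero, measures time in decode iterations, ignores prefill, and stands in for KV cache capacity with a fixed slot count (`max_active`). The function names are made up for the example.

```python
from collections import deque


def static_batching(output_lens, batch_size):
    """Finish time per request when each batch runs until its longest member ends."""
    finish = [0] * len(output_lens)
    clock = 0
    for start in range(0, len(output_lens), batch_size):
        batch = range(start, min(start + batch_size, len(output_lens)))
        clock += max(output_lens[i] for i in batch)
        for i in batch:
            finish[i] = clock  # everyone waits for the slowest sibling
    return finish


def continuous_batching(output_lens, max_active):
    """Finish time per request under iteration-level scheduling."""
    finish = [0] * len(output_lens)
    remaining = list(output_lens)
    waiting = deque(range(len(output_lens)))
    active = []
    while waiting and len(active) < max_active:
        active.append(waiting.popleft())
    clock = 0
    while active:
        clock += 1
        for i in active:                      # 1. one forward pass for the active set
            remaining[i] -= 1
        for i in active:                      # 2. retire requests that just finished
            if remaining[i] == 0:
                finish[i] = clock
        active = [i for i in active if remaining[i] > 0]
        while waiting and len(active) < max_active:
            active.append(waiting.popleft())  # 3. admit waiters into free slots
    return finish


lens = [10, 500, 10, 10, 10, 10, 10, 10]  # one long request among short ones
static_finish = static_batching(lens, batch_size=4)
continuous_finish = continuous_batching(lens, max_active=4)
mean_static = sum(static_finish) / len(lens)
mean_continuous = sum(continuous_finish) / len(lens)
```

In this toy run the short requests finish after 10 to 30 iterations under iteration-level scheduling, versus 500 or more when they are pinned to a static batch that contains the 500-token request. Mean completion time drops from 505 iterations to 77.5, even though the total number of iterations barely changes.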
The batch composition can change every single step. A request that arrives mid-generation does not have to wait for the current batch to complete; it joins on the next iteration. A short request leaves the moment it finishes and frees its slot for the next waiter. There is no head-of-line blocking, because there is no fixed batch to block in. This was a large practical improvement. The Orca paper reported 36.9x throughput over NVIDIA FasterTransformer on GPT-3 175B at the same latency target. The improvement is not because the model is doing anything different per step; it is because the GPU is doing useful work on a full batch every step instead of a shrinking batch. A few details matter when you actually build this: **Selective batching.** Most operations in a transformer batch trivially across requests, but attention does not, because each request has its own KV cache with its own length. Orca handles this by batching the linear projections (Q, K, V, the output projection, the FFN) across all requests, then unrolling the attention computation per request. PagedAttention and FlashAttention later replaced the unrolled attention with kernels that handle ragged sequences directly, which is faster but follows the same logical split. **Prefill versus decode.** Prefill on a 4,000-token prompt is very different from decode on a single token. Naively mixing them in the same forward pass either underuses the GPU during decode-only iterations or stalls all the decode requests during a prefill iteration. The standard fix today is chunked prefill: break the prefill into chunks of K tokens and interleave them with decode steps. The Sarathi-Serve paper showed that this approach keeps both phases productive without dedicated prefill workers. **KV cache memory.** The number of requests you can hold in the active batch is bounded by KV cache memory, not by compute. 
PagedAttention, introduced in vLLM, allocates KV cache in fixed-size blocks rather than per-request contiguous regions, which lets the scheduler hold more concurrent requests at the cost of an indirection on every attention read. Without paged memory, fragmentation alone caps your effective batch size well below what compute could support. ## What Iteration-Level Batching Still Misses Continuous batching is not the end of the story. Several follow-on techniques target the inefficiencies it leaves behind. **Prefill and decode have different bottlenecks.** Even with chunked prefill, running both phases on the same accelerator is a compromise. Disaggregated serving (Splitwise, DistServe) uses separate machines for prefill and decode, sized differently and connected by a fast KV transfer link. The decode machines run a continuous-batched scheduler, and the prefill machines run their own batcher tuned for compute throughput. Throughput is higher because each pool runs at its preferred operating point, at the cost of more complex deployment and KV transfer overhead. **Shared prefixes are wasted.** When many requests share a system prompt or a long retrieved context, continuous batching still runs prefill independently for each one. Prefix caching (RadixAttention in SGLang, vLLM's prefix cache) deduplicates the KV cache for shared prefixes. This is orthogonal to batching but compounds with it: prefix caching reduces the effective prefill cost, which lets the scheduler accept more concurrent requests for the same memory budget. **Long-tail decode requests dominate occupancy.** A request generating 4,000 tokens occupies a batch slot for 4,000 iterations. If most requests are short and a few are very long, the long ones take up more and more of the active batch over time, and your effective concurrency drops. Speculative decoding (drafting multiple tokens per step and verifying in parallel) attacks this by reducing the number of iterations a long request needs. 
Multi-token prediction heads (Medusa, EAGLE) extend the same idea inside the model.

**Heterogeneous request priorities.** Pure FIFO continuous batching does not handle priority well. If a latency-critical request arrives while the batch is full of background batch jobs, it has to wait for slots to open up. Modern serving systems layer admission control, priority queues, and preemption on top of iteration-level scheduling. Preemption is feasible because the smallest unit of work is a single decode step; you can pause a low-priority request, evict its KV cache, and restore it later when load drops.

## Practical Choices

For new serving deployments, the question is not whether to use continuous batching (you should) but which implementation and how to tune it. The relevant axes:

- **Max batch size.** Bounded by KV cache memory. Larger is better for throughput up to the compute roofline, then it hurts per-token latency.
- **Max waiting requests in queue.** Affects burst tolerance. Too high and you queue endlessly; too low and you reject load you could have served.
- **Chunked prefill chunk size.** Trades prefill latency for decode steadiness. Smaller chunks keep decode TPS stable but stretch time-to-first-token.
- **Prefix cache size.** Memory you give up from the active KV cache budget in exchange for cheaper prefill on repeat traffic.

vLLM, SGLang, TensorRT-LLM, and TGI all implement continuous batching with their own variations on these knobs. The defaults are usually sensible, but production workloads almost always benefit from tuning to the actual prompt-length and output-length distribution you serve.

## Where General Compute Sits

We run iteration-level continuous batching with chunked prefill, prefix caching, and disaggregated prefill/decode, on top of a custom inference stack tuned for our hardware.
The reason these batching choices matter is that they directly determine the price-performance curve we can offer for voice agents, coding assistants, and other latency-sensitive workloads.

If you want to try inference at a faster operating point than mainstream APIs, the General Compute API is OpenAI-compatible and you can swap your endpoint with a one-line change. The docs at [generalcompute.com](https://generalcompute.com) cover the model list and per-token pricing.

If you are building your own serving stack, the practical advice is short. Start from a continuous-batching engine (vLLM is the most common open-source choice). Profile your prompt and output length distribution before tuning anything. Add chunked prefill and prefix caching once you have a baseline. Disaggregate prefill and decode only when you have measured a real bottleneck that motivates the operational complexity. Each layer of the batching stack solves a specific problem; piling them on without diagnosing yours first is how serving deployments end up complicated and slow.

---

# Token Merging and Token Pruning for Faster Transformers

> Attention cost grows with the square of sequence length. Token merging and token pruning shrink that sequence mid-network, trading a little accuracy for real speedups. Here is how ToMe works, how the idea extends to language models, and where it breaks down.

- Author: General Compute
- Published: 2026-04-24
- Tags: inference, papers, transformers, token-merging, pruning, vision
- Canonical: https://www.generalcompute.com/blog/token-merging-and-token-pruning-for-faster-transformers

---

Attention is quadratic in sequence length. For a transformer block with N tokens, the attention matrix is N by N, and the compute and memory costs scale accordingly. Everything else in the block (the feedforward network, the projections, the layernorms) scales linearly with N.
So if you want a transformer to run faster and you cannot change the hardware or the model weights, the most direct lever is to reduce N.

Token merging and token pruning do exactly that. They shrink the sequence in the middle of the network, after the model has already formed some view of which tokens matter, and let the remaining layers do less work. The techniques started in vision transformers, where adjacent patches often carry similar information, and they have since been adapted to language models, speech models, and multimodal systems. The accuracy cost is real but often small, and the speedup is immediate because it attacks the part of the cost function that grows the fastest.

This post walks through ToMe (the canonical token merging paper), compares it to the family of token pruning methods, and covers the issues that come up when you try to apply these ideas to LLM inference rather than vision classification.

## The Sequence Length Problem

In a standard transformer layer, the attention pass is O(N^2 * d) and the feedforward pass is O(N * d^2), where d is the hidden dimension. For short sequences, feedforward dominates because d is usually much larger than N. For long sequences, attention dominates because N^2 * d overtakes N * d^2. The crossover point is roughly when N is on the order of d.

Modern LLMs push this crossover in both directions. Models have grown wider (d increases) but context windows have grown faster (N increases from 2K to 128K to 1M). In vision, a 224x224 image with a 16x16 patch size gives 196 tokens, and a 336x336 image with a 14x14 patch size gives 576 (441 at 16x16). Video transformers easily reach tens of thousands of tokens once you add a temporal dimension.

Once N is large enough that attention dominates, the payoff for reducing N is direct. Cutting the sequence in half roughly quarters the attention cost of the remaining layers.
Most of the reductions in the literature target 30 to 50 percent fewer tokens by the middle of the network, with the intent of keeping enough information that the final prediction is still correct.

## What ToMe Actually Does

ToMe (Token Merging) was introduced by Bolya et al. in 2022 for vision transformers. The core idea is that many adjacent tokens in a ViT carry redundant information, so merging them pairwise inside each transformer block reduces the sequence without training a new model or adding new parameters.

The mechanism has three pieces.

The first is a similarity score. After attention, each token has a key vector that the model already computed. ToMe uses the cosine similarity between pairs of keys as the matching signal. This is free: the keys are already there, and the similarity is a cheap dot product followed by a norm.

The second is a bipartite matching step. ToMe splits the tokens alternately into two sets (A and B), finds the best match for each token in A within set B using the similarity scores, and picks the top r pairs with the highest similarity. The r parameter is the merging budget for that layer, and it is the main knob that trades speed for accuracy. Setting r to 8 means merge 8 pairs per block, removing 8 tokens per block.

The third is the merge itself. Each selected pair gets averaged, with a small weighting adjustment so that merged tokens are not under-counted in subsequent attention passes. The averaging is weighted by a running count of how many original tokens each merged token represents, and the attention logits are offset by the log of that count so that a merged token is weighted as if its constituent tokens were still present. This is the "proportional attention" fix, and it matters more than it sounds like it should, because without it a token that stands for several patches is treated as a single key and accuracy drops.

ToMe runs inside each block, between the attention and MLP branches, so the MLP already sees the shortened sequence. A ViT-L/16 with ToMe at r=8 per block and 24 blocks removes up to 192 of the roughly 200 tokens of a 224x224 input, so only a handful remain at the output and the network processes roughly half as many tokens on average.
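As a rough illustration of the three pieces (similarity, bipartite matching, size-weighted merge), here is a dependency-free sketch of one merge step. It is not the reference ToMe implementation: it works on plain Python lists, takes `keys` as a separate argument, returns the unmerged A tokens followed by the B tokens (so the original order is not preserved), and leaves out class-token protection. All function names are made up for the example.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-12)


def tome_merge(tokens, keys, sizes, r):
    """Remove up to r tokens by merging the most similar A->B pairs."""
    a_idx = list(range(0, len(tokens), 2))  # set A: even positions
    b_idx = list(range(1, len(tokens), 2))  # set B: odd positions
    if not b_idx or r <= 0:
        return [list(t) for t in tokens], list(sizes)
    r = min(r, len(a_idx))

    # For every token in A, find its most similar partner in B (by key).
    edges = []
    for i in a_idx:
        score, j = max((cosine(keys[i], keys[j]), j) for j in b_idx)
        edges.append((score, i, j))
    edges.sort(reverse=True)
    chosen = edges[:r]                      # top-r most similar pairs
    merged_a = {i for _, i, _ in chosen}

    # Size-weighted sums, so a token standing for 3 patches counts 3 times.
    acc = {j: ([x * sizes[j] for x in tokens[j]], sizes[j]) for j in b_idx}
    for _, i, j in chosen:
        vec, n = acc[j]
        acc[j] = ([v + x * sizes[i] for v, x in zip(vec, tokens[i])], n + sizes[i])

    out_tokens = [list(tokens[i]) for i in a_idx if i not in merged_a]
    out_sizes = [sizes[i] for i in a_idx if i not in merged_a]
    for j in b_idx:
        vec, n = acc[j]
        out_tokens.append([v / n for v in vec])
        out_sizes.append(n)
    return out_tokens, out_sizes


tokens = [[2.0, 0.0], [4.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
merged, merged_sizes = tome_merge(tokens, keys=tokens, sizes=[1, 1, 1, 1], r=1)
```

Here tokens 0 and 1 point in the same direction, so they form the most similar pair and are averaged into `[3.0, 0.0]` with size 2, while the other two tokens pass through unchanged. The returned sizes are what a proportional-attention implementation would consume: the paper adds the log of each token's size to the attention logits.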
On ImageNet classification, ToMe reports around 2x throughput with about 0.4 percent top-1 accuracy drop on ViT-L, and the drop shrinks further when the model is fine-tuned with ToMe active during training.

The paper also shows that ToMe composes with other acceleration methods. It does not change the model weights, does not change the training loss, and does not require the architecture to expose any special hooks. You insert it, pick r per block, and you get a faster model.

## Token Pruning: The Other Half of the Family

Token pruning takes a more aggressive position. Instead of merging tokens it thinks are similar, it drops tokens it thinks are unimportant.

The distinction matters. Merging preserves information from every input token (in aggregate), while pruning discards the dropped tokens' information entirely. This lets pruning reach larger speedups at the cost of more careful selection.

The canonical vision version is DynamicViT (Rao et al., 2021), which adds small prediction modules at a few stages of the network that assign an importance score to each token. The lowest-scoring tokens at each stage are dropped from subsequent layers. DynamicViT needs training with a distillation loss to teach the predictor which tokens are safe to drop, which makes it less plug-and-play than ToMe but also gives it more control over what gets kept.

Other vision pruning methods use attention scores directly. EViT (Liang et al.) keeps the top-k tokens by attention weight from the CLS token and fuses the rest into a single pooled token. A-ViT (Yin et al.) uses the model's own halting scores. The variations mostly differ in how they score tokens and whether they drop or fuse the unimportant ones.

A useful way to think about the family: merging assumes redundancy, pruning assumes irrelevance. In a densely packed image, most adjacent patches are redundant with their neighbors, so merging works well.
In a sparse input (say, a medical image where most of the frame is empty tissue and only a small region matters), pruning works better because the irrelevant tokens really are irrelevant. ## Moving to Language Models Applying token merging or pruning to language transformers is harder than applying it to vision. The reasons are structural, and they explain why LLM serving has adopted these ideas slowly compared to ViTs. First, language tokens are less redundant than image patches. Two adjacent patches in an image are often nearly identical. Two adjacent tokens in a sentence rarely are. Merging "the" and "cat" into an average representation loses the distinction between them in a way that averaging two neighboring pixel patches does not. Second, autoregressive generation is causal. In a ViT, all tokens attend to all other tokens, so dropping token i from the sequence just removes one node from the graph. In a decoder, every future token attends to every past token, so dropping token i changes the attention output of every token j where j > i. The effect compounds across layers, and the model's behavior at generation time diverges from what it saw during training. Third, the KV cache changes the accounting. In vision, reducing the sequence reduces both the forward pass cost and the memory needed to hold activations. In LLM serving, the dominant cost at long context is the KV cache, not the forward pass. A method that reduces the working sequence at each attention pass but does not remove entries from the KV cache does not save memory, just compute. There are now several approaches that address these constraints. LazyLLM (Apple, 2024) introduced dynamic token pruning specifically for LLM inference. The idea is to skip attention computation for tokens that the model does not currently need to attend to, based on attention scores from earlier layers. 
LazyLLM does not remove tokens from the KV cache permanently; it just skips them during the forward pass for a given step. If a later step needs them, they can be reactivated. This matches the prefill/decode asymmetry well: during prefill, most tokens are important, but during decode, only a small subset of past tokens usually contribute meaningfully to the next token's attention. PyramidInfer (2024) reduces the KV cache size layer by layer, keeping more tokens in the earlier layers and fewer in the later ones. This is shaped like a pyramid, hence the name. The motivation is that the later layers' attention is often concentrated on a small number of tokens, while earlier layers spread their attention more broadly. Keeping the full KV cache in early layers and a pruned version in later layers saves memory without much accuracy cost. H2O (Heavy-Hitter Oracle) and similar methods prune the KV cache directly based on accumulated attention weights. A token that has been attended to many times is a "heavy hitter" and gets kept. A token that has rarely been attended to gets evicted. These methods are more aggressive and can reduce KV cache size by 50 percent or more on long-context workloads, though accuracy on needle-in-a-haystack tasks usually suffers if the pruning is too aggressive. StreamingLLM (Xiao et al.) takes a different angle: keep the first few "attention sink" tokens plus a sliding window of recent tokens, drop everything else. This is pruning as an architectural choice rather than a learned one. It works for infinite-context streaming but loses information from the middle of the sequence, so it is not a drop-in replacement for full attention on retrieval tasks. ## Composition with Other Inference Techniques Token reduction methods compose well with most inference optimizations, but the interactions are worth thinking about. 
With paged attention and KV cache management, token pruning has to know whether it is removing tokens from the working set (cheap) or evicting them from the cache (requires recomputation if the tokens come back). Most modern implementations distinguish these cases explicitly. With speculative decoding, token reduction in the verifier model has to stay conservative enough that the verifier's scoring of draft tokens does not drift from what the full model would produce. A too-aggressive pruner can cause acceptance rates to drop, which eats into the speculative speedup. With quantization, the effects are roughly additive. A quantized model with token merging gets both the per-operation speedup of quantization and the sequence-length speedup of merging. The accuracy cost is also roughly additive, so the total drop is larger than either alone. In practice, people tune them together rather than stacking them independently. With continuous batching, token reduction shrinks the per-request compute and memory footprint, which lets more requests fit in the same batch. This is usually the easiest way to see the benefit in a serving system: not a change in single-request latency, but a change in how many concurrent requests the engine can sustain. ## When Token Reduction Is Worth It Token merging and pruning are worth it when sequence length is the dominant cost. For short-context chat (a few hundred tokens in, a few hundred out), the feedforward and projection costs are already much larger than attention, and saving tokens does not save much overall. For long-context workloads (summarization, document QA, long agentic traces, vision-language models with many image tokens), attention dominates and the savings show up clearly. They are also worth it when you can afford a small accuracy regression and cannot afford to retrain the model. ToMe famously works without retraining, and most LLM-side methods (LazyLLM, PyramidInfer) are training-free or nearly so. 
If you have the budget to fine-tune with the method active, accuracy recovers substantially, which is why production deployments often do a short calibration run. They are not worth it when the workload is short-context and compute-bound, or when the accuracy budget is very tight. On a 1K-context chat model, token merging might save 10 percent latency at a 0.5 percent accuracy cost. On a 128K-context document QA model, the same method might save 40 percent latency at the same accuracy cost. The payoff scales with N. ## The Practical Picture Across both vision and language, the lesson is the same. Transformers do not need all their tokens to make their final prediction, and most of the time, the tokens they do need can be identified cheaply. Cosine similarity on keys works. Attention-score heuristics work. Small learned predictors work. The specific choice depends on the model and the workload, but the existence of the redundancy is now well established. For a serving stack, the main question is where to apply the reduction: in the forward pass, in the KV cache, or both. ToMe and its descendants reduce the forward pass. H2O and StreamingLLM reduce the cache. PyramidInfer does both, at different rates per layer. Getting these choices right usually involves profiling the specific model and workload, because the right r per block or the right eviction threshold is not universal. At General Compute, our infrastructure focuses on the primitives that sit underneath these techniques: fast attention kernels, efficient KV cache management, and schedulers that understand variable sequence lengths. Token merging and pruning are less commonly exposed as a user-level feature because they change model behavior in ways developers usually want to control themselves, but the underlying engine is built to support them cleanly when a workload needs them. 
If you are running long-context inference and want to experiment with this class of optimizations, the [API](https://generalcompute.com) and [docs](https://generalcompute.com/docs) are the place to start.

---

# S3: Scheduling for Straggler Mitigation in LLM Serving

> In LLM serving, a single long-running request can stall everyone else sharing the same batch. S3 attacks that by predicting output length and scheduling around it. Here is what stragglers actually cost you, and how output-length-aware scheduling helps.

- Author: General Compute
- Published: 2026-04-23
- Tags: inference, papers, serving, scheduling, tail-latency, fairness
- Canonical: https://www.generalcompute.com/blog/s3-scheduling-for-straggler-mitigation-in-llm-serving

---

If you watch the per-request latency distribution of an LLM serving system over a long enough window, you start to see a pattern that does not look like a normal distribution. The median is fine. The p90 is roughly what you would expect from your model and hardware. The p99 is several times the median, and the p99.9 is sometimes ten times worse. Most of those tail samples are not the unlucky requests that hit a cold cache or a noisy neighbor. They are the requests that sat behind a much longer one, sharing a batch slot, waiting their turn to make progress.

Stragglers are the requests that run substantially longer than the typical request in the batch. In LLM serving, the most common reason a request becomes a straggler is that its output length turns out to be much larger than other requests sharing the same batch. The model decides to keep generating, the engine cannot evict it without losing work, and any request that was sized to fit the same memory budget starts paying for that decision. Output length is the variable that drives most of the tail in production traces.

S3 (Jin et al., 2023) was one of the first papers to formalize this and to build a serving system around output length prediction.
The idea is straightforward: train a small classifier that predicts the eventual output length of a request based on the prompt, use that prediction to pack the batch more tightly, and reschedule when predictions turn out to be wrong. The full title is "S^3: Increasing GPU Utilization during Generative Inference for Higher Throughput," and while the headline metric is throughput, the underlying mechanism is straggler mitigation. By predicting which requests will run long, the scheduler can avoid the worst kinds of co-location decisions.

This post walks through what straggling looks like in practice, how S3 reduces it, and how the broader serving ecosystem has built on the same observation since 2023.

## Where Stragglers Come From

A modern serving engine processes many requests concurrently. With continuous batching (Orca), requests join and leave the batch at iteration boundaries, so a finished request can release its slot to a waiting one without draining the whole batch. That works well when requests have similar runtimes. It works less well when one request keeps generating for ten thousand tokens while its batchmates finish in two hundred.

The problem is not the long request itself. Long generations are valid, often important, and unavoidable. The problem is what happens to the other requests that were admitted into the batch with the implicit assumption that everyone would finish in roughly the same time.

The KV cache is the immediate constraint. When a request is admitted to the batch, the engine reserves space for its KV cache. If the engine sized that reservation to the typical request, a request that runs much longer overflows the reservation and forces an eviction or a swap. If the engine sized the reservation conservatively to the worst case, it can admit far fewer requests and throughput collapses. Either way, the long request distorts the choices made for everyone else.

Memory pressure is one symptom. The other is iteration time.
A batch of N decodes runs at a rate set largely by its longest active request, because the attention pass loads KV cache in proportion to context length. In an engine that pads every sequence in the batch to the longest one, a batch with one request at 8K context and four requests at 1K context runs almost as slowly as a batch with five requests at 8K. Variable-length attention kernels avoid the padding, but the iteration still cannot finish until the long request's KV cache has been read. Either way, the short requests pay the memory bandwidth tax of the long one, even though they do not need it.

Stragglers also distort fairness. If your scheduling policy is FIFO, a long request that arrived earlier will hold its slot for a long time and delay all later arrivals. If your policy is shortest-job-first, you need to know the job length, which is exactly what is unknown for autoregressive generation. Most production systems sit somewhere between these two and accept that some long requests will block some short ones.

## What S3 Predicts and Why

S3's central move is to predict output length up front. The model used for prediction is a small classifier (a few-layer transformer, in the original paper) that takes the prompt as input and outputs a length bucket. The buckets are coarse: short, medium, long, and so on, rather than exact token counts. Coarse predictions are easier to learn, easier to calibrate, and good enough for the scheduling decisions that follow.

The prediction is then used in two places.

First, batch admission. When a new request arrives, the scheduler uses the predicted length to estimate the KV cache footprint of the request over its lifetime. If admitting it would push the projected memory usage past safe limits during the predicted generation window, the request is delayed or sent to a different batch where it is more compatible with the existing requests.

Second, batch composition. Rather than mixing long and short requests indiscriminately, S3 groups requests with similar predicted lengths into the same batch when possible. This reduces the variance within a batch.
Memory reservations are tighter because the engine can size them to the predicted length instead of a worst-case envelope. Iteration cost is more predictable because the active context lengths are similar.

The result is higher utilization without the tail-latency cost of mismatched batches. The paper reports several-fold throughput improvements on workloads where output length variance is high, which matches what you see in real traffic from chat and code completion.

## Handling Mispredictions

A length predictor is wrong sometimes. The interesting question is what the system does when that happens.

S3 treats a misprediction as an event the scheduler reacts to. If a request was bucketed as short but is still generating well past the short threshold, the engine has options. It can keep the request in its current batch and accept the disruption. It can evict the request and put it back in the queue, paying the cost of recomputing or transferring its KV cache. Or it can move it to a "long" batch with other long-runners, where its continued generation does not harm shorter neighbors.

The paper's evaluation shows that the cost of mispredictions is real but bounded. The classifier is correct often enough that the throughput gains from accurate predictions far outweigh the eviction costs of the misses. The exact crossover depends on workload mix, but roughly, you want the classifier accuracy on coarse buckets to be above 70 percent for the system to be worth running.

Calibration matters more than raw accuracy. A classifier that is confidently wrong does more damage than one that hedges. S3 uses class probability outputs rather than hard predictions where possible, so the scheduler can make decisions like "place in medium batch unless confidence in the long bucket is above some threshold."

## Other Forms of Straggler Mitigation

S3 is one approach.
The space of straggler mitigation in LLM serving has grown since 2023, and S3 belongs to a broader family of scheduling techniques worth knowing.

Preemption-based scheduling lets the engine pause a request mid-generation, reclaim its KV cache, and resume it later. vLLM's swap mechanism is an instance: the KV cache for a paused request is swapped to CPU memory and brought back when the request is rescheduled. This makes long requests less disruptive because the engine can preempt them when they start to crowd out shorter ones. The cost is the swap bandwidth and the latency hit on the preempted request.

Priority queuing with admission control is a simpler approach. Requests are tagged with a priority class on arrival (often based on user tier or use case), and the scheduler admits high-priority requests preferentially. This does not predict anything; it just lets operators express which workloads should win when there is contention. It is widely deployed and complements length prediction rather than competing with it.

Output length budgets, set by the client, are another lever. If the client commits to a `max_tokens` of 200 instead of leaving it at the model maximum, the engine has a hard upper bound it can use for memory planning. Many production deployments enforce or strongly encourage clients to set these budgets, and the effect on tail latency is usually large. This is sometimes called explicit length prediction, in contrast to S3's implicit prediction.

Speculative scheduling and rollback is a related technique seen in more recent systems. The engine batches optimistically (assuming requests will not exceed certain thresholds) and rolls back the work for any request that violates the assumption. This is useful when the cost of being wrong is low but the gain from being right is high.

Sarathi-Serve's chunked prefill addresses a different straggler: the long prefill that blocks decode iterations.
The mechanism is unrelated to output length prediction, but the goal (keep one long thing from punishing many short things) is the same. In production, chunked prefill and output-length-aware scheduling are usually deployed together. They handle different sources of variance.

## Tail Latency Under Output Length Prediction

The metric that improves the most under S3-style scheduling is p99 latency for short requests. Without prediction, a short request that lands in a batch with several hidden long-runners pays for their generation time by getting fewer effective batch iterations per second. With prediction, the short request is steered to a batch of similar requests, and the long-runners are grouped elsewhere where their cost is paid by other long-runners.

The p99 for long requests does not improve much, because by definition they have to do more work. What does change is that long requests stop interfering with short ones, so the system's overall fairness improves even though the distribution still has heavy tails.

A practical effect: SLO planning becomes much easier. If you can promise different SLOs to different request classes (sub-second TTFT and sub-50ms TPOT for chat, looser numbers for batch summarization), you need a scheduler that can keep those classes from contaminating each other. Length prediction is one of the ways to enforce that separation.

## Where Length Prediction Falls Short

Output length is not the only source of straggling. A request that starts a tool-call loop, a request that gets stuck in a degenerate repetition, or a request whose stop conditions are never satisfied can all run unexpectedly long for reasons no classifier can predict from the prompt alone. For these cases, the engine needs runtime safety nets: hard token limits, repetition detection, and timeout policies. S3 addresses the predictable variance, not the pathological cases.
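
The runtime safety nets named above can be sketched as a check the engine runs after every generated token. This is a minimal illustration, not how any particular engine implements it: `GuardConfig`, `is_repeating`, `stop_reason`, and all the default thresholds are hypothetical, and the repetition test is a deliberately simple back-to-back n-gram match.

```python
import time
from dataclasses import dataclass


@dataclass
class GuardConfig:
    # All defaults are illustrative assumptions, not recommended values.
    max_new_tokens: int = 2048   # hard token limit
    ngram: int = 8               # n-gram size used by the repetition check
    max_repeats: int = 4         # back-to-back repeats that count as a loop
    timeout_s: float = 60.0      # wall-clock budget per request


def is_repeating(tokens, ngram, max_repeats):
    """True if the last `ngram` tokens occur `max_repeats` times back to back."""
    span = ngram * max_repeats
    if len(tokens) < span:
        return False
    tail = tokens[-span:]
    first = tail[:ngram]
    return all(tail[i:i + ngram] == first for i in range(0, span, ngram))


def stop_reason(tokens, started_at, cfg, now=None):
    """Return why generation should stop, or None to keep going."""
    now = time.monotonic() if now is None else now
    if len(tokens) >= cfg.max_new_tokens:
        return "max_tokens"
    if now - started_at > cfg.timeout_s:
        return "timeout"
    if is_repeating(tokens, cfg.ngram, cfg.max_repeats):
        return "repetition"
    return None
```

In this sketch the checks run regardless of what the length predictor said, so a request that was bucketed as short but enters a loop is still cut off by the token limit, the timeout, or the repetition check.
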
Length prediction also has worse coverage on instruction-tuned models that were trained to produce long structured outputs (reports, code blocks, multi-step explanations) where the prompt carries little signal about the eventual length. On those workloads, the classifier often defaults to "medium" with low confidence, and the scheduler falls back to its default policy. This is fine; it just means the prediction does not buy as much.

Finally, prediction adds latency. The classifier itself runs on every incoming request. The S3 paper's classifier was small enough that this cost was negligible compared to the prefill it precedes. Larger or more accurate predictors might shift the tradeoff. Most production deployments use very small predictors so the prediction cost stays in the low milliseconds.

## How Production Systems Have Adopted the Idea

By 2026, output-length-aware scheduling is common in serving stacks that handle high-variance traffic. Most implementations are not direct ports of S3 but follow the same recipe: a lightweight predictor, coarse buckets, batch grouping, and a misprediction policy.

vLLM has experimental length-prediction support in some forks and contrib modules. SGLang's scheduler accepts external length hints when the application can supply them. TensorRT-LLM exposes per-request priority and budget hints that can be driven by an upstream predictor.

Many production deployments do not run a learned predictor at all and instead rely on user-supplied `max_tokens` to do the same job. This works well when the client population is well behaved. It works poorly when many clients leave `max_tokens` at the default, which is one reason serving providers often add a server-side classifier as a safety net even when the API exposes the budget knob.

The broader takeaway from S3 is that scheduling decisions in LLM serving are not one-shot.
They depend on quantities (output length, KV cache footprint, time-to-completion) that you do not know until generation finishes. Any system that improves the estimate of those quantities up front gives the scheduler more room to make good decisions, and the gains compound across all the other techniques in the stack.

## Closing

Stragglers in LLM serving are mostly a symptom of one thing: not knowing how long a request will run. Continuous batching, paged attention, and chunked prefill all assume the engine has reasonable estimates of the work each request will do. When those estimates are wrong, the techniques still work but their throughput and tail-latency benefits narrow. S3's contribution was to take the prediction problem seriously and show that even a small, coarse-grained classifier moves the numbers a lot.

At General Compute, fair scheduling under high-variance workloads is one of the things we tune carefully, because voice agents, coding assistants, and multi-tenant inference all exhibit exactly the kind of length distribution where stragglers become expensive. If you are serving a workload where some requests run an order of magnitude longer than others, the techniques in this post (length prediction, priority classes, preemption, chunked prefill) all stack and all help. The docs have the latency and throughput numbers if you want to see how the scheduling choices map to your traffic profile.

---

# Chunked Prefill: Overlapping Compute and Communication

> Prefill pins the compute units while decode starves for memory bandwidth. Sarathi-Serve splits prefill into chunks and piggybacks decodes on them, keeping both resources busy in the same batch. Here is how it works and where the limits are.

- Author: General Compute
- Published: 2026-04-22
- Tags: inference, papers, serving, prefill, decode, scheduling, sarathi
- Canonical: https://www.generalcompute.com/blog/chunked-prefill-overlapping-compute-and-communication

---

Run any modern serving engine at steady state with mixed traffic and you can watch the GPU utilization graph flicker. A long prompt arrives, the engine spends two or three hundred milliseconds doing prefill, and during that window every in-flight decode step is either paused or executed at higher latency than normal. Users see the symptom as a brief freeze in their token stream. Operators see it as TPOT (time per output token) spikes in their dashboards. The interference is not a bug. It is what happens when two workloads with opposite resource profiles share one device.

Chunked prefill is the main technique production stacks use to reduce that interference without paying for full disaggregation. Sarathi-Serve (Agrawal et al., 2024) formalized the idea, but the intuition is simple. You break each prefill into smaller chunks and run those chunks in the same batch as active decode steps. The prefill chunk does the heavy matmul work and saturates the tensor cores. The decode steps piggyback on the same batch and use the memory bandwidth that would otherwise sit idle. Neither phase waits for the other. The GPU ends up running closer to its true ceiling, because the two phases fill different parts of the roofline.

This post walks through the mechanics of chunked prefill, the tradeoffs Sarathi-Serve explores, and how the technique interacts with the rest of the serving stack. It is a close cousin of disaggregated prefill and decode, and the two approaches are often compared. They solve the same underlying problem with different tools.

## Why Colocated Prefill and Decode Interfere

Prefill and decode have well-known asymmetries. Prefill takes an input of length N and runs the model once against the whole sequence.
The matmuls have a large inner dimension. They saturate the tensor cores and the bottleneck is FLOPs. Decode takes a single new token, attends to the cached prefix, and produces one logit distribution per step. The matmul shapes are small. The bottleneck is the bandwidth required to load the KV cache and the model weights.

On an H100 running Llama 3 8B, a typical 2K-token prefill takes somewhere around 150 milliseconds. A single decode step on the same model runs in roughly 15 to 25 milliseconds at reasonable batch sizes. If you run them back to back on the same GPU, decode steps that were queued behind a prefill sit waiting for the prefill to finish. A decode that should have emitted a token every 20 milliseconds instead emits nothing for 150 milliseconds and then resumes. That jitter is exactly what voice agents, coding assistants, and interactive chat cannot tolerate.

Continuous batching (Orca) partially addresses this by allowing the serving engine to add new requests at iteration boundaries rather than waiting for the current batch to drain. It helps with throughput. It does not directly help with the prefill-versus-decode conflict, because a long prefill still occupies a full iteration once it begins. The engine cannot preempt a prefill mid-kernel.

Adding priority rules on top of continuous batching is one workaround. You can defer new prefills if decode SLOs are under threat. But you cannot defer prefills indefinitely, or TTFT balloons for anyone waiting to start. The two SLOs pull in opposite directions and any scheduling policy is making a tradeoff.

## The Chunked Prefill Idea

Sarathi-Serve's observation is that prefill is only indivisible if you let it be. You can compute prefill in slices along the sequence dimension. A 2K-token prefill can be done as four chunks of 512 tokens, or eight chunks of 256.
Each chunk is a valid forward pass over a contiguous window of the prompt, and the output is the same KV cache you would have produced in one shot, just assembled piece by piece.

Once prefill is divisible, you can place each chunk in a batch alongside active decode requests. The batch has two kinds of workloads inside it: one prefill chunk operating on K tokens, and N decodes operating on one token each. The total number of tokens in the batch is K + N. The attention kernel runs over this mixed input, using masking tricks to handle the different sequence contexts.

Here is where the name of this post comes in. In a mixed batch, the prefill chunk does the compute-bound work. It provides enough dense arithmetic to keep the tensor cores busy. The decode operations provide memory-bandwidth demand, because their attention passes have to load the KV cache for each active request. On current GPUs, the tensor cores and the memory subsystem can run in parallel. A single mixed batch can keep both busy at once. You are overlapping compute (prefill chunk) and communication (decode KV cache loads) inside a single forward pass.

Sarathi-Serve calls this "stall-free batching." The decode path never has to stop and wait for a prefill to complete. Every iteration includes some amount of decode progress, either as a pure decode batch or as a decode plus prefill-chunk mixed batch. TPOT stays stable regardless of what prefills are flowing through the system.

## Chunk Size Tradeoffs

The chunk size is the dial you tune, and it directly controls the tradeoff.

Small chunks make the mixed batch decode-dominated. TPOT stays very low because prefill barely perturbs the decode path. But you have split a 2K prefill into many small kernel calls, each of which carries fixed overhead (kernel launch, pipeline setup, attention mask construction). The total wall time of prefill increases. TTFT gets worse.

Large chunks reverse the balance.
The prefill completes in fewer steps, so TTFT is close to the unchunked baseline. But each step with a large chunk pushes more work into the tensor cores, and decode latency in that step goes up. TPOT regresses.

Sarathi-Serve's paper reports that chunk sizes around 512 or 1024 tokens work well on typical hardware for mid-sized models. The optimal number depends on the model shape, the GPU, and the traffic mix. Longer-context workloads tend to want larger chunks because prefill is expensive relative to decode. Short-prompt workloads can run with small chunks without paying much of a TTFT tax.

One non-obvious source of chunk-size cost is attention itself. When you prefill chunk K after chunk K-1, the attention pass for chunk K has to attend to all prior chunks (the already-populated KV cache plus any decode history). That extra attention work is not free. Chunked prefill pays a small overhead compared to one-shot prefill of the same sequence, because attention is no longer a single N-by-N matmul but a sequence of partial matmuls with increasing context lengths. The arithmetic is essentially the same as one-shot causal attention, but each chunk has to re-read the keys and values written by earlier chunks, and the work is split into smaller, less efficient kernel calls. On flash-attention-style kernels this overhead is usually a few percent, but it grows with sequence length. For very long prompts, chunking costs real time on top of the base prefill.

## The Scheduling Policy

Chunking is only half of the story. The other half is deciding which requests go into each batch and how to fill the token budget.

Sarathi-Serve operates with a per-iteration token budget. Each iteration processes at most B tokens across all requests in the batch (B might be 2048 or 4096, depending on memory and latency targets). Inside each iteration the scheduler looks at the in-flight requests and allocates token slots. Decodes each cost one token. Any remaining budget is filled with a chunk from a pending prefill. If the prefill would exceed the remaining budget, it is sliced so that the chunk fits exactly.

This makes the iteration cost predictable.
Every iteration runs on roughly B tokens of work, so decode latency is roughly constant. TTFT for any given request is the number of iterations it takes to fully chunk through its prompt, plus queueing delays before its first chunk gets scheduled.

The scheduler can also prioritize. If decode SLOs are at risk, it can temporarily reduce the chunk size on new prefills to keep TPOT down. If TTFT is lagging, it can bump chunk size to push prefills through faster. These are simple policies sitting on top of the chunk mechanism, but they give operators a way to bias the system toward whichever SLO is currently bleeding.

## Against the Alternatives

Chunked prefill and disaggregated prefill-decode are often framed as competing approaches. In practice they are complementary.

Disaggregation (Splitwise, DistServe) moves prefill and decode onto separate GPU pools. It avoids interference entirely, at the cost of a KV cache transfer between pools and the operational complexity of running two fleets. It shines when you have tens of GPUs or more and strict SLOs on both phases. It is overkill for a single-node deployment.

Chunked prefill keeps both phases on the same GPU but schedules them carefully so they do not collide. There is no cross-node cache transfer. Implementation is mostly a change to the scheduler and the attention kernel. The engineering complexity is much lower than disaggregation. The downside is that you are still fitting two different workloads on one resource, which caps how far you can push each SLO.

A reasonable mental model: chunked prefill gets you 60 to 80 percent of the benefit of disaggregation for 10 percent of the engineering cost. For smaller deployments or for teams that do not yet need maximum throughput, it is the right first move. For large-scale latency-sensitive production, you often end up doing both: chunked prefill within each pool, and disaggregation across pools.
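
As a rough illustration of how small the scheduler-side change can be, here is a minimal sketch of a per-iteration token budget: decodes cost one token each, and whatever budget is left is filled with prefill chunks sliced to fit. The `Request` class, `plan_iteration`, and the budget values are hypothetical names chosen for this example, not Sarathi-Serve's or any engine's actual API. Letting several pending prefills share one iteration is also an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    prompt_len: int
    prefilled: int = 0  # prompt tokens already processed by earlier chunks

    @property
    def in_decode(self) -> bool:
        # A request starts decoding once its whole prompt has been prefilled.
        return self.prefilled >= self.prompt_len


def plan_iteration(requests, token_budget):
    """Plan one iteration as a list of (rid, kind, num_tokens) entries.

    Decodes are scheduled first at one token each. Leftover budget goes to
    pending prefills, each sliced so the iteration never exceeds the budget.
    """
    plan = []
    remaining = token_budget

    for r in requests:
        if r.in_decode and remaining > 0:
            plan.append((r.rid, "decode", 1))
            remaining -= 1

    for r in requests:
        if remaining == 0:
            break
        if not r.in_decode:
            chunk = min(r.prompt_len - r.prefilled, remaining)
            plan.append((r.rid, "prefill", chunk))
            r.prefilled += chunk  # record progress for the next iteration
            remaining -= chunk

    return plan
```

With two active decodes and a budget of 512, a pending 2,000-token prompt gets a 510-token chunk per iteration and finishes prefill in four iterations, while both decodes emit a token in every one of them.
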
## Implementation Notes

Adopting chunked prefill in a serving engine touches a few places:

The scheduler has to track per-request prefill progress. Each request carries how many prompt tokens have been processed so far, so the scheduler can slice the next chunk off correctly.

The attention kernel needs to handle mixed batches with different context lengths per sequence. This is the "variable length attention" or "flash-attention with variable seq_len" path. Most production kernels already support it because continuous batching needs the same feature.

The KV cache layout has to allow incremental writes. If you are using PagedAttention (vLLM) or a similar block-based cache, each prefill chunk writes its slice of the cache into the right blocks. There is no special allocation logic beyond what pure prefill already needs.

Observability becomes more important, not less. Chunked prefill makes behavior smoother but also harder to reason about from traces. A request's TTFT now depends on how many chunks it took, how busy the scheduler was when each chunk was eligible to run, and how many decodes were co-batched with those chunks. Good tracing that records per-chunk scheduling decisions pays for itself the first time you debug a TTFT regression.

## How Production Stacks Use It

By early 2026, chunked prefill is the default or widely enabled in most serious serving stacks. vLLM has had chunked prefill in its main branch since 2024 and it is on by default in most recent releases. TensorRT-LLM supports it as an opt-in policy. SGLang runs a variant of it as part of its scheduler. Dynamo combines chunked prefill within each pool with disaggregation across pools, which is where the top-of-market latency numbers come from.

The technique has also shaped the thinking about how to benchmark serving systems. Older benchmarks reported peak prefill throughput and peak decode throughput as separate numbers.
That is no longer informative, because in production the two phases share GPUs and the only number that matters is the sustained mixed-workload behavior under realistic traffic. Sarathi-Serve's paper was part of a shift toward mixed-workload benchmarks that report both TTFT and TPOT distributions across realistic traffic mixes.

## When Chunked Prefill Does Not Help Much

If your workload is all prefill (large batch offline inference over short outputs) or all decode (very long generations on short prompts), chunked prefill adds no benefit and may slightly regress throughput because of the chunking overhead. Pure prefill workloads should run one-shot prefill at the largest batch size memory allows. Pure decode workloads are fine with standard continuous batching.

If your fleet has only one GPU and very tight SLOs on both TTFT and TPOT, chunked prefill is the right tool, but there are physical limits. A single H100 has a finite amount of memory bandwidth and tensor-core throughput. At high enough load, both phases slow down no matter how cleverly you schedule. At that point the next move is adding GPUs, either by replicating the same setup or by splitting the phases into disaggregated pools.

## Closing

Chunked prefill is one of those techniques where the idea is obvious in hindsight and the engineering is mostly unglamorous. You divide a big kernel into smaller ones, schedule them together with other small kernels, and let the hardware do what it was already capable of doing. The payoff is a steadier latency profile for interactive workloads without the operational weight of running separate pools.

At General Compute we lean on both chunked prefill inside each node and disaggregation across nodes for the latency-sensitive workloads our customers run.
Voice agents and real-time coding assistants are the kinds of workloads where a single 300ms prefill stall can break the user experience, and chunked prefill is one of the simpler levers that keeps those stalls from showing up. If you are shipping something where TPOT consistency matters as much as the average, our API and the serving stack behind it were built around keeping both numbers tight. The docs have the latency and throughput numbers if you want to see how they map to your workload.

---

# Cascade Inference: Using Small Models to Route to Big Ones

> FrugalGPT and its descendants show that most queries do not need the biggest model. We walk through the cascade pattern, routing classifiers, and the engineering trade-offs of sending easy work to cheap models and escalating only when needed.

- Author: General Compute
- Published: 2026-04-21
- Tags: inference, papers, routing, cascades, frugalgpt, llm
- Canonical: https://www.generalcompute.com/blog/cascade-inference-using-small-models-to-route-to-big-ones

---

Most LLM workloads consist largely of queries that a 7B model could handle perfectly well, mixed with a smaller set of genuinely hard queries that need a 70B or frontier model. Serving everything with the biggest model is the simplest design and also the most wasteful one. A cascade turns that waste into savings by trying a small model first, checking whether the answer is good enough, and only promoting to a bigger model when the small one falls short.

The canonical reference here is FrugalGPT by Chen, Zaharia, and Zou at Stanford, published in 2023. They showed that a well-tuned cascade across multiple commercial APIs could match GPT-4 quality on several benchmarks while cutting cost by up to 98 percent. The paper laid out three ideas: prompt adaptation (compressing or pruning prompts), LLM approximation (caching and distillation), and LLM cascade (the pattern we will focus on here).
The cascade was the most interesting of the three because it generalizes to any heterogeneous fleet of models, whether those are open-weight models of different sizes, third-party APIs at different price points, or a mix.

The basic structure is simple. What makes cascades work or fail in production is the quality of the scoring function that decides whether an answer is good enough to accept, and the latency you pay when the cascade misses and has to escalate.

This post walks through how cascades work, how they compare to upfront routing, where they help, where they hurt, and how to build one without creating more problems than you solve.

## The Cascade Pattern

A cascade is a pipeline of models ordered from cheapest to most expensive. Each stage tries to answer the query. After each stage, a scoring function inspects the answer and decides whether to accept it or pass the query to the next stage. If no stage accepts, you either return the best answer collected so far or escalate to the final stage unconditionally.

The FrugalGPT setup uses three commercial APIs as the stages. In an open-weight setting, the same pattern might look like Qwen 2.5 3B, Qwen 2.5 7B, Qwen 2.5 72B, with the tiny model catching simple lookups, the mid-size model handling most reasoning, and the large model reserved for the queries where both smaller models fail.

The scoring function is the interesting part. A few options are common:

A separate small model trained as a judge. The judge sees the query and the candidate answer and outputs a confidence score. This works but adds its own cost and latency, and you have to make sure the judge is cheaper than the next stage in the cascade, otherwise you would be better off just running the next stage directly.

The log-probability of the generated answer from the model that produced it. A model that is confident in its answer assigns higher probability to the tokens it generated. This is cheap because you already have the logits.
It is also noisy, since confidence does not always correlate with correctness, but it works well enough in many cases. A verifier head or separate reward model. Some teams train a small classifier on (query, answer) pairs with ground-truth labels of "correct" or "incorrect" for their domain. This is the most accurate option when you have labeled data, and the worst option when you do not. Heuristic checks for specific failure modes. For structured output tasks, you can parse the answer as JSON or SQL and reject anything that fails to parse. For code generation, you can run the code against unit tests. These checks are extremely cheap and very accurate for the narrow failures they catch, but they do not generalize to open-ended quality judgment. The scoring function has to be substantially cheaper than the next stage, or the cascade stops saving money. A rule of thumb: the judge should cost at most 10 to 20 percent of the cost of running the next stage, including its own latency. ## Routing as an Alternative Cascades do their routing after generation. An alternative is to route before generation: classify the incoming query and send it to the right model directly, skipping the smaller stages when you already know they are going to fail. RouteLLM (Ong et al., 2024) is the best-known example of this. They train a classifier on a mix of labeled query-model-outcome data and preference data, and use the classifier to decide whether a query goes to a strong model or a weak model. The classifier itself is small, often a fine-tuned BERT or an encoder-only model, and runs in a few milliseconds. Routing has two advantages over cascades. First, latency is bounded. A cascade that escalates from small to medium to large pays the sum of those generation times in the worst case. A router that picks "large" upfront pays only the generation time of the large model plus the router itself. 
Second, routing does not require a scoring function that can judge answer quality, which is often the hardest component to build.

The disadvantage is that routing has to predict the right model from the query alone, without seeing the answer. That is a harder problem than judging an answer after the fact. Routing classifiers typically max out around 75 to 85 percent accuracy on the "easy vs hard" distinction, and the errors on the margin cost you quality.

In practice, many production systems combine both: a coarse router upfront that filters out queries that obviously need the big model (long context, complex reasoning chains, multi-step tool calls), followed by a cascade over the remaining queries that uses small models as the default and escalates when they fall short.

## What the Trade-offs Look Like

The cost savings from a cascade depend entirely on the distribution of queries and the quality of the scoring function.

If 80 percent of your queries can be answered correctly by the small model, and your scoring function accepts 95 percent of those correct answers, then 76 percent of your queries are answered by the small model alone. The remaining 24 percent escalate. Suppose the middle stage handles half of those (12 percent) and the big stage handles the rest (12 percent). Every query pays for the small model, every escalated query pays for the mid model, and only the last 12 percent pay for the big model, so per 100 queries you end up paying roughly (100 times small_cost) + (24 times mid_cost) + (12 times big_cost). If the big model is 20x the cost of the small model and the mid model is 5x, this works out to 100 + 120 + 240 = 460 units per 100 queries, or roughly 4.6 units of cost per query, versus 20 units if you ran every query through the big model. That is about 77 percent savings, before counting the cost of the scoring function itself.

The same math shows why cascades break. If the scoring function is wrong, you either escalate too much (losing savings) or accept too often (losing quality). A scoring function at 70 percent precision usually erodes most of the cost benefit.
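The cost arithmetic for a three-stage cascade can be sketched in a few lines. This is a minimal sketch under stated assumptions, not a production cost model: the function name and its parameters are hypothetical, each query is charged for every stage it actually runs (including attempts that get rejected), and the cost of the scoring function itself is ignored.

```python
def cascade_cost_per_query(p_small_ok, accept_rate, mid_share,
                           small_cost=1.0, mid_cost=5.0, big_cost=20.0):
    """Expected cost per query for a small -> mid -> big cascade.

    p_small_ok:  fraction of queries the small model answers correctly
    accept_rate: fraction of those correct answers the scorer accepts
    mid_share:   fraction of escalated queries resolved at the mid stage
    Costs are relative units (assumption: small = 1, mid = 5, big = 20).
    """
    accepted_by_small = p_small_ok * accept_rate
    escalated = 1.0 - accepted_by_small          # these also run the mid stage
    reaches_big = escalated * (1.0 - mid_share)  # these also run the big stage
    # Every query runs the small stage, so its cost is paid in full.
    return small_cost + mid_cost * escalated + big_cost * reaches_big


baseline = 20.0  # every query served by the big model
for accept_rate in (0.95, 0.70):
    cost = cascade_cost_per_query(0.80, accept_rate, mid_share=0.5)
    print(f"accept_rate={accept_rate:.2f}: {cost:.2f} units/query, "
          f"{1 - cost / baseline:.0%} savings vs big-only")
```

Under this accounting, a scorer that accepts 95 percent of correct small-model answers lands at about 4.6 units per query, and dropping the accept rate to 70 percent raises that to about 7.6. The exact figures shift with how you charge for rejected attempts and for the scorer, so treat them as a way to compare scenarios rather than a forecast.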
A scoring function at 99 percent precision on the correctness of the small-model answer is hard to build, because it is almost as hard as generating the answer in the first place. The latency story is worse than the cost story. Every query that escalates pays two or three sequential generations plus two or three scoring evaluations. If your small model generates an answer in 300ms and your judge adds 50ms, a query that escalates to the big model (say 1.5s of generation) ends up at 1.85s total instead of 1.5s. That is a 23 percent latency penalty on the escalated queries. For interactive workloads with a strict tail latency SLO, this cost can outweigh the cost savings. Streaming makes cascades awkward. A cascade cannot easily stream tokens back to the client because it does not know if the answer will be accepted until generation finishes. You can stream from the final stage unconditionally, which works if the tail is small, but it defeats the purpose if the small model was supposed to handle 80 percent of traffic. ## Where Cascades Make Sense Batch workloads with loose latency requirements are the best fit. Document processing, bulk summarization, offline classification, and data pipelines all tolerate the extra latency of escalation and benefit directly from the cost savings. A 10x reduction in inference spend on a nightly batch job is real money. High-volume, low-diversity workloads also fit well. If you are processing millions of support tickets, most of them are going to look similar to each other. The small model handles the common patterns, and the cascade only escalates on the unusual cases. Scoring functions are easier to build too, because you can train a classifier on a large labeled set from within your own domain. Retrieval-heavy tasks can benefit when the small model is good enough given good context. If your RAG pipeline is retrieving accurate documents, a 7B model with those documents in context often matches a 70B model answering without them. 
A cascade here starts with the small model plus retrieved context, and only escalates when the small model hedges or declines to answer.

## Where Cascades Hurt

Interactive chat with strict latency requirements is a poor fit. The latency penalty on escalated queries, combined with the streaming problem, makes cascades hard to justify over just picking a good model and serving it directly.

Agentic workloads with many sequential LLM calls also suffer. If each step in a ten-step agent goes through a cascade, the escalation latency adds up across steps. If every step escalates, a 23 percent penalty per step is a 23 percent penalty on the whole trajectory: with the earlier example numbers, ten escalated steps add 3.5s to 15s of generation. That is usually worse than just using a slightly bigger model for every step.

Safety-critical applications where being wrong is expensive do not tolerate the inherent quality risk of accepting small-model answers. Even at 95 percent scoring precision, 5 percent of accepted answers will be wrong. For customer-facing medical advice, legal reasoning, or high-stakes decisions, the cheaper answer is not worth the occasional silent failure.

## Practical Implementation Notes

If you are building a cascade, start by measuring the distribution of your queries. Label a few thousand representative queries with "small model correct" and "small model wrong" using a stronger model or human judgment. This gives you an upper bound on how much savings the cascade can produce.

Build the scoring function next. Try the cheapest thing first: log-probs from the generating model, maybe combined with a format check if you have structured output. If those do not hit the precision you need, train a small judge on your labeled data. Do not start with a judge unless you have the data to train it well.

Measure the scoring function on a held-out set. Precision matters more than recall. A judge that rejects 30 percent of good answers costs you money (you escalate too often), while a judge that accepts 10 percent of bad answers costs you quality.
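Measuring a scoring function on a held-out set comes down to sweeping the accept threshold and reading off precision and recall at each point. The sketch below is one way to do that with plain Python. The function names, the toy scores, and the labels are all hypothetical; in practice the scores would come from your judge or from log-probs, and the labels from your own labeled queries.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall of 'accept if score >= threshold'.

    labels[i] is True when the small model's answer was actually correct.
    """
    accepted = [label for score, label in zip(scores, labels) if score >= threshold]
    true_accepts = sum(accepted)
    total_good = sum(labels)
    precision = true_accepts / len(accepted) if accepted else 1.0
    recall = true_accepts / total_good if total_good else 0.0
    return precision, recall


def pick_threshold(scores, labels, min_precision):
    """Threshold with the highest recall whose precision meets the target."""
    best = None
    for t in sorted(set(scores)):
        p, r = precision_recall(scores, labels, t)
        if p >= min_precision and (best is None or r > best[2]):
            best = (t, p, r)
    return best  # (threshold, precision, recall), or None if nothing qualifies


# Toy held-out set (illustrative values only).
scores = [0.95, 0.90, 0.85, 0.80, 0.60, 0.55, 0.40, 0.30]
labels = [True, True, True, False, True, False, False, False]

choice = pick_threshold(scores, labels, min_precision=0.9)
print("threshold, precision, recall:", choice)
```

Fixing a precision floor and maximizing recall under it is only one possible policy. If escalation cost dominates in your domain, the same sweep can be used with a recall floor instead.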
Tune the accept threshold to optimize for whichever is more costly in your domain. Fall back gracefully. If the final stage of the cascade produces an answer that the scoring function also rejects, return it anyway rather than failing the request. The alternative is to have queries that no model in the pipeline can satisfy, which is almost always worse than returning the best available attempt. Monitor the escalation rate over time. Query distributions drift, and a cascade that was saving 70 percent last quarter might only be saving 40 percent now. If the escalation rate creeps up, the scoring function probably needs retraining, or the small model needs fine-tuning on the new query types. ## Commercial Products and Where the Field Is Going Several companies have built routing and cascade products as commercial offerings. Martian, Unify, and Not Diamond each ship some variant of "one API that routes across many underlying models." RouteLLM (from the LMSYS team) is an open-source reference that teams can self-host. Major inference providers have also started shipping built-in routing features that pick the right-sized model for each query. The research direction is converging on learned routers that are trained end-to-end on the joint objective of cost and quality, rather than hand-tuned pipelines. Recent work also explores using a single model with early-exit layers as an implicit cascade, getting some of the same savings without the operational overhead of managing multiple models. At General Compute, we serve a wide range of model sizes on our ASIC infrastructure. Customers building cascades often run the small and large stages of their pipeline on our API and use the cost savings to afford a stronger final-stage model than they could otherwise justify. 
If you are thinking about a cascade for your own workload, the latency we deliver at each stage makes the trade-offs easier: a faster small model means less total latency on escalated queries, which shifts the break-even point in favor of the cascade. Take a look at the docs if you want to see per-token speeds across sizes.

---

# Lookahead Decoding: Parallel Token Generation Without Draft Models

> Lookahead decoding from LMSYS speeds up autoregressive generation without requiring a draft model. We walk through the Jacobi iteration trick, the n-gram pool, and what the speedups actually look like in practice.

- Author: General Compute
- Published: 2026-04-20
- Tags: inference, papers, decoding, speculative-decoding, lookahead, llm
- Canonical: https://www.generalcompute.com/blog/lookahead-decoding-parallel-token-generation-without-draft-models

---

Speculative decoding has become the default way to accelerate autoregressive generation. The recipe is well known: run a small draft model to propose several tokens, then have the big model verify them in parallel. When the draft is good, you get multiple tokens per forward pass. When it is bad, you fall back to one token per pass. The catch is the draft model itself. You have to train or pick one, host it, keep its tokenizer aligned, and pay the memory cost of a second set of weights.

Lookahead decoding, introduced by Fu et al. at LMSYS in late 2023, gets you parallel token generation without a draft model. It uses the target model itself to fill multiple token positions per step, then verifies the guesses in the same forward pass. There is no second model, no separate training, no tokenizer alignment problem. You drop it into an existing serving stack and decode runs faster on the same weights.

The idea sits on top of a classical numerical method called Jacobi iteration.
It turns out that autoregressive decoding is structurally similar to a fixed-point problem, and Jacobi iteration is the textbook way to solve those in parallel. Lookahead decoding adapts the technique to language models, adds an n-gram cache to recycle work across steps, and folds the result into a single forward pass per iteration. This post walks through how it works, why it is faster than naive Jacobi, where it stops helping, and how it compares to draft-model approaches. ## Why Decoding Is Sequential in the First Place The standard autoregressive loop is sequential because each token depends on every token before it. Position t cannot be sampled until position t-1 has been chosen, embedded, and propagated through every transformer layer. There is no way to parallelize across the time dimension during generation, even though the prefill phase can process the entire prompt in one pass. Speculative decoding sidesteps this by guessing. If you can produce a plausible sequence of k tokens cheaply (with a draft model), you can run the target model once on those k positions, check whether each guess matches what the model would have sampled, and accept the longest matching prefix. The verification is cheap because it is a single batched forward pass. The expensive part is the guessing, which is why the draft model has to be small. Lookahead decoding takes a different angle. Instead of guessing with a separate model, it uses the target model to refine its own guesses across iterations. Each step generates new candidate tokens for several future positions, and over several iterations those candidates converge to the true autoregressive sequence. The key insight is that the convergence can happen in parallel inside a single forward pass. ## Jacobi Iteration Applied to Decoding Jacobi iteration is a method for solving systems of equations in parallel. 
Given a fixed-point equation x = f(x), you start with an initial guess and apply f to all components at once, getting a new guess. You repeat until the guesses stop changing. The appeal is that every component update is independent, so the work parallelizes well.

Autoregressive decoding can be cast as a fixed-point problem. Define a window of n future token positions. The "true" tokens for those positions satisfy `y_i = sample(model(x, y_{<i}))` for each position i in the window.

---

# Disaggregated Prefill and Decode (Splitwise / DistServe)

> Prefill and decode have different compute profiles and clash when they share a GPU. Splitwise and DistServe separate them onto different hardware pools. We walk through why, how, and when it actually pays off.

- Author: General Compute
- Published: 2026-04-19
- Tags: inference, papers, serving, prefill, decode, gpu
- Canonical: https://www.generalcompute.com/blog/disaggregated-prefill-and-decode

---

Most production LLM stacks still run prefill and decode on the same GPU. A request arrives, the serving engine processes the prompt, and then it streams tokens back, all from the same set of devices. Continuous batching stitches many requests together so the GPU stays busy. This works, and it is what vLLM, SGLang, and TensorRT-LLM do by default.

The problem is that prefill and decode are not the same kind of workload. Prefill is heavy, bursty, and compute-bound. Decode is lightweight per step, memory-bandwidth-bound, and lasts for hundreds or thousands of iterations. When you batch them together on one GPU, each phase interferes with the other. Prefill steals compute that decode needs for low per-token latency. Decode holds KV cache memory that prefill wants for concurrency. You can tune the balance with chunked prefill and priority scheduling, but you are still fitting two different workloads onto one resource.

Splitwise (Microsoft, 2023) and DistServe (Peking University and UCSD, 2024) take a different approach.
They split the two phases onto separate GPU pools and transfer the KV cache between them. Each pool runs the workload it is tuned for. The complication is the cache transfer, which has to be fast enough that the handoff does not add visible latency. This post walks through why disaggregation helps, how the two systems implement it, what the tradeoffs look like in practice, and when it is worth the extra plumbing. ## Two Very Different Phases Prefill runs once per request. It takes the input prompt, runs it through the model in one big forward pass, and populates the KV cache. The compute is dense matrix multiplication against the whole prompt length N. With modern GPUs and reasonable prompt sizes, prefill saturates the tensor cores. It is compute-bound, and the bottleneck is FLOPs. Decode runs once per output token. Each step takes a single new token, computes its Q, K, and V against the cached prefix, and produces one logit distribution. The matmul shapes are tiny: batch size by hidden dim. There is no large inner dimension to keep the tensor cores fed. The bottleneck is memory bandwidth, specifically the bandwidth to load the KV cache from HBM into the attention kernel. Two numbers make the asymmetry concrete. On an H100, a prefill pass on a 2K prompt runs at something like 800 tokens per millisecond of compute, because the work is dense. A decode step on the same model runs at maybe 50 tokens per millisecond at batch size 1, because it is bandwidth-limited. Increase the batch size and decode throughput grows nearly linearly until you run out of KV cache memory, while prefill throughput barely moves because each prefill already saturates the device. When you colocate them, you get a scheduling problem. A burst of prefills will push compute contention into any decode requests that happen to be in flight, spiking their time-per-output-token. A large decode batch holds KV cache capacity that a fresh prefill needs. 
Continuous batching tries to interleave them at the iteration level, and chunked prefill (Sarathi-Serve) tries to split prefills into smaller pieces that can slot in between decode steps. Both help, but both are fundamentally working around the fact that one GPU is trying to do two different jobs. ## The Disaggregation Idea Splitwise and DistServe both propose the same structural fix: run prefill on one set of GPUs and decode on another. A request hits a prefill node, the node computes the KV cache for the prompt, the cache is shipped over the interconnect to a decode node, and the decode node streams tokens until the request completes. The immediate benefit is that each pool can be sized and tuned for its own workload. Prefill nodes want high compute throughput and can live with moderate memory. Decode nodes want high memory bandwidth and lots of HBM for KV cache capacity. If you have a mix of GPUs available, say H100s and older A100s, you can assign them by phase instead of by request. Even if all your GPUs are identical, you can still tune batching policies, KV cache block sizes, and scheduling knobs independently for each pool. The second benefit is SLO separation. Latency targets for prefill (time to first token, TTFT) and decode (time per output token, TPOT) are distinct, and they tug in opposite directions. With disaggregation you can meet each one separately. Prefill nodes can run small batches to keep TTFT low. Decode nodes can run large batches to maximize throughput, because within a single decode step the per-token latency is not very sensitive to batch size until you hit the memory-bandwidth ceiling. The cost is the KV cache transfer. For a 32K-token prompt on Llama 3 70B with GQA, the cache is around 10 GB in FP16. You do not want to move that over a slow network. 
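To put rough numbers on that transfer, here is a back-of-the-envelope sketch. It assumes the usual KV cache layout (keys and values for every layer, KV head, and token) with Llama 3 70B's shape of 80 layers, 8 KV heads, and head dimension 128. The link bandwidths are nominal peak figures used as assumptions; real transfers see lower effective throughput, and none of these names come from Splitwise or DistServe.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_elem=2):
    """Keys plus values for every layer, KV head, and token position."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem


def transfer_seconds(num_bytes, gb_per_s):
    """Idealized transfer time at a given bandwidth (no protocol overhead)."""
    return num_bytes / (gb_per_s * 1e9)


# Nominal peak bandwidths in GB/s (assumptions for illustration).
LINKS_GB_PER_S = {
    "NVLink, H100 class (900 GB/s)": 900.0,
    "InfiniBand NDR (400 Gb/s)": 50.0,
    "10 Gb Ethernet": 1.25,
}

cache = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, tokens=32_000)
print(f"KV cache: {cache / 1e9:.1f} GB")
for name, bandwidth in LINKS_GB_PER_S.items():
    print(f"{name}: {transfer_seconds(cache, bandwidth) * 1e3:.0f} ms")
```

Even under these idealized assumptions the spread is wide: on the order of ten milliseconds over NVLink, a couple of hundred over a single InfiniBand link, and several seconds over commodity Ethernet.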
Both Splitwise and DistServe assume fast GPU interconnects (NVLink within a node, InfiniBand between nodes) and pipeline the transfer so that later layers of the cache are moving while earlier layers are already being consumed by decode. ## Splitwise Splitwise was the first system to propose this split publicly. Patel et al. from Microsoft and the University of Washington observed in 2023 that production Azure workloads had extremely bimodal resource usage. Prefill dominated GPU compute time but a tiny fraction of wall time. Decode dominated wall time but used a small fraction of peak compute. Running both on the same hardware meant either overprovisioning for prefill (wasting decode-phase compute) or underprovisioning (hurting TTFT). Their design assigns request phases to two distinct machine pools. A prefill machine handles input processing for any request, writes the resulting KV cache into a buffer, and hands the request off. A decode machine picks up the request, ingests the cache, and generates output tokens. The handoff uses RDMA over InfiniBand to transfer the cache with minimal CPU involvement. A key Splitwise finding is that the optimal ratio of prefill to decode GPUs depends on workload characteristics, specifically the mean prompt length and output length. Workloads with long prompts and short outputs (summarization, extraction) want more prefill capacity. Workloads with short prompts and long outputs (code generation, reasoning chains) want more decode capacity. With colocated serving, you cannot adjust the ratio. With disaggregation, you just change the GPU counts in each pool. Splitwise also shows a cost-efficiency angle. You can use different GPU SKUs for the two phases. Decode nodes benefit from high HBM bandwidth and capacity but do not need the absolute highest FLOPs. If older GPUs have enough bandwidth for decode, you can keep them in service as decode-only nodes while newer GPUs handle prefill. 
This extends the useful life of a heterogeneous fleet. ## DistServe DistServe, from Zhong et al. in 2024, pushes the idea further and makes the analysis crisper. They formulate serving as a joint optimization over four variables: parallelism strategy (tensor/pipeline/replica counts) for prefill, same for decode, and batching policies for each phase. With colocated serving, you have to pick one configuration that works reasonably for both phases. With disaggregation, each phase is a separate optimization. Their experiments show that disaggregation can hit tighter TTFT and TPOT SLOs at the same GPU count, or meet the same SLOs with fewer GPUs. The gains are largest when workload latency targets are strict. For workloads where SLOs are loose (offline batch inference, low-priority traffic), the overhead of transfer and the loss of cross-phase batching flexibility often outweigh the benefits. DistServe also runs a careful analysis of the KV cache transfer overhead. On NVLink, the transfer for a single request can happen in parallel with the first few decode steps, effectively hiding the cost. Across nodes on InfiniBand, there is a few hundred microseconds of unavoidable latency, but for prompts where prefill itself took tens of milliseconds, this is a small addition to TTFT. The place where transfer cost starts to hurt is very short prompts with strict TTFT SLOs, where the overhead is comparable to the prefill itself. For that regime, colocated serving is probably still the right answer. ## What Actually Changes in the Stack Implementing disaggregation requires a few pieces that do not exist in a typical serving engine. A shared request queue sits in front of both pools. It tags each request with its current phase and routes accordingly. When prefill finishes, the queue re-enqueues the request with the decode pool, along with metadata about where its KV cache lives. A cache transport layer moves the KV tensors. 
In practice this is built on something like NCCL, UCX, or a custom RDMA path. The transfer is typically pipelined per transformer layer, so decode can start on early layers while later layers are still moving. The decode engine has to accept a "resume from cache" request rather than always starting from scratch. This is a small API change but it cascades through scheduling, since the decode node has to validate that the cache fits in its memory before accepting the handoff. There are failure modes. If a decode node dies mid-generation, the request is stranded unless the cache can be re-transferred from somewhere or the prompt re-prefilled on another node. If the prefill pool is saturated but the decode pool has capacity (or vice versa), requests queue on one side while resources sit idle on the other. Good routing and autoscaling help, but heterogeneous pools are harder to operate than homogeneous ones. ## When Disaggregation Pays Off Based on the published numbers and what we see in practice: Disaggregation helps most when prefill and decode workloads are large enough to justify separate pools, latency SLOs are strict on one or both phases, and your fleet has fast interconnect between nodes. The classic wins are latency-sensitive chat, voice agents, and coding assistants where TTFT and TPOT both matter and users notice interference when they collide. It helps less when prompts are short and outputs are short (because prefill and decode are both cheap and the handoff overhead dominates), when you only have a handful of GPUs (because you cannot meaningfully split them), or when your workload is highly bursty and benefits from cross-phase batching flexibility. Chunked prefill with priority scheduling, done well, closes some of the gap for colocated setups. Sarathi-Serve's approach of splitting prefill into small chunks and interleaving them with decode steps is cheaper to implement than full disaggregation and captures a meaningful fraction of the benefit. 
If you are not already running at scale, chunked prefill is the first thing to try.

## How Serving Stacks Are Adopting This

By early 2026, disaggregation has moved from research papers to production systems. NVIDIA's Dynamo and TensorRT-LLM both ship disaggregated serving as a supported mode. vLLM has prototype support. SGLang has published disaggregation benchmarks. Most cloud inference providers operating at scale run some form of split deployment internally, even if they do not expose the split to users.

The remaining engineering complexity is real. You need good autoscalers for each pool, you need observability that tracks where requests are spending time, and you need to handle cache transfer failures gracefully. For teams serving at single-node scale, these costs still outweigh the benefits. For teams serving across dozens or hundreds of GPUs with strict latency targets, the arithmetic usually flips.

At General Compute, we care about disaggregation because it is one of the levers that makes strict latency SLOs achievable at scale. Voice agents and real-time coding assistants are the workloads where a 50ms blip in TPOT is the difference between feeling instant and feeling sluggish. The more we can isolate phases and run each on hardware tuned for its bottleneck, the tighter those SLOs get. If you are building something where inference latency is the user-visible constraint, our API is designed around this kind of serving architecture. Take a look at the docs to see how the throughput and latency numbers translate to your workload.

---

# KV Cache Compression: MLA and Beyond

> DeepSeek's Multi-Head Latent Attention cuts the KV cache by an order of magnitude without giving up quality. We walk through MLA, how it compares to MQA and GQA, and the other compression techniques worth knowing.
- Author: General Compute
- Published: 2026-04-18
- Tags: inference, papers, kv-cache, attention, deepseek
- Canonical: https://www.generalcompute.com/blog/kv-cache-compression-mla-and-beyond

---

The KV cache is the single largest variable cost in transformer inference. Model weights are fixed. The KV cache grows with every token of context, for every request currently in flight. On a serving node, it is usually the KV cache, not the weights, that decides how many concurrent users you can handle and how long a context you can support.

DeepSeek's Multi-Head Latent Attention (MLA), introduced in DeepSeek-V2 and refined in V3, attacks this problem at the architecture level. Instead of caching the key and value tensors directly, MLA caches a low-rank projection of them and reconstructs the per-head keys and values on the fly during attention. The cache footprint drops by roughly an order of magnitude compared to standard multi-head attention, and benchmark quality stays essentially unchanged. MLA is the reason DeepSeek can serve very long contexts at competitive cost.

MLA is not the only way to compress the KV cache, and it is worth understanding how it fits with the other techniques the field has developed. This post walks through MLA in detail, compares it to MQA and GQA, and then covers the main alternatives: quantized KV caches, eviction policies like H2O and StreamingLLM, and runtime factorization.

## Why the KV Cache Is the Bottleneck

During autoregressive decoding, every new token needs to attend to every previous token. The naive approach would recompute K and V for the whole prefix at each step, which is O(N^2) in compute for a context of length N. The standard optimization is to cache the keys and values after the first forward pass and reuse them. New tokens only compute their own K and V and append them to the cache.
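The difference between recomputing and caching can be seen in a toy single-head, single-layer decode loop. This is a minimal sketch in plain Python with made-up dimensions and random weights. The per-token hidden states are supplied up front rather than produced by a real model, and all function names are hypothetical. It counts K/V projections to show the quadratic versus linear growth, and checks that both loops produce the same attention outputs.

```python
import math
import random


def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def attend(q, keys, values):
    """Softmax attention of one query over all available positions."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def decode_naive(xs, Wq, Wk, Wv):
    """Recompute K and V for the whole prefix at every step."""
    outputs, projections = [], 0
    for t in range(1, len(xs) + 1):
        keys = [matvec(x, Wk) for x in xs[:t]]
        values = [matvec(x, Wv) for x in xs[:t]]
        projections += 2 * t
        outputs.append(attend(matvec(xs[t - 1], Wq), keys, values))
    return outputs, projections


def decode_cached(xs, Wq, Wk, Wv):
    """Project only the new token and append its K and V to the cache."""
    keys, values, outputs, projections = [], [], [], 0
    for x in xs:
        keys.append(matvec(x, Wk))
        values.append(matvec(x, Wv))
        projections += 2
        outputs.append(attend(matvec(x, Wq), keys, values))
    return outputs, projections


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d = 4
Wq, Wk, Wv = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)
xs = rand_matrix(6, d)  # six token hidden states (toy inputs)

naive_out, naive_proj = decode_naive(xs, Wq, Wk, Wv)
cached_out, cached_proj = decode_cached(xs, Wq, Wk, Wv)
print("K/V projections, naive vs cached:", naive_proj, cached_proj)
```

The price of the cached version is the memory held in `keys` and `values`, which grows with every token. In a real model that memory is multiplied by the number of layers and KV heads.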
The per-request size of this cache, for a model with L layers, h heads (KV heads, if the model uses MQA or GQA), head dimension d_h, and sequence length N, is:

```
kv_cache_bytes = 2 * L * h * d_h * N * sizeof(dtype)
```

For Llama 3 70B with 80 layers, 8 KV heads (it uses GQA), head dim 128, FP16, and a 32K context, that is `2 * 80 * 8 * 128 * 32000 * 2 = 10.5 GB` per request. For a vanilla multi-head model with 64 heads instead of 8, the same calculation gives 84 GB per request. That is why nobody ships vanilla MHA at large scale anymore.

Everything that improves inference throughput (continuous batching, PagedAttention, prefix caching) has to reckon with the fact that concurrent users share a limited HBM budget, and the KV cache is how that budget gets spent. Shrinking it directly increases how many requests fit on a GPU.

## MQA and GQA: the First Round of Compression

Multi-Query Attention (MQA), from Noam Shazeer in 2019, noticed that the Q side of attention needs to be per-head to preserve expressivity, but K and V do not. MQA ties all query heads to a single shared K and V, dropping the KV cache by a factor of h. The tradeoff is a quality regression that shows up clearly on harder benchmarks.

Grouped-Query Attention (GQA), from Ainslie et al. in 2023, is the compromise most modern models use. It groups query heads into g groups and uses g KV heads, one per group. Llama 3 uses 64 query heads and 8 KV heads, so g=8, which means an 8x cache reduction versus MHA. GQA preserves most of MHA's quality and has become the default.

The limitation of both MQA and GQA is that they directly reduce the number of KV heads. You can only push this so far before quality collapses. You also still cache full-precision K and V tensors for every retained head, so the cache still scales linearly with context length and with the number of KV heads you keep.

## Multi-Head Latent Attention

MLA takes a different path.
Instead of reducing the number of heads, it keeps all heads but caches a compressed representation of K and V, then reconstructs the full per-head tensors inside the attention computation using a stored projection.

The high-level structure looks like this:

1. Compute a shared low-rank latent vector `c_kv` from the hidden state. This latent has dimension d_c, much smaller than `h * d_h`.
2. Cache only `c_kv` for each token.
3. At attention time, project `c_kv` up to the full per-head K and V using two matrices `W^UK` and `W^UV` that are learned and shared across positions.

Concretely, for each token:

```
c_kv = x @ W^DKV     # shape (d_c,)
K = c_kv @ W^UK      # shape (h, d_h)
V = c_kv @ W^UV      # shape (h, d_h)
```

The per-token cache is just `c_kv`, which is `d_c * sizeof(dtype)` bytes per layer. For DeepSeek-V2, d_c is 512, while comparable MHA would cache `2 * h * d_h = 2 * 128 * 128 = 32768` scalars per token per layer (K and V together), so the latent alone shrinks the cache by roughly 64x at equal precision.

There is a clever reformulation that makes this efficient. You do not actually need to materialize the full K matrix at each step. The attention score for query head i at position t is:

```
score_i = (Q_i^t) @ K_i^{1..t}^T
        = (Q_i^t @ W^UK_i^T) @ c_kv^{1..t}^T
```

Because `W^UK` is fixed, you can absorb it into the query projection. The attention computation turns into a product between a modified query and the cached `c_kv`, with no need to expand to per-head K. The same trick works for V: you absorb `W^UV` into the output projection. The result is that MLA's attention kernel operates directly on the compressed latent dimension, which also reduces memory bandwidth during decode, where the attention step is memory-bound.

## The RoPE Problem and the Decoupled Fix

There is a subtlety that the simple version of MLA does not handle. Rotary Position Embedding (RoPE) applies a position-dependent rotation to Q and K before the attention score.
If K is reconstructed from a compressed latent using a fixed matrix, the latent has to be rotated too, and that rotation depends on the absorbing position. The absorption trick above stops working, because `W^UK` can no longer be pulled into the query projection cleanly once it is interleaved with a position-dependent rotation. DeepSeek's fix is what they call decoupled RoPE. They split each head into two parts: a non-positional part of dimension d_h^nope, reconstructed from the latent as above, and a positional part of dimension d_h^rope that is cached separately after applying RoPE. The positional part is shared across heads, much like MQA, so the extra cache cost is small. The attention score is the sum of a score on the latent part (where the matrix absorption trick works) and a score on the RoPE part (which uses a small shared cache). This is not conceptually elegant, but it is the engineering compromise that makes MLA work with RoPE, and RoPE is non-negotiable for long-context quality. DeepSeek-V2 uses d_h^nope = 128 and d_h^rope = 64. The total cache per token is `d_c + d_h^rope` per layer, which for their setup is `512 + 64 = 576` scalars, compared to `h * d_h = 128 * 128 = 16384` for vanilla MHA. That is a 28x reduction. ## How MLA Compares in Practice DeepSeek-V2 reports that MLA matches MHA on standard benchmarks while cutting the KV cache substantially. The key numbers from the paper: - KV cache per token: about 6.7% of Llama 3 70B's GQA cache at equivalent context length. - Generation throughput: roughly 5.76x that of DeepSeek 67B (a dense MHA model with comparable parameter count). - Training cost: competitive with dense baselines, since the extra projections are modest. For DeepSeek-V3, which is a 671B parameter MoE with 37B active parameters, MLA is what makes long-context serving economically viable. Without it, the cache would dominate memory even on 8xH100 nodes. Retrofitting MLA into an existing model is not trivial. 
The latent dimension and the RoPE decoupling have to be baked into the architecture and trained from scratch, or at least with heavy finetuning. You cannot drop MLA into a Llama 3 checkpoint. This is a real barrier to adoption for teams that already have trained models they want to keep serving, and it is why most of the open ecosystem is still on GQA.

## Quantized KV Caches

A complementary line of work compresses the KV cache at runtime, without changing the architecture. The simplest version stores K and V in INT8 or INT4 instead of FP16. KIVI (Liu et al., 2024) showed that you can quantize keys per-channel and values per-token down to 2 bits with minimal quality loss, using asymmetric quantization with per-group scales. That is nominally an 8x reduction versus FP16 (slightly less once the scales and zero points are stored), on top of whatever architectural compression you already have.

Quantized caches are attractive because they are orthogonal to MLA, MQA, or GQA. You can quantize an MLA cache just as easily as a GQA cache. The cost is extra compute at attention time, since you have to dequantize on the fly, and the implementation has to be careful about kernel performance. vLLM and SGLang both ship INT8 KV cache options, and the quality regression is small enough to be acceptable for most workloads.

FP8 KV caches, which Hopper GPUs support natively, are becoming common for production serving. They give 2x compression versus FP16 with essentially no quality impact and no dequantization overhead, since the attention kernels can operate directly on FP8.

## Eviction: H2O and StreamingLLM

A different approach asks whether you need to cache every token at all. H2O (Heavy Hitter Oracle) from Zhang et al. (2023) observes that attention is highly skewed in practice. A small subset of tokens, the "heavy hitters," attract most of the attention mass across layers, and the rest can be evicted with little effect on output quality. H2O keeps a fixed-size cache that evicts based on accumulated attention scores.
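To make the eviction idea concrete, here is a minimal sketch of a fixed-budget cache that drops the token with the least accumulated attention. It is a toy under stated assumptions, not H2O's implementation: the class name `HeavyHitterCache`, the `recent` parameter (a small protected window of newest tokens), and the hand-written attention weights are all hypothetical, and a real system would track scores per layer and per head.

```python
class HeavyHitterCache:
    """Toy fixed-budget KV cache index with accumulated-attention eviction."""

    def __init__(self, budget: int, recent: int = 1) -> None:
        # `recent` newest tokens are never evicted (assumption for this sketch).
        assert 0 <= recent < budget
        self.budget = budget
        self.recent = recent
        self.positions: list[int] = []      # token positions currently cached
        self.scores: dict[int, float] = {}  # accumulated attention per position

    def step(self, position: int, attention: dict[int, float]) -> None:
        """Record the attention the new query paid to cached tokens, then cache it."""
        for pos, weight in attention.items():
            if pos in self.scores:
                self.scores[pos] += weight
        self.positions.append(position)
        self.scores[position] = 0.0
        if len(self.positions) > self.budget:
            self._evict()

    def _evict(self) -> None:
        # Only tokens outside the protected recent window are candidates.
        candidates = self.positions[: len(self.positions) - self.recent]
        victim = min(candidates, key=lambda p: self.scores[p])
        self.positions.remove(victim)
        del self.scores[victim]


cache = HeavyHitterCache(budget=3, recent=1)
cache.step(0, {})
cache.step(1, {0: 0.9})
cache.step(2, {0: 0.7, 1: 0.3})
cache.step(3, {0: 0.5, 1: 0.05, 2: 0.45})
print(cache.positions)  # token 1 attracted the least attention and is evicted
```

Note that newer tokens have had fewer steps to accumulate score, which is one reason this sketch protects a recent window.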
StreamingLLM (Xiao et al., 2023) goes further and identifies an "attention sink" effect: the first few tokens of a sequence receive disproportionate attention, regardless of content, because of how softmax normalizes. Keeping those initial tokens plus a sliding window of recent tokens allows models to generate indefinitely without quality collapse, even when the true context far exceeds their training-time window.

These methods are useful when you cannot afford to keep the full cache and are willing to accept some quality degradation on long-range dependencies. They compose with everything else: you can run MLA plus INT4 quantization plus H2O eviction, and multiply the compression ratios.

## Runtime Low-Rank Factorization

There is a middle ground between architectural changes (MLA) and runtime compression (quantization, eviction): compress the cached K and V with a learned or SVD-based low-rank factorization applied after training. Methods like LESS and EVA fit small projection matrices that map cached K and V to a lower-rank subspace, and store the factors instead of the full tensors. Quality is not as good as MLA, which is trained end-to-end with the compression in place, but these methods can be applied to existing checkpoints.

## What Actually Matters for Serving

If you are choosing a model to serve at scale, the practical picture looks like this:

- **Training a new model**: MLA is the strongest option if you care about long-context serving economics. It is harder to implement than GQA and requires careful attention to the RoPE decoupling, but the cache savings are decisive.
- **Serving an existing GQA model**: FP8 or INT8 KV cache quantization is the first lever. It is well-supported in modern serving stacks and the quality hit is negligible.
- **Serving under extreme memory pressure**: Stack quantization with eviction policies like StreamingLLM if your workload tolerates some accuracy loss on very long dependencies.
- **Prefill-heavy workloads**: Prefix caching and sharing across requests (what SGLang's RadixAttention does) matters more than compression, because the cache is populated once and reused many times.

The KV cache is a shared resource across requests. Every byte you cut out of a single request's cache is a byte you can spend on another concurrent user. MLA is remarkable because it attacks the problem at the source, in the architecture, rather than layering on compression after the fact. For inference providers, the arithmetic compounds quickly: an 8x smaller cache means 8x more concurrent users per GPU, or 8x longer context at the same concurrency, or some mix of both.

At General Compute, we care about this because inference speed and concurrency are what let our customers build real-time applications on top of our API. Architectures like MLA, combined with the right runtime stack, are what make million-token contexts a product feature rather than a benchmark stunt.

If you are building a voice agent or a coding agent and bumping into KV cache limits, come talk to us about the tradeoffs. The landscape has moved fast over the last year, and what was infeasible in early 2024 is now routine.

---

# Ring Attention: Scaling Context to Millions of Tokens

> Ring Attention distributes the attention computation across devices in a ring topology, overlapping KV transfer with compute so context length scales linearly with the number of GPUs.

- Author: General Compute
- Published: 2026-04-17
- Tags: inference, papers, long-context, distributed
- Canonical: https://www.generalcompute.com/blog/ring-attention-scaling-context-to-millions-of-tokens

---

Attention scales quadratically with sequence length in FLOPs and linearly in memory once you have a memory-efficient kernel like FlashAttention. The memory part is the harder constraint in practice.
A single H100 has 80GB of HBM, and by the time you load model weights, optimizer states (during training), and the activations you need for backward, you run out of room for the KV cache long before you hit interesting context lengths. Doubling the context doubles the KV cache. At some point the query and key tensors just do not fit on one device.

Ring Attention, introduced by Hao Liu, Matei Zaharia, and Pieter Abbeel at Berkeley in 2023, solves this by spreading both the queries and the KV across a ring of devices, then rotating KV blocks around the ring while each device computes partial attention on whatever block it currently holds. The memory per device stays bounded, and the maximum context length scales linearly with the number of devices in the ring. With enough hardware, you get context windows of millions of tokens without materializing the full attention matrix anywhere.

This is one of the techniques that made million-token context windows practically possible. It is also a nice example of a distributed algorithm where the communication pattern matters as much as the arithmetic.

## The Memory Wall for Long Context

Before getting into Ring Attention itself, it helps to be precise about what constrains context length on a single device.

For a transformer with hidden dimension d, sequence length N, and batch size B, the activations from a single attention layer are O(B * N * d) for the input and output tensors, plus whatever intermediate buffers the attention kernel needs. FlashAttention reduced the intermediate requirement from O(N^2) to O(N), which was the original breakthrough. But you still need to store Q, K, and V themselves, and you still need to hold the running output.

If N is 1M tokens and d is 8192 with FP16, then each of Q, K, V is 1M * 8192 * 2 bytes = 16GB per layer, per batch item. You cannot fit that on one GPU. Not even close, once you add the model weights.
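The arithmetic above is easy to reproduce. The sketch below assumes "1M" means 2^20 tokens, so the result lands on exactly 16 GiB. With 10^6 tokens it would be about 16.4 GB. The function name and the choice of 8 devices are illustrative only.

```python
def tensor_bytes(seq_len: int, hidden_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes to hold ONE of Q, K, or V for one layer and one batch item."""
    return seq_len * hidden_dim * bytes_per_elem


N = 2**20   # "1M" tokens, read as 2^20 (assumption)
D = 8192    # hidden dimension from the text
GIB = 2**30

one_tensor = tensor_bytes(N, D)                # a single Q, K, or V
all_three = 3 * one_tensor                     # Q, K, and V together
per_device = tensor_bytes(N // 8, D)           # one tensor sharded over 8 devices

print(f"one tensor:   {one_tensor / GIB:.1f} GiB per layer")
print(f"Q + K + V:    {all_three / GIB:.1f} GiB per layer")
print(f"1/8 sequence: {per_device / GIB:.1f} GiB per tensor per layer")
```

Splitting the sequence dimension is the only knob here that shrinks the per-device figure, since neither the hidden dimension nor the dtype changes with the number of devices.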
The standard response is tensor parallelism and pipeline parallelism, which shard the model along the hidden dimension or across layers. Neither of these helps with the sequence dimension. If your problem is that a single sequence is too long for one device, splitting the model across devices does not buy you anything on the KV cache side.

Sequence parallelism is the obvious answer: split the sequence N across devices. But attention is not a pointwise operation. Every query token needs to see every key token, which means if you shard along N, each device needs access to the full K and V at some point during the computation. Naively, that means all-to-all communication, which is expensive.

Ring Attention is the clean way to structure that communication.

## The Ring Attention Algorithm

Assume you have P devices arranged in a logical ring: device 0 talks to device 1, device 1 talks to device 2, and so on, with device P-1 wrapping back to device 0. Split the sequence of length N into P equal blocks. Device i holds block i of the queries Q_i, and initially also holds block i of the keys and values, K_i and V_i.

The algorithm proceeds in P rounds. On round t, each device i:

1. Computes partial attention using its local Q_i against whatever K, V block it currently holds.
2. Accumulates the partial output using an online softmax, just like FlashAttention does internally.
3. Sends its current K, V block to the next device in the ring, and receives a new K, V block from the previous device.

After P rounds, each device has computed attention against every K, V block in the sequence, and the accumulated output on device i is the correct attention output for query block Q_i.

The trick that makes this fast is that the send and receive on step 3 happen concurrently with the compute on step 1. Modern GPUs have dedicated copy engines and NVLink or InfiniBand interconnects that can move data independently of the SMs doing math.
If the compute for one block of attention takes roughly as long as transferring one K, V block to the next device, you get the communication essentially for free. The total runtime is dominated by compute, and the communication hides behind it.

The memory on any single device is bounded by two K, V blocks (the current one and the one being transferred) plus one Q block plus the accumulated output. That is O(N/P) total, which is exactly what you want.

## Why It Works: Blockwise Attention Plus the Ring

Ring Attention is really the combination of two ideas:

**Blockwise attention with online softmax.** This is the part that lets you compute attention incrementally over chunks of K and V without ever materializing the full N by N score matrix. It is the same math that FlashAttention uses internally. You maintain running statistics (the max and the sum of exponentials) and correct the accumulated output as new blocks come in. The result is numerically identical to standard attention, not an approximation.

**Ring topology for communication.** The ring is the key to making the communication cost scale well. If you did the same computation with a broadcast or all-to-all, each device would need to receive K, V blocks from every other device in a short burst, which saturates the network. In a ring, each device only communicates with two neighbors at a time, and the bandwidth each device uses per step is constant regardless of how many devices you have. The time per round is a single K, V block transfer of size N/P, and there are P rounds, so the total communication per device is O(N), independent of P.

This is not just a convenient structure. It is the reason Ring Attention scales. If the communication cost grew with P, adding more devices to extend context would eventually stop helping.

## Ring FlashAttention

The natural next step is to combine Ring Attention with FlashAttention.
On each device, instead of using a standard attention kernel to compute the partial attention against the current K, V block, you use FlashAttention. This gives you the best of both: the intra-device computation is memory-efficient and tiled to fit in SRAM, and the inter-device computation is distributed around the ring.

The implementation needs a few tweaks. The FlashAttention kernel normally does its own online softmax internally and emits a final output plus a log-sum-exp statistic. When you are composing it across ring rounds, each round produces a partial output and its own log-sum-exp, and you combine them across rounds using the same online softmax correction that FlashAttention uses internally. So you end up with an online softmax nested inside an online softmax, which sounds terrible but is actually just careful bookkeeping.

Many production implementations of long-context serving use some variant of Ring FlashAttention. The ring handles the inter-device dimension, FlashAttention handles the intra-device dimension.

## Striped Attention: The Load Balancing Problem

There is a subtle issue with the simple ring algorithm, which is that causal masking creates a load imbalance.

In a causal transformer (which is what decoder-only LLMs use), each query token can only attend to key tokens at positions less than or equal to its own. If you split the sequence into contiguous blocks, block i contains queries at positions [i*N/P, (i+1)*N/P). When block i computes attention against block j:

- If j < i, the full block is below the diagonal, so every query-key pair is valid. This is a full compute load.
- If j > i, the full block is above the diagonal, so no query-key pair is valid. The device does essentially no work.
- If j == i, you are on the diagonal and about half the pairs are valid.
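The three cases above can be checked by brute force. This sketch counts valid causal pairs for every (query block, key block) combination under a contiguous partition. The sizes (4 devices, 8 tokens per block) and the function name are arbitrary choices for illustration.

```python
def valid_pairs(i: int, j: int, block: int) -> int:
    """Count causal (query, key) pairs when query block i meets key block j.

    A pair is valid when the key position is <= the query position.
    """
    count = 0
    for q in range(i * block, (i + 1) * block):
        for k in range(j * block, (j + 1) * block):
            if k <= q:
                count += 1
    return count


P, BLOCK = 4, 8          # 4 devices, 8 tokens per block (illustrative)
FULL = BLOCK * BLOCK     # pairs in a fully unmasked block

# work[i][j] = useful pairs when device i processes key block j
work = [[valid_pairs(i, j, BLOCK) for j in range(P)] for i in range(P)]

for i, row in enumerate(work):
    print(f"device {i}: pairs per key block {row}, total {sum(row)} of {P * FULL}")
```

The per-device totals differ widely: the device holding the first query block has useful work only on its own diagonal block, while the device holding the last query block has useful work against every key block.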
The result is that, averaged over devices, only about half of the rounds carry useful work, and the load is uneven: the device holding the first query block has real work only on its own block, while the device holding the last block has work in every round. Because every round takes as long as its busiest device, the overall throughput is cut roughly in half.

Striped Attention, a follow-up by William Brandon and others in 2023, fixes this by changing how the sequence is partitioned. Instead of giving each device a contiguous chunk of tokens, you interleave tokens so that each device gets every P-th token. Device i holds tokens at positions i, P+i, 2P+i, and so on.

Now when device i computes attention against block j, the set of valid pairs is roughly the same regardless of which block j is, because you are always comparing interleaved slices that span the whole sequence. The compute load per round becomes uniform, and you recover the full theoretical throughput.

In practice, you often want to apply striping at the block level rather than the individual-token level, because individual-token striping messes up things like rotary position embeddings and the memory access patterns of the underlying kernels. Block-level striping (device 0 gets blocks 0, P, 2P, and so on; device P-1 gets blocks P-1, 2P-1, and so on) gets most of the load balancing benefit without the complications.

## Inference Versus Training

Ring Attention was originally pitched as a training technique, and the original paper's benchmarks focused on training throughput. But it is at least as important for inference, possibly more so.

During training, you have large batches and can usually absorb long sequences by shrinking the batch size or using gradient accumulation. The activation memory is the bottleneck, and Ring Attention addresses that.

During inference, the bottleneck is different. The KV cache grows linearly with context and has to be kept in memory across every decode step. For a single long-context request (one user asking a question about a million-token document), you often have no batch dimension to shrink.
The KV cache alone for a 1M token context on a 70B model is about 340GB in FP16. You need to shard it across devices, and when you go to compute attention for the next decoded token, the query on one device needs to see the KV on all the other devices.

That is exactly the setup Ring Attention was designed for, just applied to decoding instead of prefill. The query tensor for a single decode step is tiny (one token worth of Q), so you can afford to broadcast or replicate it. The KV is the expensive part, and the ring handles the sharded compute naturally.

Some production serving systems use a hybrid: Ring Attention for prefill (where the query is long and the ring is well-balanced), and a different strategy for decode (where the query is one token and you can be cleverer about which devices need to participate at all). But the underlying primitive is the same.

## What This Enables

Million-token context windows, like what Gemini 1.5 Pro shipped with, are not possible without something like Ring Attention under the hood. Google has not published the exact architecture, but the publicly available description of their approach makes it clear they are sharding sequences across devices with overlapped communication. Ring Attention is the reference algorithm for this class of technique, and the techniques described in the paper predate Gemini's long-context release.

The broader point is that context length is now primarily a systems problem, not an algorithmic one. The attention math itself has been well understood for years. What changed is how efficiently you can execute that math across a large fleet of accelerators, with the communication topology being as important as the arithmetic. Ring Attention is one of the cleanest examples of how to get the topology right.

## Why Purpose-Built Hardware Matters Here

Ring Attention squeezes the most out of the hardware topology you have.
But the underlying constraint is still memory bandwidth and interconnect bandwidth on chips that were not designed for this workload. NVLink and InfiniBand are fast, but they are general-purpose, and the choreography required to keep a ring saturated is fragile.

General Compute is the only neocloud built entirely on inference-optimized ASICs. The interconnect fabric between chips is designed specifically for the access patterns that long-context inference needs, including the streaming KV transfers that Ring Attention relies on. Combined with much higher on-chip memory per accelerator, the pressure on sequence parallelism is reduced in the first place. You need fewer devices in your ring, and the rounds are faster, so long-context serving is faster and cheaper end to end.

If you want to try running long-context inference on hardware that was purpose-built for it, [sign up at generalcompute.com](https://generalcompute.com) and get $200 in free credit to benchmark it against your current setup.

## Papers and References

- [Ring Attention with Blockwise Transformers for Near-Infinite Context](https://arxiv.org/abs/2310.01889) (Liu, Zaharia, Abbeel, 2023)
- [Blockwise Parallel Transformer for Large Context Models](https://arxiv.org/abs/2305.19370) (Liu and Abbeel, 2023)
- [Striped Attention: Faster Ring Attention for Causal Transformers](https://arxiv.org/abs/2311.09431) (Brandon et al., 2023)
- [World Model on Million-Length Video and Language with Blockwise RingAttention](https://arxiv.org/abs/2402.08268) (Liu et al., 2024)

---

# Quantization for Inference: GPTQ, AWQ, SmoothQuant, and FP8

> Quantization shrinks model weights from 16-bit to 4-bit or 8-bit, cutting memory usage and speeding up inference. Here's how the major techniques work and when to use each one.
- Author: General Compute
- Published: 2026-03-26
- Tags: inference, papers, deep-dive
- Canonical: https://www.generalcompute.com/blog/quantization-for-inference-gptq-awq-smoothquant-fp8

---

A 70 billion parameter model stored in FP16 (16-bit floating point, the standard precision for LLMs) takes about 140GB of memory. That's two A100 80GB GPUs just to load the weights, before you even account for the KV cache and other overhead.

Quantization reduces the precision of those weights, from 16 bits down to 8 or even 4 bits per parameter. A 70B model in 4-bit takes about 35GB, fitting on a single GPU. The model runs faster (less data to move from memory) and uses less memory (room for more concurrent requests), with surprisingly little quality loss.

The catch is that naive quantization (just rounding everything to lower precision) destroys model quality. The four techniques covered here each found clever ways to quantize accurately.

## Weight-Only vs. Weight-and-Activation Quantization

Before diving into specific methods, it helps to understand the two main categories.

**Weight-only quantization** (GPTQ, AWQ) shrinks the stored model weights to 4-bit or 8-bit, but during computation, those weights get dequantized (converted back) to FP16 before the actual matrix multiplication happens. The speed benefit comes entirely from reduced memory bandwidth: reading 4-bit weights moves 4x less data than reading 16-bit weights. The math itself still runs in FP16.

This is ideal for the decode phase (generating tokens one at a time), which is almost entirely memory-bandwidth-bound. You're spending most of your time reading weights, so making them smaller directly speeds things up.

**Weight-and-activation quantization** (SmoothQuant, FP8) quantizes both the weights and the input activations (the data flowing through the network), so the actual matrix multiplication runs in lower precision (INT8 or FP8) on specialized hardware (tensor cores).
This speeds up both the memory transfer and the compute.

This helps most during the prefill phase (processing the input prompt), which is more compute-bound because you're processing many tokens in parallel. Faster math means faster prefill.

## GPTQ: The First Practical Large-Model Quantization

GPTQ (Frantar et al., October 2022) was the first method to make post-training quantization (quantizing after training, without retraining) work well on models with 100B+ parameters.

The core idea comes from a family of techniques called Optimal Brain Quantization. GPTQ quantizes weights one column at a time, and after quantizing each column, it adjusts the remaining unquantized columns to compensate for the error introduced. The adjustment uses second-order information (based on the Hessian matrix, which captures how sensitive the model's output is to changes in each weight) computed from a small calibration dataset.

The key practical innovation was making this process fast enough to run on large models. GPTQ can quantize a 175B parameter model in a few hours on a single GPU, which was previously impractical.

**Results:** 3-bit and 4-bit quantization with minimal accuracy loss on models up to 175B parameters. A 4-bit quantized 70B model fits on a single 80GB GPU and runs roughly 2-3x faster than the FP16 version due to reduced memory bandwidth.

**Tradeoff:** Weight-only, so the compute itself is still FP16. The speedup comes purely from less memory to read.

## AWQ: Activation-Aware Weight Quantization

AWQ (Lin et al., MIT Han Lab, June 2023) took a different approach. Instead of compensating for error after quantization, it identified which weights are most important to preserve accurately before quantizing.
The key observation: only about 1% of weight channels are "salient" (critically important for output quality), and you can identify them by looking at the activation magnitudes (how large the values flowing through the network are at each position), not the weight magnitudes. Channels that see large activations are the ones where quantization error hurts the most.

AWQ applies a mathematically equivalent scaling transformation that makes the salient channels larger (and therefore less affected by rounding) while making less important channels smaller. After this transformation, standard quantization works much better because the important information is protected.

**Results:** Generally shows less accuracy degradation than GPTQ, especially at very low bit-widths (3-bit). Won the MLSys 2024 Best Paper Award.

**Why it matters for serving:** AWQ is hardware-friendly because it doesn't use mixed-precision (which would require special handling). All weights are the same bit-width, making kernel implementation straightforward. This is why AWQ is widely supported in vLLM, TensorRT-LLM, and other serving frameworks.

## SmoothQuant: Making Activation Quantization Work

Both GPTQ and AWQ only quantize weights. SmoothQuant (Xiao et al., November 2022) tackled the harder problem of quantizing activations too, enabling W8A8 (8-bit weights and 8-bit activations) inference.

The problem with quantizing activations is that they contain outliers. A few channels in the activation tensors have values that are 10-100x larger than the rest. If you quantize to INT8 (which has a range of -128 to 127), these outliers either get clipped (destroying information) or force the entire quantization range to be so wide that the normal values lose all precision.

SmoothQuant's insight: migrate the difficulty from activations to weights. It applies a per-channel scaling factor that divides the activation outliers by a constant and multiplies the corresponding weights by the same constant.
This is a mathematically equivalent transformation (the model computes the same result), but after applying it, the activations are much smoother and easier to quantize.

**Results:** Up to 1.56x inference speedup and 2x memory reduction on models like OPT-175B and BLOOM-176B with negligible accuracy loss. Because both weights and activations are in INT8, the actual matrix multiplication runs on INT8 tensor cores, which are faster than FP16 tensor cores.

**Why it's different from GPTQ/AWQ:** The speedup comes from faster math, not just less memory to read. This matters most for compute-bound workloads (large batch sizes, prefill).

## FP8: The New Standard

FP8 quantization (8-bit floating point) emerged in 2023-2024, enabled by hardware support on NVIDIA's Hopper (H100) and Ada Lovelace GPUs.

Unlike INT8 (which has a fixed range and uniform spacing between values), FP8 is a floating-point format with an exponent and mantissa, giving it a wider dynamic range. This makes it much easier to apply to both weights and activations without the outlier problems that SmoothQuant had to work around.

There are two FP8 formats: E4M3 (4 exponent bits, 3 mantissa bits, better precision) and E5M2 (5 exponent bits, 2 mantissa bits, wider range). Typically E4M3 is used for weights and forward-pass activations, while E5M2 is used for gradients during training.

**Results:** ~33% improvement in tokens/s and 8.5% lower TTFT compared to FP16 on H100s. FlashAttention-3 integrates FP8 support, reaching close to 1.2 PFLOP/s for attention computation on H100.

**Why it's winning:** FP8 is simpler to apply than INT8 quantization (fewer calibration issues), has native hardware support on modern GPUs, and the quality loss is minimal. It's rapidly becoming the default precision for inference on H100s.
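The precision-versus-range tradeoff between the two FP8 formats can be worked out from the bit layout. The sketch below assumes the encoding conventions of the commonly used FP8 definitions: E5M2 reserves its top exponent code for infinities and NaN like IEEE formats, while E4M3 gives up only a single NaN bit pattern. The function names are made up for this example.

```python
def fp8_max_finite(exp_bits: int, man_bits: int, ieee_specials: bool) -> float:
    """Largest finite value representable in an FP8-style format."""
    bias = 2 ** (exp_bits - 1) - 1
    top_exp_field = 2 ** exp_bits - 1
    if ieee_specials:
        # E5M2-style: the all-ones exponent is reserved for inf/NaN.
        exp_field = top_exp_field - 1
        mantissa = 2 ** man_bits - 1
    else:
        # E4M3-style: only all-ones exponent WITH all-ones mantissa is NaN.
        exp_field = top_exp_field
        mantissa = 2 ** man_bits - 2
    return 2.0 ** (exp_field - bias) * (1 + mantissa / 2 ** man_bits)


def relative_step(man_bits: int) -> float:
    """Spacing between neighbouring values, relative to a power of two."""
    return 2.0 ** -man_bits


e4m3_max = fp8_max_finite(4, 3, ieee_specials=False)
e5m2_max = fp8_max_finite(5, 2, ieee_specials=True)

print(f"E4M3: max {e4m3_max}, relative step {relative_step(3)}")
print(f"E5M2: max {e5m2_max}, relative step {relative_step(2)}")
```

E4M3 tops out at 448 with finer steps, while E5M2 reaches 57344 with coarser ones, which is the tradeoff behind using E4M3 for weights and activations and E5M2 for gradients.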
## When to Use What

| Method | Precision | Type | Best For | Quality Impact |
|---|---|---|---|---|
| GPTQ | 4-bit | Weight-only | Fitting large models on small GPUs | Low |
| AWQ | 4-bit | Weight-only | Production serving, best 4-bit quality | Very low |
| SmoothQuant | W8A8 | Weight + activation | Compute-bound workloads, large batches | Very low |
| FP8 | 8-bit | Weight + activation | H100/H200 inference, general purpose | Minimal |

For most production deployments on modern hardware: use FP8 if you have Hopper GPUs, AWQ if you need 4-bit to fit the model in memory.

## How ASICs Change the Equation

Quantization techniques were developed primarily to work around GPU limitations: limited memory capacity, limited memory bandwidth, and the desire to use specialized low-precision tensor cores. Each technique is a software solution to make models fit and run faster on hardware that wasn't designed specifically for inference.

General Compute runs on inference-optimized ASICs that handle precision and memory differently at the hardware level. Our chips are designed from the ground up for the data types and access patterns that transformer inference uses, with native support for the precision formats that matter most for serving. We don't need to choose between "fits in memory" and "runs fast" because the hardware was designed with both in mind.

The result is that we can serve models at full quality and speed without the tradeoffs that GPU-based providers have to make around quantization. [Sign up at generalcompute.com](https://generalcompute.com) and get $200 in free credit to try it out.
## Papers and References - [GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers](https://arxiv.org/abs/2210.17323) (Frantar et al., 2022 -- ICLR 2023) - [AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://arxiv.org/abs/2306.00978) (Lin et al., 2023 -- MLSys 2024 Best Paper) - [SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models](https://arxiv.org/abs/2211.10438) (Xiao et al., 2022) - [An Investigation of FP8 Across Accelerators for LLM Inference](https://arxiv.org/abs/2502.01070) (2025) --- # Multi-Query and Grouped-Query Attention: Shrinking the KV Cache > MQA and GQA reduce the memory footprint of attention by sharing key-value heads across queries. A simple architectural change that makes inference dramatically faster. - Author: General Compute - Published: 2026-03-25 - Tags: inference, papers, deep-dive - Canonical: https://www.generalcompute.com/blog/multi-query-grouped-query-attention --- Standard multi-head attention (MHA) gives each attention head its own set of keys and values. For a model with 32 heads, that means 32 separate key tensors and 32 separate value tensors stored in the KV cache (the memory that holds the model's "working memory" of the conversation). During generation, every single one of those tensors needs to be read from memory for every token produced. This is a lot of memory traffic for information that's often highly redundant across heads. Two papers showed that you can share key-value heads across multiple query heads with minimal quality loss, dramatically reducing the KV cache size and speeding up inference. ## The Memory Bandwidth Problem in Decoding During autoregressive decoding (generating one token at a time), the model needs to read the entire KV cache for all previous tokens at each step. For a model like Llama 2 70B with 64 attention heads and a 4096-token context, the KV cache is roughly 2.5GB. 
Every single token generation step requires reading all of that from GPU high-bandwidth memory (HBM). With standard multi-head attention, the KV cache size scales as: num_layers x num_heads x 2 (one for K, one for V) x sequence_length x head_dimension x bytes_per_element. The num_heads term is the target of MQA and GQA. If you could reduce the number of KV heads without hurting model quality, you'd directly reduce memory bandwidth requirements during decoding, which is the primary bottleneck for inference speed. ## Multi-Query Attention (MQA) Multi-Query Attention was proposed by Noam Shazeer (one of the original Transformer paper co-authors) back in 2019. The idea is radical in its simplicity: instead of giving each attention head its own keys and values, use a single key head and a single value head shared across all query heads. Each query head still computes its own unique attention pattern (so the model can still attend to different things from different perspectives), but they all attend over the same set of keys and values. The impact on the KV cache is dramatic. For a model with 64 heads, MQA reduces the KV cache size by 64x. That's the difference between a 2.5GB KV cache and a ~40MB one. In practice, the quality impact is small but measurable. MQA was adopted by several major models including PaLM (Google's large language model) and Falcon. The tradeoff was considered worthwhile because the inference speedup is enormous, especially for long sequences where the KV cache dominates memory usage. **Paper:** "Fast Transformer Decoding: One Write-Head is All You Need" (Shazeer, 2019) ## Grouped-Query Attention (GQA) Grouped-Query Attention (Ainslie et al., Google, May 2023) is the middle ground between standard multi-head attention and multi-query attention. Instead of one KV head shared by all queries (MQA) or one KV head per query (MHA), GQA uses an intermediate number of KV head groups. 
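To make the scaling concrete, the cache-size formula from the memory bandwidth section can be evaluated for all three attention variants. This is a minimal sketch: the function name `kv_cache_bytes` and the example configuration (32 layers, 32 query heads, head dimension 128, FP16, 4096 tokens) are illustrative assumptions, not the specification of any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, seq_len, head_dim, bytes_per_element=2):
    """KV cache for one sequence: layers x KV heads x 2 (K and V) x tokens x head_dim x bytes."""
    return num_layers * num_kv_heads * 2 * seq_len * head_dim * bytes_per_element


# Hypothetical model: 32 layers, 32 query heads, head_dim 128, FP16, 4096-token context.
config = dict(num_layers=32, seq_len=4096, head_dim=128, bytes_per_element=2)

mha = kv_cache_bytes(num_kv_heads=32, **config)  # one KV head per query head
gqa = kv_cache_bytes(num_kv_heads=8, **config)   # 8 KV groups of 4 query heads each
mqa = kv_cache_bytes(num_kv_heads=1, **config)   # a single KV head shared by all queries

for name, size in [("MHA", mha), ("GQA-8", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB per sequence (MHA is {mha // size}x this size)")
```

Only `num_kv_heads` changes between the three calls, which is why the savings are exactly the ratio of query heads to KV heads.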
For example, a model with 32 query heads might use 8 KV head groups, so each group of 4 query heads shares one set of keys and values. This gives you a 4x reduction in KV cache (compared to MHA) while keeping quality closer to the full multi-head version.

The paper also showed something practically useful: you can take an existing model that was trained with multi-head attention and "uptrain" it (continue training for a short period) to use grouped-query attention, using only about 5% of the original pre-training compute. You don't need to train from scratch.

**Results:** GQA-8 (8 KV groups) achieves quality close to full MHA while running at speeds close to MQA. The authors showed this on models uptrained from T5 checkpoints, the largest being T5-XXL (roughly 11B parameters).

**Adoption:** GQA was quickly adopted by the industry. Llama 2 70B, Llama 3 (all sizes), Mistral, and most modern open-source models use GQA. It's become the default attention configuration for new models because the quality-speed tradeoff is so favorable.

**Paper:** "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" (Ainslie et al., 2023)

## How MQA and GQA Affect Inference

The practical impact on inference systems:

**Smaller KV cache = more concurrent requests.** With GQA-8 instead of MHA-32, the KV cache is 4x smaller per request. On a GPU with a fixed memory budget for the cache, this means you can serve roughly 4x more concurrent users (or handle roughly 4x longer contexts) before running out of memory. This interacts directly with PagedAttention (covered in our [vLLM post](/blog/paged-attention-and-vllm)), since the pages are smaller.

**Faster decoding.** Each decode step reads less data from memory. For memory-bandwidth-bound workloads (which is essentially all autoregressive decoding), less data to read means faster generation. For long sequences, where KV cache reads dominate each step, the speedup is roughly proportional to the reduction in KV heads.

**Longer contexts become practical.** At 128K tokens, the KV cache for a 70B model with full MHA would be enormous.
GQA makes long-context inference feasible by keeping the cache manageable. **Works with everything else.** MQA and GQA are architectural choices made during model design, and they're fully compatible with all the other inference optimizations: FlashAttention, speculative decoding, continuous batching, quantization, and so on. The benefits stack. ## The Design Space It's worth noting where MQA/GQA sit in the broader design space of "how do we make the KV cache smaller": - **MQA/GQA** reduce the KV cache by sharing heads at the architectural level. Requires the model to be trained (or uptrained) with the configuration. - **KV cache quantization** reduces the cache by storing values in lower precision (FP8 or INT4 instead of FP16). Can be applied post-training. - **KV cache eviction** (H2O, StreamingLLM) reduces the cache by dropping old or unimportant tokens. Applied at serving time. - **Multi-head Latent Attention (MLA)** from DeepSeek compresses the KV cache into a low-dimensional latent vector. Requires architectural changes during training. These approaches are complementary. A model using GQA can also have its KV cache quantized to FP8 and use eviction policies for very long contexts. The reductions multiply together. ## Why This Matters on ASICs MQA and GQA were designed to reduce memory bandwidth pressure on GPUs, where reading the KV cache from HBM is the primary bottleneck during decoding. On inference-optimized ASICs, the memory architecture is fundamentally different, with much higher bandwidth relative to compute and memory layouts designed specifically for the access patterns that attention uses. General Compute is the only neocloud running entirely on inference-optimized ASICs rather than NVIDIA GPUs. The KV cache efficiency improvements from GQA still help on our hardware (smaller caches are always better), but the baseline memory bandwidth is so much higher that the gap between GQA and full MHA is smaller than it would be on GPUs. 
Our infrastructure is fast with either configuration, while GPU-based providers rely heavily on GQA to make decoding workable. [Sign up at generalcompute.com](https://generalcompute.com) and get $200 in free credit to try it out. ## Papers and References - [Fast Transformer Decoding: One Write-Head is All You Need](https://arxiv.org/abs/1911.02150) (Shazeer, 2019) - [GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints](https://arxiv.org/abs/2305.13245) (Ainslie et al., 2023 -- EMNLP 2023) --- # Continuous Batching: The Orca Paper That Changed LLM Serving > Before continuous batching, LLM servers wasted GPU cycles waiting for the slowest request in each batch. Orca's iteration-level scheduling fixed this with a 36x throughput improvement. - Author: General Compute - Published: 2026-03-24 - Tags: inference, papers, deep-dive - Canonical: https://www.generalcompute.com/blog/continuous-batching-the-orca-paper --- Before the Orca paper, LLM serving used static batching. You'd collect a group of requests, process them together, and wait until every request in the batch was done before starting the next batch. If one request generated 500 tokens and another generated 10, the short request would sit idle in GPU memory while the long one finished. This is massively wasteful. The Orca paper introduced continuous batching (also called iteration-level scheduling), where the server makes scheduling decisions at every single token generation step instead of at the batch level. Finished requests leave immediately and new requests join in their place, keeping the GPU busy at all times. The result was a 36.9x throughput improvement over NVIDIA's FasterTransformer on GPT-3 175B at the same latency target. ## The Problem With Static Batching In static batching, a batch of requests is treated as a single unit. All requests start together and the batch completes when the last request finishes. 
This creates two problems: **Head-of-line blocking.** Short requests are held hostage by long ones. A request that needs 10 tokens waits for a request that needs 500 tokens, occupying GPU memory the entire time. The short request's latency is determined by the longest request in its batch, not by its own workload. **Low GPU utilization.** As requests in a batch finish at different times, the batch gets progressively emptier. The GPU is doing work for fewer and fewer requests but still can't accept new ones until the batch completes. Utilization drops steadily over the life of each batch. For interactive applications where response length varies widely (which is basically all LLM use cases), static batching wastes the majority of available compute. ## How Continuous Batching Works Orca's key innovation is iteration-level scheduling. Instead of scheduling at the batch level, the scheduler operates at the granularity of individual token generation steps (iterations). At each iteration: 1. Generate one token for every active request in the current batch. 2. Check if any requests have finished (hit their stop token or max length). 3. Remove finished requests from the batch. 4. If there's room (memory available for KV cache), add waiting requests from the queue. 5. Repeat. This means the batch composition changes at every single step. A request might join the batch at step 47 and leave at step 82, while other requests continue around it. The paper also introduced "selective batching," which recognizes that not all operations in a transformer benefit equally from batching. Attention, for instance, has per-request KV caches that can't easily be batched across requests, while the feed-forward layers (the dense matrix multiplications) batch well. Orca applies batching selectively to the operations where it helps. ## Why the Improvement Is So Large The 36.9x throughput number sounds extreme, but it makes sense when you consider what static batching leaves on the table. 
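A toy simulation shows where the waste comes from. The sketch below counts decode iterations under both policies, assuming every iteration costs the same no matter how many slots are filled and that all requests are already waiting in the queue. The function names and the request lengths are made up for illustration and are not taken from the Orca paper.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """Iteration-level scheduling: refill free slots before every step."""
    queue = deque(output_lens)
    active = []  # tokens still to generate for each running request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # One iteration generates one token for every active request.
        active = [remaining - 1 for remaining in active]
        # Finished requests leave the batch immediately.
        active = [remaining for remaining in active if remaining > 0]
        steps += 1
    return steps


# Hypothetical workload: output lengths in tokens, a mix of short and long requests.
lengths = [500, 10, 20, 15, 480, 12, 30, 25]
total_tokens = sum(lengths)
for name, fn in [("static", static_batching_steps), ("continuous", continuous_batching_steps)]:
    steps = fn(lengths, batch_size=4)
    print(f"{name}: {steps} iterations, {total_tokens / steps:.2f} tokens per iteration")
```

On this small workload the static scheduler needs 980 iterations and the continuous one 500. The gap in a real system depends on the length distribution, queue depth, and memory limits, none of which this sketch models.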
With static batching, the effective batch size (number of requests actually doing useful work) starts high and declines as requests finish. On average, the GPU is underutilized for most of the batch's lifetime. With continuous batching, the effective batch size stays near the maximum at all times. As soon as one request finishes, another takes its place. The GPU is always working at full capacity. This is especially impactful for LLM workloads where output lengths vary dramatically. A chatbot might generate anywhere from 5 to 500 tokens per response. Static batching plans for the worst case. Continuous batching adapts continuously. ## The Broader Impact Continuous batching is now a standard feature in every modern LLM serving system. vLLM, TensorRT-LLM, SGLang, and every major inference provider implements some version of it. It's considered table stakes for production serving. The Orca paper also established the paradigm of thinking about LLM serving as a scheduling problem rather than just a compute problem. This opened the door for subsequent work on: - **Preemptive scheduling** (pausing low-priority requests to serve high-priority ones) - **Prefill-decode disaggregation** (running the prompt-processing phase and token-generation phase on separate hardware, since they have different scheduling characteristics) - **Priority queues and SLO-aware scheduling** (guaranteeing latency targets for different request classes) ## How This Applies to ASIC-Based Inference Continuous batching was designed to maximize GPU utilization by eliminating idle cycles. On inference-optimized ASICs, the scheduling problem looks different because the hardware is already designed to minimize idle time for inference workloads. General Compute runs entirely on inference-optimized ASICs instead of NVIDIA GPUs. 
We implement our own scheduling optimizations, including disaggregated inference (separating prefill and decode onto dedicated hardware), on top of ASICs that are architecturally suited for high-utilization serving. The combination of hardware that wastes fewer cycles by design and software that keeps that hardware maximally busy is a big part of why we deliver lower latency and higher throughput than GPU-based providers. [Sign up at generalcompute.com](https://generalcompute.com) and get $200 in free credit to try it out. ## Papers and References - [Orca: A Distributed Serving System for Transformer-Based Generative Models](https://www.usenix.org/conference/osdi22/presentation/yu) (Yu et al., 2022 -- OSDI 2022) --- # Medusa, EAGLE, and Sequoia: The Next Generation of Speculative Decoding > The original speculative decoding papers needed a separate draft model. Medusa, EAGLE, and Sequoia found ways to speculate faster, smarter, and without the extra model. - Author: General Compute - Published: 2026-03-24 - Tags: inference, papers, deep-dive - Canonical: https://www.generalcompute.com/blog/medusa-eagle-sequoia-next-gen-speculative-decoding --- The original speculative decoding papers (covered in our [previous post](/blog/speculative-decoding-3x-speedups-without-changing-the-model)) showed that you could get 2-3x speedups by using a small draft model to guess tokens ahead, then verifying them in bulk. But they had practical limitations: you needed to find, deploy, and serve a separate draft model alongside your target model, and the speedup was capped by how well that draft model matched the target's predictions. In 2024, three papers pushed speculative decoding significantly further. Medusa added extra prediction heads directly to the target model. EAGLE found that predicting in feature space (the model's internal representations) is easier than predicting tokens. 
And Sequoia figured out the optimal tree structure for verifying multiple candidate continuations at once. ## Medusa: No Draft Model Needed Medusa (Cai et al., January 2024) takes a different approach to speculation. Instead of running a separate draft model, it adds multiple lightweight "heads" on top of the target model itself. Each head predicts a token at a different future position: head 1 predicts the token at position t+1, head 2 predicts t+2, head 3 predicts t+3, and so on. These heads are small (they add less than 2% to the model's total parameter count) and can be trained on a relatively small amount of data. Since they sit on top of the target model and share its internal representations (the rich understanding the model has built up through all its layers), they have much better information to work with than a separate small model would. The clever part is how verification works. Medusa doesn't just check one linear sequence of candidates. It constructs a tree of possible continuations (for example, if head 1 predicts tokens A or B, and head 2 predicts tokens C or D, you get a tree with branches AC, AD, BC, BD) and uses tree-structured attention to verify multiple branches in a single forward pass. **Results:** 2.2-3.6x speedup on various models without needing a separate draft model at all. Medusa-1 only trains the extra heads (leaving the base model frozen), while Medusa-2 jointly fine-tunes the heads and the base model for even higher acceptance rates. **Tradeoff:** You need to train the Medusa heads for each model you want to serve, which adds a preparation step that vanilla speculative decoding doesn't require. ## EAGLE: Predicting Features Instead of Tokens EAGLE (Li et al., January 2024) started from a simple observation: predicting what token comes next is hard (that's the whole reason we need large language models in the first place). 
But predicting what the model's internal features (its hidden state vectors, the numerical representations it builds as it processes text) will look like at the next position is much easier, because features change more smoothly and predictably than the discrete token distribution. EAGLE trains a lightweight autoregressive head that operates on the target model's second-to-top-layer features. Given the current feature vector, it predicts the next feature vector, which is then projected to a token distribution for verification. The trick is that it also uses the token embedding from one step ahead as additional input, which resolves a lot of the uncertainty. Like Medusa, EAGLE uses tree-structured verification to check multiple candidates in one forward pass. **Results:** EAGLE achieves 2.7-3.5x latency speedup on Llama 2 Chat 70B with a provable guarantee that the output distribution is identical to standard decoding. This makes it faster than Medusa while maintaining the lossless property. **EAGLE-2** (June 2024) improved on this by making the draft tree structure context-dependent. Instead of using a fixed tree shape for every input, EAGLE-2 dynamically constructs the tree based on the confidence of each prediction, allocating more branches where the model is uncertain and fewer where it's confident. This increased the average number of accepted tokens per step without any additional training. **EAGLE-3** (March 2025) went further by abandoning feature prediction entirely in favor of direct token prediction, combined with multi-layer feature fusion. Earlier EAGLE versions hit diminishing returns when trained on more data. EAGLE-3's architecture scales better, continuing to improve with more training examples. ## Sequoia: Hardware-Aware Optimal Trees Sequoia (Chen et al., February 2024) approached the problem from a systems perspective. Both Medusa and EAGLE use tree-structured verification, but how do you pick the best tree shape? 
Sequoia uses dynamic programming (an algorithmic technique for finding optimal solutions by breaking problems into subproblems) to find the tree topology (number of branches, depth at each level) that maximizes the expected number of accepted tokens, given the draft model's token probabilities. The optimal tree shape depends on the draft model's accuracy and the available compute budget.

Critically, Sequoia also makes the tree structure hardware-aware. The optimal tree for an A100 (high memory bandwidth, moderate batch capacity) is different from the optimal tree for an L40 (less bandwidth, different compute characteristics) or a CPU-offloaded setup. Sequoia's optimizer automatically adapts to the target hardware.

**Results:** Up to 4.04x speedup on an A100 for Llama 2 7B. And for offloaded inference (where the model partially lives in CPU memory or NVMe storage because it doesn't fit entirely in GPU memory), Sequoia achieves up to 9.96x speedup, bringing Llama 2 70B to 0.56 seconds per token on an L40 GPU that couldn't practically serve the model otherwise.

Sequoia also introduced a novel sampling and verification method that works well at higher temperatures (where the model's output is more random and creative). This was a weakness of earlier speculative decoding methods, which tended to see lower acceptance rates with high-temperature sampling.

## How They Compare

| Method | Reported Speedup | Needs Draft Model? | Lossless? | Extra Training? |
|---|---|---|---|---|
| Vanilla Speculative Decoding | 2-3x | Yes (separate model) | Yes | No |
| Medusa | 2.2-3.6x | No (heads on target) | Yes | Yes (heads) |
| EAGLE | 2.7-3.5x | No (feature predictor) | Yes | Yes (predictor) |
| EAGLE-2 | Better than EAGLE | No | Yes | Same as EAGLE |
| Sequoia | Up to 4x (9.96x offloaded) | Yes | Yes | No |

The general trend: each new method finds a smarter way to speculate. Medusa eliminated the separate draft model. EAGLE made predictions more accurate by working in feature space.
EAGLE-2 made the verification tree adaptive. Sequoia optimized the tree shape for specific hardware. ## Prompt Lookup Decoding: The Zero-Overhead Approach Worth mentioning alongside these methods: prompt lookup decoding (Apoorv Saxena, November 2023) is the simplest form of speculation. It doesn't use a model at all. Instead, it looks for n-gram matches (repeating sequences of tokens) between the input prompt and recently generated text. When it finds a match, it uses the tokens that followed that pattern in the prompt as draft candidates. This is surprisingly effective for tasks where the output is likely to repeat parts of the input: summarization, question answering with context, code editing, and structured output. It achieves 2-4x speedup on these tasks with literally zero model overhead. It's now built into HuggingFace Transformers and vLLM. ## Why ASICs Compound These Gains All of these techniques share a common foundation: they get more useful tokens out of each target model forward pass. On GPUs, each forward pass is memory-bandwidth-bound (the GPU spends most of its time waiting to read model weights from memory), so you're fundamentally limited by how fast the memory bus can deliver data. General Compute is the only neocloud built entirely on inference-optimized ASICs instead of NVIDIA GPUs. On these chips, the memory bandwidth equation is fundamentally different. The baseline forward-pass latency is already much lower, and when speculative decoding techniques generate multiple tokens per pass, each of those "free" tokens arrives faster. A technique that gives you 3x more tokens per forward pass on a GPU with 70ms per pass saves you 140ms. The same technique on an ASIC with 20ms per pass still saves you 40ms, but your absolute latency is dramatically lower. The gains from speculative decoding and the gains from inference-optimized hardware multiply together. 
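The arithmetic behind that claim can be written out. This sketch reuses the illustrative 70 ms and 20 ms pass latencies from the paragraph above and assumes, for simplicity, that speculation yields exactly 3 tokens per target pass with no draft overhead. The helper name is hypothetical.

```python
def ms_per_token(pass_latency_ms, tokens_per_pass=1):
    """Average latency per generated token when each pass yields several tokens."""
    return pass_latency_ms / tokens_per_pass


gpu_plain = ms_per_token(70)      # one token per 70 ms pass
asic_plain = ms_per_token(20)     # one token per 20 ms pass
asic_spec = ms_per_token(20, 3)   # three tokens per 20 ms pass

hardware_gain = gpu_plain / asic_plain      # from faster forward passes
speculation_gain = asic_plain / asic_spec   # from more tokens per pass
combined_gain = gpu_plain / asic_spec       # both effects together

print(f"combined {combined_gain:.1f}x = {hardware_gain:.1f}x * {speculation_gain:.1f}x")
```

Under these assumptions the combined gain is the product of the two individual gains, which is the sense in which the improvements multiply.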
[Sign up at generalcompute.com](https://generalcompute.com) and get $200 in free credit to see what compounded inference optimization feels like. ## Papers and References - [Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads](https://arxiv.org/abs/2401.10774) (Cai et al., 2024 -- ICML 2024) - [EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty](https://arxiv.org/abs/2401.15077) (Li et al., 2024 -- ICML 2024) - [EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees](https://arxiv.org/abs/2406.16858) (Li et al., 2024 -- EMNLP 2024) - [EAGLE-3: Scaling up Inference Acceleration via Training-Time Test](https://arxiv.org/abs/2503.01840) (Li et al., 2025) - [Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding](https://arxiv.org/abs/2402.12374) (Chen et al., 2024 -- NeurIPS 2024) - [Prompt Lookup Decoding](https://github.com/apoorvumang/prompt-lookup-decoding) (Saxena, 2023) --- # SGLang and RadixAttention: Smarter KV Cache Reuse > SGLang's RadixAttention stores KV cache in a radix tree, enabling automatic prefix sharing across requests. The result is up to 5x higher throughput for multi-turn and structured workloads. - Author: General Compute - Published: 2026-03-24 - Tags: inference, papers, deep-dive - Canonical: https://www.generalcompute.com/blog/sglang-and-radix-attention --- When you send a request to an LLM API, the server computes the KV cache (the model's working memory) for your entire prompt from scratch. If your next request shares the same system prompt, the server computes that part again. If ten users have the same system prompt, it gets computed ten times. This is a huge amount of redundant work. In multi-turn conversations, RAG pipelines (where you retrieve documents and include them in the prompt), and few-shot prompting (where you include examples in every request), the majority of the prompt is identical across requests. 
Recomputing the KV cache for shared prefixes wastes both time and GPU compute. SGLang's RadixAttention solves this by storing KV cache in a radix tree data structure that automatically detects and reuses shared prefixes across requests. ## The Prefix Sharing Opportunity Consider a few common patterns: **Multi-turn chat.** Each message in a conversation shares the entire history of previous messages. Turn 5 of a conversation has the same prefix (turns 1-4) as any other request continuing that conversation. **System prompts.** Most API deployments use the same system prompt for every request. If your system prompt is 500 tokens, that's 500 tokens of redundant KV cache computation for every single request. **Few-shot prompting.** If you include 5 examples in every request, those examples are identical across all requests and could share the same KV cache. **RAG with common documents.** When multiple users ask questions about the same retrieved document, the document's KV cache could be computed once and shared. In all these cases, you're paying the full prefill cost (which is the compute-intensive phase of inference) for work that's already been done. ## How RadixAttention Works A radix tree (also called a Patricia trie) is a data structure that stores strings by their shared prefixes. If you insert "hello world" and "hello there", the tree stores "hello " once and branches at the point where the strings diverge. SGLang applies this to KV cache management. Each node in the radix tree stores a segment of KV cache corresponding to a sequence of tokens. When a new request arrives: 1. The server tokenizes the prompt and walks the radix tree, following matching token sequences. 2. At the point where the tree and the new request diverge, all the KV cache up to that point is reused. No recomputation needed. 3. Only the new, unmatched portion of the prompt goes through prefill. 4. 
After the request completes, the new KV cache segments are inserted into the tree for future reuse.

The tree uses LRU eviction (least recently used entries get dropped first) when GPU memory is full, so popular prefixes stay cached while rare ones are cleaned up automatically.

## Cache-Aware Scheduling

SGLang also introduces cache-aware scheduling, which reorders requests in the queue to maximize cache hit rates. If the server has a batch of waiting requests and some of them share prefixes with currently cached KV data, those requests get prioritized.

This sounds like a small optimization, but it matters a lot in practice. Without cache-aware scheduling, the server might process requests in FIFO order (first in, first out), evicting cached prefixes before other requests that could have used them arrive. With it, the server batches related requests together and keeps useful cache entries warm.

## Structured Language Model Programs

Beyond caching, SGLang also provides a programming model for structured LLM interactions. Instead of making individual API calls, you write programs that describe multi-step LLM workflows:

```python
from sglang import function, system, user, assistant, gen


@function
def multi_step_qa(s, question):
    # Each += appends to the prompt state; gen() marks where the model generates.
    s += system("You are a helpful assistant.")
    s += user(question)
    s += assistant(gen("answer", max_tokens=256))
    # The second turn extends the same state, so the first turn's prefix can be reused.
    s += user("Can you elaborate on that?")
    s += assistant(gen("elaboration", max_tokens=512))
```

The serving system sees the entire program structure and can optimize accordingly: pre-allocating cache for the expected conversation flow, scheduling both generation steps together, and reusing the cache from the first turn for the second.

## Results

SGLang achieves up to 5x higher throughput over baseline serving systems on workloads with prefix sharing opportunities.
The improvement is highest for: - Multi-turn conversations: 3-5x improvement (long shared prefixes) - Few-shot prompting: 2-4x improvement (identical example prefixes) - Tree-structured generation (like beam search): 2-3x improvement (shared prefix branches) Even for single-turn workloads without obvious prefix sharing, SGLang performs comparably to vLLM because the radix tree adds minimal overhead when there's nothing to cache. ## How This Fits in Our Stack Prefix caching is one of those optimizations that becomes more valuable as inference gets faster. When prefill is slow, saving a few hundred milliseconds of redundant computation is nice but not transformative. When prefill is already fast (as it is on inference-optimized ASICs), the savings from prefix caching represent a larger fraction of the total request time, and you can serve proportionally more requests with the freed-up compute. General Compute is the only neocloud built entirely on inference-optimized ASICs. We implement our own KV cache management and prefix sharing on top of hardware that's already fast at prefill. The combination means that requests with shared prefixes, which includes most production workloads, see compounding speed benefits. [Sign up at generalcompute.com](https://generalcompute.com) and get $200 in free credit to try it out. ## Papers and References - [SGLang: Efficient Execution of Structured Language Model Programs](https://arxiv.org/abs/2312.07104) (Zheng et al., 2024 -- NeurIPS 2024) - [SGLang Blog Post](https://lmsys.org/blog/2024-01-17-sglang/) (LMSYS, 2024) --- # Speculative Decoding: Getting 3x Speedups Without Changing the Model > Speculative decoding uses a small draft model to predict multiple tokens ahead, then verifies them all at once. The result is mathematically identical output, 2-3x faster. 
- Author: General Compute - Published: 2026-03-23 - Tags: inference, papers, deep-dive - Canonical: https://www.generalcompute.com/blog/speculative-decoding-3x-speedups-without-changing-the-model --- LLMs generate text one token at a time. Each token requires a full forward pass through the model, and each pass is bottlenecked by memory bandwidth (how fast you can read the model's weights from memory), not by compute (how fast you can do the math). This means the GPU sits mostly idle during generation, waiting on memory. Speculative decoding attacks this problem with a simple idea: use a small, fast model to guess multiple tokens ahead, then verify all those guesses in a single pass through the large model. When the guesses are right (and they often are), you get multiple tokens for the cost of one large-model pass. The best part: the output is mathematically identical to what the large model would have produced on its own. No quality tradeoff. ## Why Standard Decoding Is Slow To understand why speculative decoding helps, you need to understand why normal decoding is inefficient. During generation, each forward pass through a large model (say, 70 billion parameters) requires reading all those parameters from GPU memory. On an A100, reading 70B parameters in FP16 means moving about 140GB of data through a memory bus that tops out at around 2TB/s. That's roughly 70ms just for the memory transfer, regardless of how fast the math is. The actual matrix multiplications for a single token use only a small fraction of the GPU's compute capacity. The arithmetic intensity (ratio of compute to memory access) is very low during decoding. The GPU's tensor cores are mostly idle, waiting for data to arrive from memory. 
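The 70 ms figure above is simply bytes moved divided by bandwidth. A back-of-the-envelope sketch using the same round numbers (70B parameters, 2 bytes each in FP16, about 2 TB/s); the function name is made up for illustration, and the estimate ignores KV cache reads and any overlap of transfer with compute:

```python
def weight_read_time_ms(num_params, bytes_per_param, bandwidth_bytes_per_s):
    """Lower bound on one decode step: time to stream every weight from memory once."""
    return num_params * bytes_per_param / bandwidth_bytes_per_s * 1000


step_ms = weight_read_time_ms(num_params=70e9, bytes_per_param=2, bandwidth_bytes_per_s=2e12)
print(f"{step_ms:.0f} ms per decode step, at most {1000 / step_ms:.1f} tokens/s per sequence")
```

Halving `bytes_per_param` (for example, 8-bit weights) or doubling the bandwidth halves the bound, which is why both quantization and memory architecture show up so often in inference optimization.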
This means that processing one token and processing several tokens in parallel costs almost the same wall-clock time, because the bottleneck is reading the model weights, and you read those weights regardless of how many tokens you're processing (this is why the prefill phase, where you process the entire input prompt in parallel, is much more efficient per token than decoding). Speculative decoding exploits exactly this property. ## How It Works The algorithm uses two models: a small, fast "draft" model and the full-size "target" model you actually want to serve. **Step 1: Draft.** The small model (something like a 1-2B parameter model from the same family) generates K candidate tokens autoregressively. Because the draft model is tiny, this is very fast, maybe 5-10ms for K=5 tokens. **Step 2: Verify.** Feed all K draft tokens into the target model in a single forward pass. The target model processes them in parallel (like a mini-prefill), producing probability distributions for each position. This single pass costs about the same as generating one token normally. **Step 3: Accept or reject.** For each draft token, compare the draft model's probability with the target model's probability using a specific acceptance criterion: - Accept the token with probability min(1, p_target(token) / p_draft(token)) - If rejected, resample from a corrected distribution: normalize(max(0, p_target - p_draft)) **Step 4: Return.** All accepted tokens plus one new token (either the resampled replacement or the next token after all accepted) become the output for this step. The acceptance/rejection scheme is the mathematical core. It guarantees that the final output distribution is exactly equal to sampling from the target model alone. This isn't an approximation or a heuristic. It's a provable guarantee. You get identical quality with fewer target model forward passes. ## How Much Faster Is It? The speedup depends on how well the draft model matches the target model's distribution. 
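How much the match matters can be read straight off the acceptance rule in Step 3. The sketch below implements that rule for a single position over a toy four-token vocabulary, and also computes the resulting output distribution analytically. The function names and the example probabilities are illustrative assumptions, not code from either paper or from any serving framework.

```python
import random


def residual_distribution(p_target, p_draft):
    """normalize(max(0, p_target - p_draft)): the distribution used after a rejection."""
    diff = [max(0.0, p - q) for p, q in zip(p_target, p_draft)]
    total = sum(diff)
    # Identical distributions never reject; fall back to p_target to avoid dividing by zero.
    return [d / total for d in diff] if total > 0 else list(p_target)


def speculative_step(p_target, p_draft, rng=random):
    """Sample one draft token, then accept it or resample. Returns (token, accepted)."""
    vocab = range(len(p_draft))
    token = rng.choices(vocab, weights=p_draft)[0]
    if rng.random() < min(1.0, p_target[token] / p_draft[token]):
        return token, True
    residual = residual_distribution(p_target, p_draft)
    return rng.choices(vocab, weights=residual)[0], False


def acceptance_rate(p_target, p_draft):
    """Probability that a draft token is accepted: the overlap sum of min(p, q)."""
    return sum(min(p, q) for p, q in zip(p_target, p_draft))


def output_distribution(p_target, p_draft):
    """Exact distribution of the token emitted by speculative_step."""
    reject_prob = 1.0 - acceptance_rate(p_target, p_draft)
    residual = residual_distribution(p_target, p_draft)
    return [min(p, q) + reject_prob * r for p, q, r in zip(p_target, p_draft, residual)]


p_target = [0.5, 0.3, 0.15, 0.05]  # hypothetical target-model probabilities
p_draft = [0.6, 0.2, 0.1, 0.1]     # hypothetical draft-model probabilities

print("acceptance rate:", round(acceptance_rate(p_target, p_draft), 3))
print("output distribution:", [round(x, 3) for x in output_distribution(p_target, p_draft)])
```

The emitted distribution equals `p_target`, which is the losslessness guarantee in miniature, and the acceptance rate is just the overlap between the two distributions. A draft model whose probabilities sit closer to the target's has a larger overlap and therefore more accepted tokens per pass.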
When the draft model predicts the same tokens the target model would have chosen (which happens frequently for common patterns, boilerplate code, and predictable text), most tokens get accepted and you get close to K+1 tokens per target model pass. In practice, typical acceptance rates are 70-85% for well-matched draft/target pairs (like using Llama 3 8B to draft for Llama 3 70B). This translates to 2-3x wall-clock speedups on generation. The speedup formula is roughly: (average_tokens_accepted + 1) / (K * cost_draft/cost_target + 1). Since the draft model is 10-50x smaller, the cost_draft/cost_target ratio is very small, so the denominator stays close to 1. ## The Two Original Papers Speculative decoding was independently discovered by two teams at almost the same time: **Leviathan et al. (Google, November 2022)** published "Fast Inference from Transformers via Speculative Decoding" and demonstrated the technique on T5-XXL, showing 2-3x acceleration with no quality degradation. They formally proved the output distribution equivalence. **Chen et al. (DeepMind, February 2023)** published "Accelerating Large Language Model Decoding with Speculative Sampling" and validated the approach on Chinchilla 70B in distributed settings, showing 2-2.5x speedups. They called their version "speculative sampling" and provided a slightly different but equivalent mathematical formulation. Both papers arrived at the same core idea independently, which usually means the idea is fundamental. And it has proven to be exactly that. Speculative decoding is now supported in every major serving framework (vLLM, TensorRT-LLM, SGLang) and used by most inference providers. ## Where Speculative Decoding Shines The technique works best when: - **The draft model is a good predictor of the target.** Models from the same family work well (Llama 8B drafting for Llama 70B). The more the distributions align, the higher the acceptance rate. 
- **The output is somewhat predictable.** Code generation, structured output (JSON), and formulaic text have high acceptance rates. Creative, high-temperature generation has lower rates. - **You care about latency, not just throughput.** Speculative decoding helps individual request latency. Under very high load, the extra compute for the draft model can actually reduce overall throughput. It's a latency optimization, not a throughput optimization. - **The model is large enough that decoding is memory-bound.** For very small models (7B and under), decoding is already fast enough that the overhead of running a draft model doesn't pay off. ## The Hardware Angle Speculative decoding is, at its core, a workaround for the memory bandwidth bottleneck of GPU-based inference. The entire technique exists because reading 70B+ parameters from HBM is slow, and the GPU's compute capacity goes to waste during that read. General Compute runs entirely on inference-optimized ASICs instead of NVIDIA GPUs, and the memory bandwidth equation on these chips is fundamentally different. The bottleneck between memory and compute is much narrower for inference workloads, which means the baseline decoding speed is already closer to what speculative decoding tries to achieve on GPUs. And when we apply speculative decoding on top of that, the gains compound on an already-fast baseline. The result is inference speed that GPU-based systems can't match even with perfect speculative decoding implementations. [Sign up at generalcompute.com](https://generalcompute.com) and get $200 in free credit to see for yourself. 
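The accept/reject rule from Step 3 is short enough to sketch directly. The code below is a toy over a three-token vocabulary, not a serving implementation. The helper names (`verify_draft`, `sample`, `normalize`) and the example distributions are invented for the illustration. It assumes the full draft and target distributions are available at every position.

```python
import random


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def sample(dist, rng):
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def verify_draft(draft_tokens, p_draft, p_target, rng):
    """Apply the accept/reject rule to K draft tokens.

    p_draft[i] and p_target[i] are full distributions at draft position i;
    p_target carries one extra entry for the position after the last draft token.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        accept_prob = min(1.0, p_target[i][tok] / p_draft[i][tok])
        if rng.random() < accept_prob:
            out.append(tok)
            continue
        # Rejected: resample from normalize(max(0, p_target - p_draft)), then stop.
        residual = [max(0.0, t - d) for t, d in zip(p_target[i], p_draft[i])]
        out.append(sample(normalize(residual), rng))
        return out
    # Every draft token was accepted: take one more token from the target.
    out.append(sample(p_target[len(draft_tokens)], rng))
    return out


# Empirical check with K=1: the first emitted token should follow p_target.
rng = random.Random(0)
p_draft = [0.6, 0.3, 0.1]
p_target = [0.2, 0.5, 0.3]
trials = 20000
counts = [0, 0, 0]
for _ in range(trials):
    draft_tok = sample(p_draft, rng)
    first = verify_draft([draft_tok], [p_draft], [p_target, p_target], rng)[0]
    counts[first] += 1
freqs = [c / trials for c in counts]
print([round(f, 2) for f in freqs])
```

The printed frequencies should land close to `p_target` even though every proposal came from `p_draft`, which is the equivalence guarantee in miniature.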
## Papers and References

- [Fast Inference from Transformers via Speculative Decoding](https://arxiv.org/abs/2211.17192) (Leviathan et al., 2022 -- ICML 2023)
- [Accelerating Large Language Model Decoding with Speculative Sampling](https://arxiv.org/abs/2302.01318) (Chen et al., 2023)

---

# PagedAttention and vLLM: Virtual Memory for LLM Serving

> The PagedAttention paper solved the biggest memory waste problem in LLM serving by borrowing an idea from operating systems. Here's how it works and why vLLM became the default serving framework.

- Author: General Compute
- Published: 2026-03-22
- Tags: inference, papers, deep-dive
- Canonical: https://www.generalcompute.com/blog/paged-attention-and-vllm

---

Before PagedAttention, LLM serving systems wasted 60-80% of their GPU memory on empty space. The KV cache (the per-request memory that stores the model's "working memory" of the conversation so far) had to be allocated as a single contiguous block when a request came in. Since you don't know how long a response will be in advance, systems would allocate for the maximum possible length, leaving most of that memory unused.

The vLLM team at UC Berkeley looked at this problem and recognized it was the same problem that operating systems solved decades ago with virtual memory and paging.

## The KV Cache Problem

During autoregressive generation (where the model produces one token at a time), each new token needs to attend to (look back at) all previous tokens. The key and value tensors for those previous tokens are cached in GPU memory so they don't need to be recomputed every time. This stored state is called the KV cache, and it grows linearly with the length of the conversation.

For a model like Llama 2 13B with a maximum context of 4096 tokens, the FP16 KV cache costs about 0.8MB per token (2 tensors x 40 layers x 5120 hidden dimensions x 2 bytes), so a single request that reserves its full context needs around 3.2GB of GPU memory. On a 40GB A100 GPU, where the FP16 weights already take about 26GB, that leaves room for only about four concurrent requests if each one reserves its full maximum allocation.

The waste comes from pre-allocation.
If a request only generates 100 tokens, the other 3996 tokens worth of allocated KV cache memory sits empty. Across many concurrent requests, this internal fragmentation (allocated but unused memory) eats up the majority of available GPU memory. Before PagedAttention, the two options were: allocate conservatively and waste memory (limiting how many requests you can serve at once), or allocate tightly and risk running out of space mid-generation (causing requests to fail). ## How PagedAttention Works PagedAttention borrows directly from how operating systems manage virtual memory. If you've taken an OS class, the concept will feel familiar. Instead of allocating one big contiguous block per request, the KV cache is divided into fixed-size blocks called pages, typically holding the KV data for 16 tokens each. The key ideas: **Non-contiguous storage.** A request's KV cache doesn't need to be in a single contiguous chunk of memory. It's stored across pages that can be scattered anywhere in GPU memory, linked together by a page table (a lookup structure that maps logical positions to physical locations, just like how your operating system manages RAM). **Allocate on demand.** Pages are only allocated as new tokens are generated. A request that produces 100 tokens uses pages for those 100 tokens, not the maximum context length. No more over-allocation. **Memory sharing.** When multiple requests share the same prompt prefix (this is common when many users have the same system prompt), they can share the same physical KV cache pages. Only pages that diverge between requests need separate storage. This is similar to copy-on-write in operating systems. The results from the paper: memory waste dropped from 60-80% to under 4%. This directly translated to 2-4x higher serving throughput because you can fit many more concurrent requests in the same GPU memory. ## vLLM: The Serving System Built on PagedAttention The authors didn't just publish a paper. 
They built vLLM, an open-source serving engine with PagedAttention at its core. It quickly became the most widely used LLM serving framework in the industry. Beyond PagedAttention, vLLM includes: - **Continuous batching.** New requests can join an in-progress batch at any iteration, so the GPU never sits idle waiting for a slow request to finish (this technique originated in the Orca paper, which we cover in a separate post). - **Prefix caching.** Automatic detection and reuse of shared prompt prefixes across requests, so the model doesn't redo work it's already done. - **Speculative decoding.** Built-in support for using a smaller "draft" model to speed up generation from a larger model. - **Tensor parallelism.** Splitting a model across multiple GPUs so you can serve models that don't fit on a single card. - **Quantization support.** GPTQ, AWQ, FP8, and other formats that shrink model weights to use less memory and run faster. vLLM's adoption was rapid because it solved a practical problem that every LLM deployment was hitting. Before vLLM, teams were writing custom serving code or using NVIDIA's FasterTransformer (which predated many of these optimizations). vLLM made it possible to serve models at 2-4x higher throughput with the same hardware, just by being smarter about memory. ## The Broader Impact PagedAttention changed how people think about LLM serving infrastructure. The realization that memory management, not just compute, was the primary bottleneck opened up a wave of follow-on work. SGLang's RadixAttention took the prefix-sharing idea further with a radix tree data structure for more granular cache reuse. Disaggregated inference (running the prompt-processing phase and the token-generation phase on separate hardware) became practical partly because PagedAttention made memory management flexible enough to support it. 
And the core question of "how many requests can I serve at once" shifted from being a GPU compute question to a GPU memory management question. The paper also showed something important about inference optimization: sometimes the biggest wins come not from making the math faster, but from eliminating waste in how memory and resources are managed around the math. ## Why Custom Hardware Goes Further PagedAttention is a clever software solution to a real hardware limitation. GPUs allocate memory in a general-purpose way, and LLM serving workloads have unusual memory access patterns that don't map well to how GPUs were designed to work. The paging system adds overhead (page table lookups, non-contiguous memory access patterns) that wouldn't be necessary if the hardware understood the workload natively. At General Compute, we run entirely on inference-optimized ASICs instead of NVIDIA GPUs. These chips handle memory allocation and KV cache management at the hardware level. The memory fragmentation problem that PagedAttention solves in software is addressed architecturally. There's no page table overhead, no fragmentation, and no gap between allocated and used memory. This is one of the reasons we can serve more concurrent requests at lower latency than GPU-based systems running vLLM. If you want to see what LLM serving looks like without GPU memory constraints, [sign up at generalcompute.com](https://generalcompute.com) and get $200 in free credit to try it out. ## Papers and References - [Efficient Memory Management for Large Language Model Serving with PagedAttention](https://arxiv.org/abs/2309.06180) (Kwon et al., 2023 -- SOSP 2023) - [vLLM Blog Post](https://blog.vllm.ai/2023/06/20/vllm.html) (UC Berkeley, 2023) --- # FlashAttention: How Tri Dao Made Attention 4x Faster > FlashAttention rewrote the rules of transformer inference by treating attention as a memory problem, not a compute problem. Here's how it works and why it matters. 
- Author: General Compute - Published: 2026-03-21 - Tags: inference, papers, deep-dive - Canonical: https://www.generalcompute.com/blog/flash-attention-how-tri-dao-made-attention-4x-faster --- Before FlashAttention, the standard implementation of transformer attention was doing something fundamentally wasteful. It was computing a massive N x N attention matrix, writing it to GPU memory, reading it back, and then using it. For long sequences, this intermediate matrix was enormous, and the time spent moving it between GPU memory tiers dominated the actual math. Tri Dao's insight was that attention isn't a compute-bound problem. It's a memory-bandwidth problem. And by restructuring how the computation flows through the GPU's memory hierarchy, you can make it dramatically faster without approximating anything. ## The Problem With Standard Attention In a standard transformer, the attention mechanism computes Q * K^T to get an N x N matrix of attention scores (where N is the sequence length), applies softmax, and then multiplies by V. The naive implementation materializes that full N x N matrix in GPU high-bandwidth memory (HBM). The issue is that HBM reads and writes are slow relative to the actual arithmetic. GPU SRAM (on-chip memory) is fast but small (around 20MB on an A100, compared to 40-80GB of HBM). The standard attention implementation is constantly shuttling data between SRAM and HBM, and most of the wall-clock time is spent on those memory transfers, not on the matrix multiplications themselves. For a sequence length of 4096 with a hidden dimension of 128, the attention score matrix alone is 4096 x 4096 x 2 bytes = 32MB per head. At longer sequences this gets worse quadratically. The GPU spends more time moving data than doing math. ## How FlashAttention Works FlashAttention restructures the attention computation using a technique called tiling. 
Instead of computing the full N x N attention matrix at once, it processes attention in small blocks that fit entirely in SRAM.

The algorithm:

1. Divide Q, K, and V matrices into blocks that fit in SRAM.
2. For each block of Q, load it into SRAM once.
3. Iterate over blocks of K and V, computing partial attention scores in SRAM.
4. Accumulate the softmax and output incrementally using an online softmax algorithm (tracking running max and sum statistics).
5. Write only the final output back to HBM. The N x N attention matrix is never materialized.

The key mathematical trick is the online softmax. Normal softmax requires knowing the maximum value across the entire row before you can compute any outputs. The online version maintains running statistics that get corrected as new blocks are processed, producing numerically identical results without needing the full row in memory at once.

The result: FlashAttention is exact (not an approximation), uses O(N) memory instead of O(N^2), and runs 2-4x faster than standard attention on GPUs like the A100.

## FlashAttention-2: Better Parallelism

FlashAttention-2 (July 2023) improved on the original by rethinking how work is distributed across the GPU's processing units.

The main changes:

- Reduced non-matmul FLOPs. The original FlashAttention spent a significant fraction of time on rescaling, softmax, and other non-tensor-core operations. V2 restructured the algorithm to minimize these.
- Better parallelism across the sequence length dimension. V1 parallelized over batch size and number of heads. V2 also parallelized over the sequence length, which matters for long sequences with small batch sizes (common in inference).
- Better work partitioning between warps within a thread block, reducing shared memory reads/writes.

These changes brought FlashAttention-2 to around 230 TFLOPs/s on an A100, roughly 2x faster than FlashAttention-1 and close to the theoretical maximum of the hardware.
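The online softmax from step 4 of the tiling algorithm can be sketched in plain Python for a single query row. This is a simplification meant to show the rescaling, not the real kernel. Values are scalars instead of vectors, and the function names are invented for the example.

```python
import math


def online_softmax_weighted_sum(scores, values, block_size):
    """Compute sum_i softmax(scores)[i] * values[i] one block at a time."""
    running_max = float("-inf")
    running_sum = 0.0  # sum of exp(score - running_max) over blocks seen so far
    acc = 0.0          # unnormalized weighted sum of values
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(s_blk))
        # Correct earlier partial results so they are relative to the new max.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc *= scale
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc += w * v
        running_max = new_max
    return acc / running_sum


def full_softmax_weighted_sum(scores, values):
    """Reference: needs the whole row before producing anything."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


scores = [0.3, 2.1, -1.0, 4.2, 0.0, 3.3, -2.5]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
blocked = online_softmax_weighted_sum(scores, values, block_size=3)
reference = full_softmax_weighted_sum(scores, values)
print(abs(blocked - reference) < 1e-9)
```

Only `running_max`, `running_sum`, and `acc` survive between blocks, which is why the full row of scores never has to sit in memory at once.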
## FlashAttention-3: Exploiting Hopper FlashAttention-3 (July 2024) was designed specifically for NVIDIA's Hopper architecture (H100 GPUs), which introduced new hardware capabilities that the previous versions couldn't use. Three key techniques: **Warp specialization with asynchronous execution.** Hopper has a new asynchronous programming model where different warp groups can overlap computation and memory transfers. FlashAttention-3 designates some warps as "producers" (loading data) and others as "consumers" (doing math), running them concurrently. **Interleaved block-wise matmul and softmax.** Instead of waiting for all matmuls to finish before computing softmax (or vice versa), FA-3 interleaves these operations to keep the tensor cores busy while softmax runs on the CUDA cores. **FP8 support with incoherent processing.** FA-3 adds block-wise FP8 quantization for attention, achieving ~1.2 PFLOPs/s in FP8 while keeping numerical error 2.6x lower than naive FP8 attention through a technique called incoherent processing (applying random orthogonal transformations to spread quantization error). The result is 740 TFLOPs/s in FP16 (75% hardware utilization) on H100. ## Flash-Decoding: Fixing the Decode Phase There's a specific problem during autoregressive decoding (generating one token at a time) that the main FlashAttention papers didn't fully address. During decoding, the query length is 1 (just the new token), but the key/value length can be very long (the entire context). Standard FlashAttention parallelizes over batch size, heads, and query length, but when query length is 1, there's almost nothing to parallelize. Flash-Decoding (October 2023) added a new parallelization dimension: splitting across the key/value sequence. The KV sequence is divided into chunks, each chunk computes partial attention with the single query in parallel, and the results are combined with a log-sum-exp correction. This is especially important for long-context inference. 
Without Flash-Decoding, a single decode step with 100K context might use less than 1% of the GPU's compute capacity. With it, you can actually saturate the hardware, achieving up to 8x faster decoding for long sequences. ## Why This Matters (And Why ASICs Go Further) FlashAttention and its descendants are now built into essentially every production LLM serving system. vLLM, TensorRT-LLM, SGLang, and every major inference provider uses some variant of these kernels. The impact is real: prefill got 2-4x faster, long-context decoding became usable, and 128K+ context windows went from impractical to standard. But here's the thing worth noting. FlashAttention exists because GPUs have a fundamental architectural mismatch for inference workloads. The entire family of techniques is about working around the bottleneck of moving data between HBM and SRAM on a chip that was designed for graphics rendering and general-purpose parallel compute. General Compute is the only neocloud built entirely on inference-optimized ASICs instead of NVIDIA GPUs. The memory bandwidth constraints that FlashAttention was built to solve are addressed at the hardware level on these chips. There's no need to tile around a slow memory bus because the memory architecture is purpose-built for the access patterns that transformer inference actually needs. Combined with our own optimizations like disaggregated inference, this is why we're fundamentally faster than GPU-based providers, even ones running FlashAttention. If you want to see the difference that purpose-built inference hardware makes, [sign up at generalcompute.com](https://generalcompute.com) and get $200 in free credit to try it out. 
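The split-KV idea behind Flash-Decoding, described above, comes down to a small merge step. In the sketch below, each KV chunk produces a locally normalized output plus its log-sum-exp. The chunks are then combined with weights derived from those log-sum-exp values. This is an illustrative single-query, scalar-value toy with invented function names, and the chunks run in a loop, not in parallel.

```python
import math


def partial_attention(scores, values):
    """One KV chunk: return (local log-sum-exp, locally normalized output)."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = sum(w * v for w, v in zip(weights, values)) / total
    return m + math.log(total), out


def combine(partials):
    """Merge chunk results, weighting each by exp(lse_chunk - max_lse)."""
    max_lse = max(lse for lse, _ in partials)
    weights = [math.exp(lse - max_lse) for lse, _ in partials]
    merged = sum(w * out for w, (_, out) in zip(weights, partials))
    return merged / sum(weights)


def split_kv_attention(scores, values, chunk_size):
    # Each chunk is independent, so a GPU could process them concurrently.
    partials = [
        partial_attention(scores[i:i + chunk_size], values[i:i + chunk_size])
        for i in range(0, len(scores), chunk_size)
    ]
    return combine(partials)


scores = [1.5, -0.2, 3.0, 0.7, 2.2, -1.1, 0.4, 2.9]
values = [0.5, 1.5, -2.0, 4.0, 0.0, 3.0, 1.0, -1.0]
split_result = split_kv_attention(scores, values, chunk_size=3)
_, single_pass = partial_attention(scores, values)
print(abs(split_result - single_pass) < 1e-9)
```

Because the chunks share nothing until `combine`, a long context gives the hardware many independent pieces of work even when the query length is 1.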
## Papers and References - [FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness](https://arxiv.org/abs/2205.14135) (Dao et al., 2022) - [FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning](https://arxiv.org/abs/2307.08691) (Dao, 2023) - [FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision](https://arxiv.org/abs/2407.08608) (Shah et al., 2024) - [Flash-Decoding for long-context inference](https://crfm.stanford.edu/2023/10/12/flashdecoding.html) (Stanford CRFM, 2023) --- # Build a Real-Time Voice AI Agent with General Compute > A step-by-step tutorial for building a voice AI agent with sub-500ms response times. Plus: why General Compute is the only provider fast enough to use reasoning models in a voice pipeline. - Author: General Compute - Published: 2026-03-20 - Tags: voice-ai, tutorial, agents - Canonical: https://www.generalcompute.com/blog/build-a-real-time-voice-ai-agent --- Most voice AI feels sluggish. You say something, wait a beat too long, and the illusion of a natural conversation breaks. The problem usually isn't the speech-to-text or text-to-speech models. It's the LLM inference in the middle. In this tutorial, we'll build a real-time voice AI agent that responds in under 500ms end-to-end. We'll also show something that no other inference provider can currently offer: using a reasoning model in a voice pipeline without blowing through your latency budget. ## How Voice AI Agents Work A voice AI agent is a three-stage pipeline: 1. **Speech-to-Text (STT):** Converts the user's audio into text. Typical latency: 100-300ms. 2. **LLM Inference:** Processes the transcribed text and generates a response. Typical latency: 200-2000ms. 3. **Text-to-Speech (TTS):** Converts the LLM's text response back into audio. Typical latency: 100-300ms. The LLM step accounts for 50-70% of total latency in most setups. Human conversational turn-taking has a natural gap of about 200-300ms. 
Anything above a second feels like you're talking to someone on a bad connection. Anything above two seconds and users start checking if the thing is frozen. The critical insight: TTS needs to start playing as soon as the first tokens arrive from the LLM. You stream tokens out of the model and into the speech synthesizer in real time. This means time-to-first-token (TTFT) matters more than total generation time for perceived responsiveness. In voice, TTFT directly determines time-to-first-audio-token (TTFAT), which is what the user actually perceives. ## The Reasoning Model Problem in Voice AI Here's something that doesn't get talked about enough in the voice AI space: everyone is stuck using basic chat models. Reasoning models (DeepSeek R1, Qwen QwQ, models with chain-of-thought) produce significantly better answers for complex queries. They think through problems step by step before responding. For a customer support agent that needs to reason about a billing issue, or a medical triage bot that needs to weigh symptoms, the quality difference between a standard chat model and a reasoning model is substantial. But reasoning models have a problem for voice: they think before they speak. That thinking phase adds hundreds of milliseconds to multiple seconds of latency before the first useful token comes out. On most inference providers, the TTFT for a reasoning model is so high that it completely destroys the conversational experience. You'd be asking users to sit in silence for 3-5 seconds while the model thinks. That's unusable for voice. This is why virtually every voice AI company today is limited to standard chat models. The TTFAT budget is too tight for reasoning on slow infrastructure. **With General Compute, the math changes. Our inference is fast enough that you can run a reasoning model and still hit voice-grade latency targets.** The thinking phase that takes 2-3 seconds on other providers happens in a few hundred milliseconds on our infrastructure. 
That means you can give your voice agent the ability to actually reason through complex questions while still responding fast enough to maintain natural conversation flow.

This is a meaningful capability gap. Your competitors' voice agents are limited to pattern-matching with chat models. Yours can think.

## Choosing the Stack

For this tutorial we'll use:

- **STT:** [Deepgram](https://deepgram.com) -- fast streaming transcription, generous free tier
- **LLM:** Llama 3.3 70B via General Compute (and optionally a reasoning model for complex queries)
- **TTS:** [Cartesia Sonic](https://cartesia.ai) -- low-latency, high-quality streaming voice synthesis
- **Framework:** [Pipecat](https://pipecat.ai) -- open-source Python framework for voice AI pipelines
- **Transport:** [Daily](https://daily.co) -- WebRTC, built into Pipecat for browser-based interaction

Why Pipecat? It handles the plumbing of wiring STT, LLM, and TTS together with proper streaming, interruption handling, and voice activity detection. It supports OpenAI-compatible providers out of the box, which means General Compute works with no custom integration.

## Setting Up

You'll need Python 3.10+ and API keys for each service.

Install the dependencies:

```bash
pip install "pipecat-ai[daily,deepgram,cartesia,openai]"
```

Set up your environment variables:

```bash
export GENERAL_COMPUTE_API_KEY="your-gc-api-key"
export DEEPGRAM_API_KEY="your-deepgram-api-key"
export CARTESIA_API_KEY="your-cartesia-api-key"
export DAILY_API_KEY="your-daily-api-key"
```

The General Compute API key works just like an OpenAI key. Sign up at [generalcompute.com](https://generalcompute.com) to get one.

## Building the Agent

Here's the full agent. We'll walk through each part below.
```python
import asyncio
import os

from pipecat.frames.frames import LLMMessagesFrame
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask, PipelineParams
from pipecat.services.openai import OpenAILLMService
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.cartesia import CartesiaTTSService
from pipecat.transports.services.daily import DailyTransport, DailyParams


async def main():
    # Transport -- WebRTC via Daily
    transport = DailyTransport(
        room_url="",  # Will be created automatically
        token="",
        bot_name="Voice Agent",
        params=DailyParams(audio_out_enabled=True, audio_in_enabled=True),
    )

    # Speech-to-Text
    stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))

    # LLM -- General Compute via OpenAI-compatible API
    llm = OpenAILLMService(
        api_key=os.getenv("GENERAL_COMPUTE_API_KEY"),
        base_url="https://api.generalcompute.com/v1",  # base URL includes /v1
        model="llama-3.3-70b",
    )

    # Text-to-Speech
    tts = CartesiaTTSService(
        api_key=os.getenv("CARTESIA_API_KEY"),
        voice_id="79a125e8-cd45-4c13-8a67-188112f4dd22",  # Friendly voice
    )

    # System prompt -- keep it concise for voice
    messages = [
        {
            "role": "system",
            "content": (
                "You are a helpful voice assistant. Keep your responses "
                "concise -- one to two sentences when possible. Be natural "
                "and conversational. Don't use markdown, bullet points, or "
                "formatting since your responses will be spoken aloud."
            ),
        }
    ]

    # Build the pipeline: STT -> LLM -> TTS
    pipeline = Pipeline([
        transport.input(),
        stt,
        llm,
        tts,
        transport.output(),
    ])

    task = PipelineTask(
        pipeline,
        params=PipelineParams(
            allow_interruptions=True,
            enable_metrics=True,
        ),
    )

    # Send the initial context to prime the LLM
    await task.queue_frame(LLMMessagesFrame(messages))

    runner = PipelineRunner()
    await runner.run(task)


if __name__ == "__main__":
    asyncio.run(main())
```

### The LLM Configuration

The important part is here:

```python
llm = OpenAILLMService(
    api_key=os.getenv("GENERAL_COMPUTE_API_KEY"),
    base_url="https://api.generalcompute.com/v1",
    model="llama-3.3-70b",
)
```

Because General Compute's API is OpenAI-compatible, Pipecat's built-in OpenAI service works without modifications. You point it at GC's base URL (`https://api.generalcompute.com/v1`, including the `/v1` path) and you're done. If you're already using another OpenAI-compatible provider, switching is a one-line change.

### Swapping in a Reasoning Model

Want your voice agent to actually think through complex questions? Change the model:

```python
llm = OpenAILLMService(
    api_key=os.getenv("GENERAL_COMPUTE_API_KEY"),
    base_url="https://api.generalcompute.com/v1",
    model="deepseek-r1-0528",
)
```

On General Compute, DeepSeek R1's thinking phase is fast enough that the additional latency stays within voice-grade bounds. On other providers, this same model would add seconds of silence before the first word is spoken.

You can also build a hybrid approach: route simple queries to a fast chat model and complex queries to a reasoning model. Pipecat's pipeline is modular enough to support this with a classification step before the LLM.

### Prompt Engineering for Voice

Voice agents need different prompts than chat agents. A few things to keep in mind:

- **Short responses.** A three-paragraph answer that looks great in a chat UI is painful to listen to. Instruct the model to keep responses to one or two sentences.
- **No formatting.** Markdown, bullet points, and numbered lists don't translate to speech.
  Tell the model to write in plain, conversational language.
- **Conversational tone.** Written text and spoken text sound different. "I'd be happy to assist you with that" sounds robotic when spoken aloud. "Sure, here's what I found" sounds natural.

### Handling Interruptions

Real conversations involve interruptions. A user might start talking while the agent is still responding. Pipecat handles this through Voice Activity Detection (VAD). When it detects the user speaking, it stops the current TTS output and processes the new input.

This is enabled with `allow_interruptions=True` in the pipeline params. Without it, the agent would finish its entire response before listening again, which feels unnatural.

## Measuring Latency

Once your agent is running, you'll want to measure where time is being spent. Pipecat's `enable_metrics=True` flag logs timing for each pipeline stage.

The metrics you care about:

- **TTFT (Time to First Token):** How long after STT completes does the LLM start generating? This is the single most important number for voice AI.
- **TTFAT (Time to First Audio Token):** End-to-end time from user silence to agent audio. This is what the user actually experiences.
- **TPS (Tokens Per Second):** How fast the LLM generates output. Higher TPS means the spoken response keeps up without awkward pauses mid-sentence.

You can also measure the LLM step in isolation. The version below does not go through Pipecat. It calls the API directly with the `openai` package's `AsyncOpenAI` client, created with your General Compute key and `base_url="https://api.generalcompute.com/v1"`:

```python
import time

from openai import AsyncOpenAI


async def measure_llm_latency(client: AsyncOpenAI, model: str, messages: list) -> None:
    """Print TTFT and total generation time for one streamed completion."""
    start = time.perf_counter()
    first_token_time = None

    response = await client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
    )
    async for chunk in response:
        # Skip chunks that carry no text (for example role-only or final chunks).
        if not chunk.choices or not chunk.choices[0].delta.content:
            continue
        if first_token_time is None:
            first_token_time = time.perf_counter()
            print(f"TTFT: {(first_token_time - start) * 1000:.0f}ms")

    total = time.perf_counter() - start
    print(f"Total generation: {total * 1000:.0f}ms")
```

With General Compute serving Llama 3.3 70B, you should see TTFT in the 80-150ms range.
With a reasoning model like DeepSeek R1, the thinking overhead adds some latency, but it stays well under the 500ms TTFAT threshold that voice requires. Try the same reasoning model on another provider and you'll see why this matters.

## Running It

Save the agent code as `agent.py`, then start it:

```bash
python agent.py
```

Pipecat will create a Daily room and print the URL. Open it in your browser, allow microphone access, and start talking.

## Production Considerations

This tutorial gives you a working prototype. Here's what to think about for production.

**Scaling concurrent sessions.** Each voice session needs its own pipeline instance. Daily and LiveKit both handle WebRTC scaling, but you'll need to manage pipeline instances. Consider running each session as a separate process or using an orchestrator.

**Model routing.** In production, you probably want a mix of models. Simple queries ("what are your hours?") go to a fast 8B model. Complex queries ("I was charged twice and my refund was applied to the wrong account") get routed to a reasoning model. General Compute serves multiple model sizes, so you can route dynamically based on query complexity.

**Function calling.** For real applications, you'll want the agent to do things: check a calendar, look up an order, book a reservation. General Compute's API supports function calling, so you can add tools to the LLM step and the agent will call them as part of the conversation.

**Phone integration.** For phone-based agents, swap Daily for Twilio as your transport layer. The rest of the pipeline stays the same.

**Persistent memory.** For multi-turn conversations that span sessions, store the message array to a database keyed by user or session ID and reload it when they come back.

## Why This Matters

The voice AI space is growing fast, but almost every company in it is constrained by their inference provider. They're all using standard chat models because reasoning models are too slow on available infrastructure.
They're all designing around the same latency limitations. General Compute removes that constraint. You get fast enough inference to use the best models available, including reasoning models, while staying within the tight latency requirements that voice demands. That means your voice agents can be both fast and smart, which is a combination that wasn't previously available. --- The full code from this tutorial works out of the box with a General Compute API key. [Sign up at generalcompute.com](https://generalcompute.com), grab your key, and you can have a working voice agent running in about 15 minutes. --- # How Coding Agents Depend on Inference Speed > Coding agents make dozens of sequential LLM calls per task. Every millisecond of inference latency compounds across each step, making speed the single biggest infrastructure bottleneck for AI-powered developer tools. - Author: General Compute - Published: 2026-03-19 - Tags: coding-agents, inference, developer-tools - Canonical: https://www.generalcompute.com/blog/how-coding-agents-depend-on-inference-speed --- OpenAI just signed a $10 billion, 750-megawatt deal with Cerebras to run Codex, their coding agent, on Cerebras' inference chips. The largest AI company in the world looked at their coding agent product and decided that general-purpose GPU infrastructure wasn't fast enough. They needed specialized hardware built for inference speed. That decision tells you everything about what matters for AI coding tools. Coding agents aren't chatbots. They don't make a single API call and return a response. They run multi-step loops: read code, reason about it, write a fix, run tests, check the results, and iterate. A typical task might involve 8 to 15 sequential LLM calls, and each one blocks the next. When every call in that chain is slow, the delays compound. And when they compound enough, the tool goes from feeling like a collaborator to feeling like something you're babysitting. 
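A rough way to see how the delays compound is to model each call as time to first token (TTFT) plus output tokens divided by tokens per second (TPS), then sum over the sequential chain. Everything in this sketch is hypothetical: the `Call` type, the 12-call task mix, and both provider profiles are made-up numbers for illustration, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Call:
    output_tokens: int


def call_seconds(call: Call, ttft_s: float, tps: float) -> float:
    # Wait for the first token, then stream the rest at the generation rate.
    return ttft_s + call.output_tokens / tps


def task_seconds(calls, ttft_s: float, tps: float) -> float:
    # Each call blocks the next, so the total is a plain sum.
    return sum(call_seconds(c, ttft_s, tps) for c in calls)


# Hypothetical task: nine short tool-selection calls, three long code-generation calls.
task = [Call(20)] * 9 + [Call(400)] * 3

slow = task_seconds(task, ttft_s=0.8, tps=60)     # assumed slower profile
fast = task_seconds(task, ttft_s=0.15, tps=1000)  # assumed faster profile
print(f"slow: {slow:.1f}s  fast: {fast:.2f}s")
```

In this toy setup the short calls are dominated by TTFT and the long calls by TPS, so improving only one of the two numbers leaves a large share of the total untouched.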
## What Happens Inside a Coding Agent

To understand why speed matters so much, you need to understand what a coding agent actually does when you give it a task like "fix this failing test" or "add pagination to this endpoint."

The loop looks something like this:

1. The agent reads the relevant files and reasons about the codebase.
2. It plans an approach (sometimes explicitly, sometimes implicitly through chain-of-thought).
3. It calls tools: reads files, searches for references, inspects error output. Each tool call requires an LLM inference to decide what to do next.
4. It generates code changes.
5. It validates by running tests, linters, or type checkers.
6. It reads the results and decides whether to iterate or finish.

Steps 2 through 6 repeat multiple times per task. Some of these LLM calls are short (tool selection, classification) and some are long (code generation, planning), but they're all on the critical path. Nothing can happen in parallel because each step depends on the output of the previous one.

A typical coding agent task involves 8 to 15 of these sequential calls. More complex tools like Devin or SWE-Agent can run 50 to 100+ steps for a single task.

## The Compounding Latency Problem

Here's where the math gets uncomfortable.

If each LLM call takes 2 seconds and a task requires 12 calls, that's 24 seconds of pure inference time, not counting tool execution. At 500ms per call, the same task takes 6 seconds. At 200ms per call, it's 2.4 seconds.

| Latency per call | 10 steps | 15 steps | 25 steps |
|---|---|---|---|
| 2,000ms | 20s | 30s | 50s |
| 500ms | 5s | 7.5s | 12.5s |
| 200ms | 2s | 3s | 5s |

The difference between the top and bottom row of that table is the difference between a tool that developers actually use and one they disable after a week.

This is fundamentally different from a chatbot, where you make one call and wait. With agents, latency doesn't just add up linearly. It determines whether the entire workflow is practical.
A 25-step agent running at 2 seconds per call takes nearly a minute of inference time alone. Most developers won't wait that long. They'll just do it manually.

## Both TTFT and TPS Matter (For Different Reasons)

Coding agents make two kinds of LLM calls, and each one cares about a different speed metric.

**Short calls (tool selection, classification, small edits):** These are latency-sensitive. The model needs to quickly decide which file to read or which tool to call. Time-to-first-token (TTFT) dominates here because the total output is small. A high TTFT means the agent sits idle for hundreds of milliseconds before it even starts generating a one-line response.

**Long calls (code generation, planning, large refactors):** These care more about tokens-per-second (TPS). The model is generating 50 to 500 tokens of code, and TPS determines how long that takes. Slow TPS means watching code appear character by character in your editor.

Coding agents need both metrics to be fast. An inference provider that has great TPS but slow TTFT (or vice versa) will still feel sluggish for agentic workloads.

## Why OpenAI Moved Codex to Specialized Inference Hardware

The OpenAI-Cerebras deal is worth paying attention to because of what it signals about infrastructure requirements for coding agents.

OpenAI has access to more GPU compute than almost any other company on the planet. They have massive clusters of NVIDIA hardware. And yet, when it came to running Codex at the speed and scale their coding agent needed, they went outside their existing infrastructure to a company that builds specialized inference chips.

The deal is $10 billion and 750 megawatts of power capacity. This is not a small experiment or a pilot program. This is OpenAI making a serious long-term bet that coding agents specifically need inference infrastructure that's faster than what standard GPU setups can deliver.

The reasoning is straightforward when you understand the agentic loop.
Codex doesn't just generate code. It reads files, plans approaches, calls tools, writes code, validates results, and iterates. Each step is a sequential inference call. The total user-facing latency is the sum of all those calls plus tool execution time.

When your product's core experience depends on a loop of 10 to 20 LLM calls completing fast enough to feel interactive, the speed of each individual call becomes your most important infrastructure constraint.

This is the same dynamic playing out across the coding agent space. Cursor chose Fireworks specifically for low latency. Every serious coding tool company treats inference speed as a first-class infrastructure requirement, not an afterthought.

## Why Standard Infrastructure Falls Short

Most cloud GPU providers and inference APIs are optimized for throughput (serving many requests efficiently) rather than latency (serving individual requests fast). These are different optimization targets, and they often conflict.

The specific problems coding agents hit:

**Queuing delays.** Shared inference services process requests in batches. When the system is under load, your request sits in a queue before it starts executing. This adds unpredictable latency that compounds across agent steps.

**Cold starts.** Serverless GPU providers sometimes need to load models into memory when a request arrives. This can add seconds of latency to the first call, which is exactly when the user is watching.

**Batching tradeoffs.** High-throughput providers batch multiple requests together for GPU efficiency. This is great for aggregate throughput but increases latency for individual requests, which is what matters for interactive agents.

**Inconsistent tail latency.** P50 (median) latency might look fine, but agents make many sequential calls per task. If your p99 latency is 3x your p50, the agent will regularly hit a slow call somewhere in its loop, and the user will notice.
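To put a rough number on the tail-latency point: by definition about 1% of calls land at or above p99, so the chance that a task hits at least one of them grows with the number of sequential calls. The sketch below estimates that chance. It assumes calls are independent, which real traffic often is not (slow periods tend to cluster), so treat it as a back-of-the-envelope estimate. The function name is illustrative, not part of any API.

```python
def prob_at_least_one_slow_call(steps: int, slow_fraction: float = 0.01) -> float:
    """Chance that at least one of `steps` sequential calls is a tail call.

    `slow_fraction` is the share of calls at or above the percentile of
    interest (0.01 for p99). Assumes calls are independent, which is a
    simplification.
    """
    if steps < 0:
        raise ValueError("steps must be non-negative")
    if not 0.0 <= slow_fraction <= 1.0:
        raise ValueError("slow_fraction must be between 0 and 1")
    # P(no slow call in any step) = (1 - slow_fraction) ** steps
    return 1.0 - (1.0 - slow_fraction) ** steps


if __name__ == "__main__":
    for steps in (10, 15, 25, 100):
        p = prob_at_least_one_slow_call(steps)
        print(f"{steps:>3} steps: {p:.1%} chance of hitting a p99 call")
```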
What coding agents actually need from their inference provider:

- Consistently low TTFT (under 200ms)
- High tokens-per-second for code generation
- Low p99 latency, not just low median
- Always-warm models with no cold starts
- Support for long context windows (codebases are large)

## Speed Determines Developer Experience

There's a well-documented relationship between tool latency and developer productivity. Research on developer flow states shows that interruptions longer than about 10 seconds break concentration. A coding agent that takes 30 seconds per task doesn't just feel slow. It actively disrupts the developer's workflow.

Speed also determines trust. Developers adopt tools that feel responsive and abandon tools that feel laggy. GitHub Copilot's initial success was partly about model quality, but it was also about the fact that inline completions appeared almost instantly. The speed was part of what made it feel like the tool understood what you were writing.

There's also a cost argument that's easy to miss. Faster agents often produce better results because the developer can course-correct sooner. If the agent takes 5 seconds per loop, the developer can spot a wrong direction after one or two iterations and redirect. If it takes 30 seconds per loop, they've already wasted a minute before they realize the agent is going down the wrong path.

## This Problem Gets Worse, Not Better

The trend in AI-powered development tools is toward more autonomy, not less. Agents are taking on larger, more complex tasks that require more steps. Multi-agent architectures where a planner, coder, reviewer, and tester collaborate on a task multiply the number of inference calls further.

Background agents that run tasks asynchronously (like Cursor's background agent or Devin) still need to finish in minutes, not hours, to be useful. A 100-step agent at 2 seconds per step takes over 3 minutes of inference time. At 200ms per step, it takes 20 seconds.
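The step-count arithmetic used throughout this post is steps multiplied by per-call latency. A minimal sketch that reproduces it, including the 100-step case, is below. It counts inference time only and ignores tool execution, and the function name is illustrative.

```python
def inference_time_s(steps: int, latency_ms: float) -> float:
    """Total sequential inference time in seconds, excluding tool execution."""
    return steps * latency_ms / 1000


if __name__ == "__main__":
    step_counts = (10, 15, 25, 100)
    print("latency/call | " + " | ".join(f"{n} steps" for n in step_counts))
    for latency_ms in (2000, 500, 200):
        row = " | ".join(
            f"{inference_time_s(n, latency_ms):g}s" for n in step_counts
        )
        print(f"{latency_ms}ms | {row}")
```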
Longer context windows are also becoming standard. As models handle 128K to 1M+ token contexts to ingest entire codebases, maintaining speed at those context lengths becomes a harder engineering problem. Providers that can't serve long-context requests fast enough will be unusable for the next generation of coding tools.

## What Fast Inference Makes Possible

When inference is fast enough, coding agents can do things that aren't practical on slower infrastructure. They can run more iterations per task, trying multiple approaches and picking the best one. They can include validation steps (run tests, check types, lint) inside the loop without making the total time unacceptable. They can use reasoning models that think through complex problems step by step, where the thinking overhead would normally make the agent too slow.

At General Compute, our infrastructure is built specifically for these workloads. Low TTFT, high TPS, consistent tail latency, and always-warm models. The difference shows up directly in how coding agents perform: more steps per second, faster task completion, and a developer experience that feels responsive rather than something you wait on.

---

If you're building or deploying a coding agent, the inference provider you choose determines the ceiling of your tool's performance. [Try General Compute](https://generalcompute.com) and benchmark it against what you're currently using. The compound effect across a multi-step agent loop is where the difference really shows up.

---

# Why Inference Speed is the New Moat

> Model quality has commoditized. The real competitive advantage in AI is how fast your infrastructure can deliver results. Inference speed is becoming the defining moat for AI-native products.
- Author: General Compute
- Published: 2026-03-18
- Tags: inference, infrastructure
- Canonical: https://www.generalcompute.com/blog/why-inference-speed-is-the-new-moat

---

A voice AI assistant that takes 2 seconds to respond feels like talking to a call center IVR. One that responds in 200ms feels like talking to a person. The model behind both of them might be identical. The difference is the inference.

The AI industry spent 2022 through 2024 in an arms race over model quality. GPT-4 vs. Claude vs. Gemini vs. Llama. That race produced incredible models, and it also reached a point of diminishing returns for most production use cases. The top five models are now roughly interchangeable for the majority of real-world tasks.

The new competitive advantage is speed.

## Model Quality Has Plateaued (For Most Use Cases)

This would have been a controversial claim two years ago, but it's increasingly obvious: for the majority of production AI applications, model quality is no longer the bottleneck.

The open-source model explosion (Llama 4, Qwen 3, DeepSeek R1 and V3, Mistral, Gemma) has closed the gap with proprietary models to the point where the difference between the top five is invisible to end users for tasks like chatbots, summarization, code completion, and classification. On the [LMSYS Chatbot Arena](https://lmarena.ai/) leaderboard, open-source models regularly trade places with proprietary ones in human preference rankings.

When multiple models can do the job well enough, the question changes. It goes from "which model is smartest?" to "which one can deliver that intelligence to my users fastest?"

## The Concept of Latency Debt

Technical debt is the compounding cost of shipping messy code you'll eventually have to clean up. Latency debt works the same way, but it compounds across your entire AI stack and is harder to notice.
Latency debt is the cumulative cost in user experience, conversion rates, product capability, and engineering complexity that builds up when your inference is slower than it should be. It compounds in three ways.

**UX debt.** Users tolerate about 200 to 500ms for interactive AI responses. Beyond that, engagement drops measurably. Google's research showed that a 500ms increase in search latency caused a 20% drop in traffic. Amazon found that every 100ms of additional latency cost roughly 1% in revenue. If users abandon web pages at 3 seconds, imagine what happens to an AI chatbot that takes 8 seconds to respond.

**Architecture debt.** Slow inference forces your engineering team into workarounds. You add caching layers. You pre-compute responses. You use smaller, weaker models. You batch requests instead of streaming. You flatten your agent pipelines to avoid multi-step calls. None of these are decisions you'd make if inference were fast. They're concessions to a constraint you've accepted.

**Opportunity debt.** This is the most insidious form. Entire categories of applications become impossible when inference is too slow. You can't build real-time voice AI, responsive coding agents, or interactive game NPCs on 2-second inference. You don't build features you know will feel broken, so you never discover what your product could have been.

The worst part is that teams often don't realize they're paying this tax. They've never experienced truly fast inference, so they assume the limitations are inherent to the technology.

## Speed Enables Entirely New Application Categories

Below certain latency thresholds, new kinds of applications become possible. Speed doesn't just make existing apps better. It makes new ones feasible.

### Voice AI and Conversational Agents

Human conversation has a natural turn-taking cadence of about 200 to 300ms. AI voice agents need to match this to feel natural.
The growth of voice AI startups like Vapi, Bland, Retell, and OpenCall is gated by one thing: how fast the LLM in their pipeline can respond.

The pipeline is simple: speech-to-text, then LLM inference, then text-to-speech. The LLM step typically accounts for 50 to 70% of total latency. If you cut time-to-first-token from 400ms to 80ms, the entire pipeline goes from "awkward pause" to "natural conversation." Every 100ms of added inference latency makes a voice agent feel measurably less human.

### Coding Agents and Developer Tools

Coding agents like Cursor, GitHub Copilot, and Claude Code don't make a single API call per task. They run multi-step loops: read code, reason about it, write a fix, run tests, check results, iterate. A typical task might involve 8 to 15 sequential LLM calls.

The math here is straightforward. At 2 seconds per call with 10 steps, that's 20 seconds of waiting. At 500ms per call, it's 5 seconds. At 200ms, it's 2 seconds. The fast version feels like working with another engineer. The slow version feels like waiting for CI to finish.

Cursor's team has been vocal about latency being their top infrastructure priority, sometimes even above model quality. They'll use a slightly less capable model if it's significantly faster, because developer experience falls apart quickly with added lag.

### Real-Time and Interactive AI

AI in gaming (NPC dialogue), robotics (real-time decisions), financial services (market analysis), and live content moderation all require sub-second inference. These aren't niche use cases. They represent some of the highest-value applications of AI.

Any workflow that chains multiple LLM calls pays the per-call latency once for every call in the chain. A pipeline with five sequential calls where each takes 2 seconds adds up to 10 seconds, which is unusable for anything interactive.

Below roughly 200ms time-to-first-token, users perceive AI as instant. That's the bar infrastructure needs to clear.
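A rough way to reason about a chained workflow is to model each call as time-to-first-token plus output tokens divided by tokens per second, then sum over the chain. The sketch below does that. It is a simplification that ignores network overhead, queuing, and prompt-length effects. The `Provider` type, the five-call mix, and the 50 tokens-per-second figure are hypothetical inputs for illustration, not measurements. The 400ms and 80ms TTFT values echo the voice example above.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Provider:
    """Hypothetical provider profile used only for this estimate."""
    ttft_s: float        # time to first token, in seconds
    tokens_per_s: float  # sustained output rate


def call_latency_s(provider: Provider, output_tokens: int) -> float:
    """Approximate one call as TTFT plus generation time."""
    return provider.ttft_s + output_tokens / provider.tokens_per_s


def chain_latency_s(provider: Provider, output_tokens_per_call: Iterable[int]) -> float:
    """Sequential calls block each other, so their latencies simply add."""
    return sum(call_latency_s(provider, n) for n in output_tokens_per_call)


if __name__ == "__main__":
    # Hypothetical five-call chain: three short decisions, two long generations.
    calls = [20, 20, 400, 20, 300]
    slow = Provider(ttft_s=0.4, tokens_per_s=50.0)      # illustrative numbers
    fast = Provider(ttft_s=0.08, tokens_per_s=1000.0)   # illustrative numbers
    print(f"slow chain: {chain_latency_s(slow, calls):.2f}s")
    print(f"fast chain: {chain_latency_s(fast, calls):.2f}s")
```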
## Speed Creates Compounding Business Advantages

Faster inference creates advantages that stack over time and are hard for competitors to replicate.

**Network effects.** Faster inference leads to better UX, which leads to more users, which generates more data for optimization, which feeds back into faster inference. This flywheel is real and it favors teams that invest in speed early.

**Switching costs.** Once a product is built around fast, multi-step inference (real-time voice, agentic coding, interactive search), migrating to a slower provider means re-architecting the product. Speed becomes load-bearing infrastructure that's expensive to replace.

**Cost efficiency.** This is counterintuitive, but faster inference can actually be cheaper per query. Purpose-built infrastructure achieves higher hardware utilization, which means more tokens per second per dollar. Speed and cost efficiency aren't always tradeoffs. With the right infrastructure, they're complementary.

**Market signals.** The industry is voting with its feet. Groq captured massive developer attention purely on speed. Fireworks AI partnered with Cursor specifically because of low latency. Together AI, Cerebras, and others are all competing on tokens-per-second. The market has made it clear: speed wins.

## Why Custom Infrastructure Matters

Running inference on general-purpose cloud GPU instances leaves a lot of performance on the table. AWS, GCP, and Azure are optimized for flexibility, not for making inference as fast as possible.

Purpose-built inference infrastructure looks different:

- GPU configurations and networking optimized specifically for inference workloads, not training
- Custom kernel-level optimizations for the decode path
- Inference-specific serving with aggressive memory management
- Always-warm models with no cold starts
- Geographic distribution for consistently low latency

This is the approach we've taken at General Compute.
Our infrastructure is designed to deliver inference as fast as the hardware allows. The result shows up in benchmarks, but more importantly, it shows up in the products people build on top of it.

## Looking Ahead

We're moving toward a future where inference speed stops being a constraint entirely. When that happens, a few things change.

Agents become truly autonomous. Multi-step workflows that currently take minutes will finish in seconds, enabling agents that can run 50-step tasks while you watch.

AI-native interfaces start replacing traditional UIs. When AI can respond as fast as a database query, there's less reason to pre-render static screens for every possible interaction.

Reasoning models reach their potential. Models like DeepSeek R1 and Qwen QwQ spend more compute at inference time to produce better answers. Faster inference means more reasoning per second, which directly translates to smarter outputs.

The companies that are investing in inference speed now aren't just optimizing a metric. They're building the infrastructure that the next generation of AI applications will run on.

---

If you're building real-time AI, whether it's voice agents, coding tools, or agentic workflows, your inference provider is your bottleneck. [Try General Compute's API](https://generalcompute.com) and see what your product feels like when inference is fast.
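One way to see the difference is to measure it on your own prompts. The sketch below times a streamed response and reports time-to-first-token and a rough output rate. It assumes the OpenAI Python client, base URL, and `gpt-oss-120b` model name from the quick start, and it counts streamed chunks as a stand-in for tokens, which is only an approximation. `measure_stream` and `stream_chat_text` are illustrative helper names, not part of any SDK.

```python
import time
from typing import Iterable, Iterator, Optional


def measure_stream(chunks: Iterable[str]) -> dict:
    """Time a stream of text chunks: TTFT and chunks per second after the first."""
    start = time.perf_counter()
    first: Optional[float] = None
    count = 0
    for text in chunks:
        if not text:
            continue  # skip empty chunks
        if first is None:
            first = time.perf_counter()
        count += 1
    end = time.perf_counter()
    ttft = (first - start) if first is not None else None
    gen_time = (end - first) if first is not None else 0.0
    rate = (count - 1) / gen_time if count > 1 and gen_time > 0 else None
    return {"ttft_s": ttft, "chunks": count, "chunks_per_s": rate}


def stream_chat_text(prompt: str, model: str = "gpt-oss-120b") -> Iterator[str]:
    """Yield text pieces from a streamed chat completion.

    This is a generator, so the request is only sent once iteration starts,
    which means measure_stream's timer covers the full round trip.
    """
    from openai import OpenAI  # imported lazily; requires the openai package

    client = OpenAI(
        base_url="https://api.generalcompute.com/v1",
        api_key="your-gc-api-key",  # placeholder, as in the quick start
    )
    stream = client.chat.completions.create(
        model=model,
        stream=True,
        messages=[{"role": "user", "content": prompt}],
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```

To compare providers, run `measure_stream(stream_chat_text("Hello"))` several times against each one and look at the spread, not just a single run.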