
Streaming for Agents: Why Partial Results Change the UX

General Compute

Streaming, in the chat product sense, is a solved idea. The model emits tokens one at a time, the client appends them to a textarea, and the user reads along while the model is still thinking. The win is psychological: nothing is actually faster, but the wait feels shorter, and a partially rendered answer is sometimes enough for the user to decide it is wrong and stop the generation.

Streaming in an agentic pipeline is a different problem. An agent is not just writing prose to a screen. It is calling tools, parsing structured output, deciding what to do next, and sometimes passing intermediate state to another model call. The "user" of the stream is often another piece of code, not a human eye. Once you accept that, a lot of options open up that chat-style streaming never needed: streaming a tool call's arguments while it is still being decoded, pipelining a downstream step against an upstream one, cancelling early when a partial result is already enough, and surfacing structured progress to the human watching the agent run.

This post is about what changes when you stream in an agentic system instead of a chat one, and why partial results are worth the complexity.

What "streaming" means in different layers

Inference servers stream tokens. That part has not changed. The OpenAI-style stream: true flag still produces a sequence of server-sent events, each carrying a delta. Anthropic's streaming format does the same thing with a different schema. Most other vendors follow one of these two shapes.

What has changed is what an agent does with those events.

In a chat product, the consumer of the stream is a renderer. It concatenates deltas, runs them through a markdown parser, and paints them on a screen. The agent layer is invisible because there is no agent layer.

In an agentic system, the stream feeds at least three different consumers, often at the same time:

  • A user interface, if there is a human watching, which wants something human-readable to display.
  • A tool dispatcher, which is watching for the model to emit a tool call so it can start executing it.
  • An orchestrator, which is deciding whether the model's output is good enough to move to the next step or whether it should be cancelled and retried.

Each of these consumers has a different definition of "useful partial result." The renderer wants tokens. The tool dispatcher wants a complete function name and a parseable arguments object. The orchestrator wants enough output to evaluate confidence.

A well-built agent treats the stream as a multi-consumer event source, not as a string that is slowly getting longer.
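
A sketch of what that looks like, assuming already-parsed stream events arrive as dicts; the event fields and the consumer loops here are illustrative, not any particular SDK's shapes:

```python
import asyncio
from typing import AsyncIterator

async def fan_out(deltas: AsyncIterator[dict], *queues: asyncio.Queue) -> None:
    """Broadcast every stream event to each consumer's queue, then a sentinel."""
    async for event in deltas:
        for q in queues:
            q.put_nowait(event)
    for q in queues:
        q.put_nowait(None)

async def render_loop(q: asyncio.Queue) -> None:
    """Human-facing consumer: only cares about text deltas."""
    while (event := await q.get()) is not None:
        if text := event.get("text"):
            print(text, end="", flush=True)

async def tool_dispatch_loop(q: asyncio.Queue) -> None:
    """Machine-facing consumer: only cares about tool-call deltas."""
    while (event := await q.get()) is not None:
        if event.get("type") == "tool_call_delta":
            ...  # accumulate name/arguments, dispatch once parseable

async def run(deltas: AsyncIterator[dict]) -> None:
    render_q, tool_q = asyncio.Queue(), asyncio.Queue()
    await asyncio.gather(
        fan_out(deltas, render_q, tool_q),
        render_loop(render_q),
        tool_dispatch_loop(tool_q),
    )
```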

Partial tool calls

The most interesting use of streaming in agents is starting tool execution before the tool call is fully decoded.

When a model emits a tool call, it does not produce the function name and arguments atomically. It generates them as text, like everything else. The function name comes out token by token, then the arguments, which are usually JSON. With current models and current serving stacks, this can take anywhere from 50 ms to several seconds, depending on argument length.

If your agent waits for the full tool call before doing anything, the decode time and the tool's own latency are paid back to back: first the model finishes generating, then the tool starts running. If the tool itself is slow (a web search, a database query, a code execution sandbox), the user waits for the sum of the two.

There are two patterns that recover some of this time.

The first is speculative dispatch on function name. As soon as the function name is decoded but the arguments are still streaming, you can warm up the tool: open a database connection, load the model needed for the tool, fetch credentials. None of this depends on the arguments. By the time the arguments are fully decoded, the tool is already primed.
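
A sketch of the pattern, assuming tool call deltas arrive as (name fragment, arguments fragment) pairs and that each tool object exposes hypothetical warm_up() and run() coroutines:

```python
import asyncio
import json
from typing import AsyncIterator

async def dispatch_with_warmup(
    deltas: AsyncIterator[tuple[str, str]], tools: dict
) -> object:
    """Warm the tool as soon as its name is known, while arguments still stream.

    `deltas` yields (name_fragment, args_fragment) pairs; `tools` maps tool
    names to objects with hypothetical warm_up() and run(args) coroutines.
    """
    name, args_buf, warmup = "", "", None
    async for name_frag, args_frag in deltas:
        name += name_frag
        args_buf += args_frag
        # Assumption: the name is complete once argument fragments start arriving.
        if warmup is None and args_buf and name in tools:
            warmup = asyncio.create_task(tools[name].warm_up())
    if warmup is not None:
        await warmup                      # connection open, credentials fetched
    return await tools[name].run(json.loads(args_buf))
```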

The second is partial-argument execution for tools that allow it. A web search tool whose argument is a query string can start tokenizing and embedding the query as soon as the first few tokens of the query are decoded. If the model decodes "query": "fastest open source LLM" character by character, you can begin the search index lookup at "fastest open source" and refine when the rest arrives. For tools where the partial result is wrong but cheap to compute, this is worth it. For tools where partial input is destructive (a write, an email send, a payment), do not do this.
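
For the safe, read-only case, the pattern can be as simple as re-issuing the lookup as the decoded prefix grows and keeping only the last result; search here is a hypothetical, side-effect-free coroutine:

```python
import asyncio

async def speculative_search(query_prefixes, search):
    """Kick off read-only lookups as the streamed query grows; keep the last.

    `query_prefixes` yields successively longer query strings decoded from the
    tool-call arguments; `search` is a hypothetical coroutine that is cheap and
    side-effect free, so running it on a stale prefix only wastes a little
    compute (and often warms caches for the final call).
    """
    task = None
    async for query in query_prefixes:
        if task is not None and not task.done():
            task.cancel()          # a longer prefix arrived; drop the stale lookup
        task = asyncio.create_task(search(query))
    return await task if task is not None else None
```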

Both patterns require the inference server to actually stream tool call deltas. Some serving stacks do, some do not. The OpenAI Chat Completions API has supported tool call deltas for a while now, and Anthropic's streaming format includes incremental input JSON deltas for tool use blocks. If you are running open models behind vLLM or SGLang, check that the tool-calling parser is configured to emit deltas, not whole calls.
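
As a concrete example, this is roughly what consuming those deltas looks like with the openai Python SDK against an OpenAI-compatible endpoint; treat the model name and tool schemas as placeholders, and check the delta fields against your own serving stack:

```python
from openai import OpenAI

client = OpenAI()   # any OpenAI-compatible endpoint that emits tool call deltas

stream = client.chat.completions.create(
    model="MODEL_NAME",                      # placeholder
    messages=[{"role": "user", "content": "Find the fastest open source LLM"}],
    tools=[...],                             # your tool schemas go here
    stream=True,
)

calls: dict[int, dict[str, str]] = {}        # tool call index -> accumulated fields
for chunk in stream:
    if not chunk.choices:
        continue
    for tc in chunk.choices[0].delta.tool_calls or []:
        call = calls.setdefault(tc.index, {"name": "", "arguments": ""})
        if tc.function and tc.function.name:
            call["name"] += tc.function.name
            # The name is usually complete here: start warming the tool now.
        if tc.function and tc.function.arguments:
            call["arguments"] += tc.function.arguments
            # A growing JSON fragment: hand it to a partial parser on each delta.
```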

Pipelining agent steps

A multi-step agent looks like a small DAG. Step 1 produces output. Step 2 consumes that output and produces its own. Step 3 consumes step 2's output. In the simplest implementation, step 2 waits for step 1 to finish, step 3 waits for step 2, and the user waits for the whole chain.

When step 1 streams, you can sometimes start step 2 earlier. The catch is that you need to know which prefix of step 1's output is enough.

Consider a plan-then-execute agent. Step 1 produces a numbered list of subtasks. Step 2 is "for each subtask, dispatch a worker". If step 1 streams its plan, step 2 can start dispatching workers as soon as the first numbered item finishes streaming, without waiting for the whole plan. This is straightforward when the output structure is line-oriented.
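
A sketch of that, assuming text deltas from the planning step and a hypothetical dispatch_worker coroutine that executes one subtask:

```python
import asyncio
import re
from typing import AsyncIterator

NUMBERED_ITEM = re.compile(r"^\s*\d+[.)]\s+(.*\S)\s*$")

async def dispatch_as_plan_streams(
    text_deltas: AsyncIterator[str], dispatch_worker
) -> list:
    """Start a worker for each numbered plan item as soon as its line completes."""
    buffer, workers = "", []
    async for delta in text_deltas:
        buffer += delta
        # Everything before the last newline is complete; keep the tail.
        *complete_lines, buffer = buffer.split("\n")
        for line in complete_lines:
            if m := NUMBERED_ITEM.match(line):
                workers.append(asyncio.create_task(dispatch_worker(m.group(1))))
    if m := NUMBERED_ITEM.match(buffer):     # the final item may lack a newline
        workers.append(asyncio.create_task(dispatch_worker(m.group(1))))
    return await asyncio.gather(*workers)
```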

It gets harder when the downstream step needs to reason globally about the upstream output. A summarizer that picks the three most important items from a list of ten cannot start until it has seen all ten. A coder that writes a function based on a spec cannot start before the spec is complete. For those cases, streaming saves the user-visible latency for the first step but does not pipeline anything underneath.

The pattern worth borrowing from systems work: treat each agent step as having a "minimum prefix" that downstream consumers depend on. If a downstream consumer can run with a prefix, run it on the prefix. If not, do not pretend that streaming helps; it just gives the user something to look at.

Streaming structured output

Most production agents output some kind of structured data. JSON, YAML, function arguments, structured tool calls. The naive approach is to wait for the whole blob, parse it, and act. With streaming, you can do better, but partial JSON is its own problem.

Partial JSON is not valid JSON. {"name": "ali is not parseable. There are a few approaches that work in practice.

The first is a partial JSON parser that builds up the tree as tokens arrive and exposes the latest valid prefix. Libraries like partial-json for TypeScript and the equivalent in Python implement this. When you ask for the parsed object, you get the deepest interpretable structure: missing keys are omitted, unterminated strings are surfaced as-is. You can poll this on every delta and decide whether enough of the structure is present to act.
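
A minimal hand-rolled version of the idea, as a stand-in for whichever partial-JSON library you use (this is not any particular package's API): track open strings and brackets, append the closers, and parse.

```python
import json
from typing import Any, Optional

def parse_partial_json(buf: str) -> Optional[Any]:
    """Best-effort parse of an incomplete JSON prefix.

    Close any open string, then any open objects/arrays, and parse. Returns
    None when even the repaired prefix is not valid JSON (e.g. the buffer
    ends right after a colon or a comma).
    """
    stack: list[str] = []          # open '{' / '[' characters, innermost last
    in_string = escaped = False
    for ch in buf:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append(ch)
        elif ch in "}]":
            if stack:
                stack.pop()
    repaired = buf + ('"' if in_string else "")
    repaired += "".join("}" if c == "{" else "]" for c in reversed(stack))
    try:
        return json.loads(repaired)
    except json.JSONDecodeError:
        return None

# Polled on each delta: '{"name": "ali' -> {'name': 'ali'};  '{"name":' -> None
```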

The second is constrained decoding at the model level. If you have control over the inference stack, you can constrain the model to emit valid JSON token by token, with grammar enforcement (xgrammar, llguidance, outlines, lm-format-enforcer). At every step, the output is well-formed, which means the partial JSON parser does not have to handle most edge cases. This also tends to be faster, since the model is not wasting tokens on syntax recovery.
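
As a sketch of what that setup can look like from the client side, assuming a server that accepts a JSON schema on the request: the response_format shape below follows OpenAI's structured outputs, and servers built on the grammar backends above often take the schema through a similar or a server-specific extra parameter.

```python
from openai import OpenAI

client = OpenAI()

SEARCH_ARGS_SCHEMA = {
    "type": "object",
    "properties": {
        "query":   {"type": "string"},
        "filters": {"type": "array", "items": {"type": "string"}},
        "limit":   {"type": "integer"},
    },
    "required": ["query", "filters", "limit"],
    "additionalProperties": False,
}

stream = client.chat.completions.create(
    model="MODEL_NAME",   # placeholder
    messages=[{"role": "user", "content": "Build search arguments for: fastest open source LLM"}],
    stream=True,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "search_args", "schema": SEARCH_ARGS_SCHEMA, "strict": True},
    },
)

# Every streamed prefix is now a prefix of schema-valid JSON, so the partial
# parser above never has to recover from malformed syntax.
buf = ""
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        buf += chunk.choices[0].delta.content
```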

The third is to stream the keys in a known order, so the consumer can rely on positional structure. If your function signature guarantees that query is always emitted before filters and filters before limit, your consumer can act on query the moment the next key appears, without rebuilding the object from scratch.

The combination of constrained decoding and a partial parser is the production-grade choice. It is rarely the default; you have to set it up.

Cancellation and correction mid-stream

Streaming makes it possible to stop a generation that is going wrong. In chat, this is the user smashing the stop button. In an agent, the orchestrator can do the same thing, automatically, when a partial result is evidence the model is heading off the rails.

A few examples where this is worth doing:

  • The model emits a tool name that does not exist. Stop, do not let it finish hallucinating arguments. Retry with a tool-listing hint.
  • The model starts emitting a long reasoning chain in a step that was supposed to be a one-shot answer. Stop, retry with a stricter system prompt.
  • The model's confidence proxy (token logprobs, a classifier on the partial output, a small validator model) drops below a threshold. Stop, escalate to a larger model.

This is only useful if cancellation is cheap. With most inference servers, you cancel by closing the SSE connection, and the server stops generating shortly after. The exact behavior depends on the server: vLLM and SGLang both honor client disconnects in recent versions, but the latency from disconnect to actual stop varies. Production agents with cancellation logic should measure this on their own stack rather than trusting docs.
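
A sketch of the orchestrator side, reusing the chunk shape from earlier. looks_wrong is a hypothetical, cheap check over the accumulated prefix (an unknown tool name, a runaway reasoning chain, a logprob or classifier score below threshold), and exactly how quickly the server stops after the connection closes is the thing to measure on your stack.

```python
from openai import OpenAI

client = OpenAI()

def generate_with_cutoff(messages: list[dict], looks_wrong) -> str | None:
    """Stream a completion, but abandon it once the partial output fails a check."""
    buf = ""
    with client.chat.completions.create(
        model="MODEL_NAME",   # placeholder
        messages=messages,
        stream=True,
    ) as stream:
        for chunk in stream:
            if not chunk.choices or not chunk.choices[0].delta.content:
                continue
            buf += chunk.choices[0].delta.content
            if looks_wrong(buf):
                # Leaving the with-block closes the connection; the server
                # should stop generating shortly after.
                return None
    return buf
```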

Cancellation also matters for cost. A model that is generating 2000 tokens of reasoning before noticing it has the wrong tool is wasting both wall time and money. An orchestrator that watches the stream and cancels at token 200 saves both.

Streaming progress to humans

A long-running agent that takes 30 seconds to a few minutes per task has a UX problem that chat does not. The user is staring at a spinner. They do not know if the agent is making progress, stuck in a loop, or about to produce something useful. Streaming the agent's internal state to the user is a partial fix.

The pattern that has emerged in coding agents and research agents is to surface a structured progress event for each agent step: which tool is being called, what the current plan looks like, what files have been touched. This is not the same as token streaming; the events are higher-level. But they typically piggyback on the same underlying connection, and they are only possible because the model is streaming its decisions, not batching them into one final answer.
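
The event shape does not need to be elaborate. A sketch of one, with fields that are illustrative rather than any particular framework's schema:

```python
import json
import time
from dataclasses import dataclass, asdict
from typing import Literal, Optional

@dataclass
class ProgressEvent:
    """Higher-level progress, emitted alongside (not instead of) token deltas."""
    kind: Literal["step_started", "tool_called", "step_finished"]
    step: int
    total_steps: Optional[int] = None
    tool: Optional[str] = None        # e.g. "web_search" when kind == "tool_called"
    detail: Optional[str] = None      # short human-readable description
    ts: float = 0.0

def emit(event: ProgressEvent) -> None:
    """Send the event over whatever channel the UI already listens on (here: stdout)."""
    event.ts = event.ts or time.time()
    print(json.dumps(asdict(event)), flush=True)

emit(ProgressEvent(kind="step_started", step=3, total_steps=7, detail="searching docs"))
emit(ProgressEvent(kind="tool_called", step=3, tool="web_search", detail="fastest open source LLM"))
```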

If you are building an agent UI, the question to ask is not "should I stream tokens?" but "what is the smallest useful unit of progress I can show the user?" Sometimes that is a token. Sometimes it is a tool name. Sometimes it is "step 3 of 7 complete." A mix of all three, with the right one chosen for the right step, is what feels responsive.

Where inference speed matters most

Faster per-token inference helps streaming agents in three different places.

First, it shortens the absolute time from the start of a step to the first useful partial result. If your agent uses partial JSON parsing to start a downstream tool early, faster decoding means that downstream tool starts earlier in wall time.

Second, it makes cancellation cheaper. If a step that turns out to be wrong takes 500 ms instead of 5 seconds, the cost of a cancelled generation drops by an order of magnitude. Cancellation-based retry strategies are only viable when retries are fast.

Third, it changes the design space of multi-step agents. When each LLM call is a fraction of a second, you can afford more steps, more tool calls, more validation passes, all happening with streaming pipelines between them. The agent stops looking like a sequence of slow blocking calls and starts looking like a real pipeline, with each stage running concurrently with the next.

This is the angle that matters for production work. Streaming is not a UI trick to make a chat product feel faster. It is the substrate that makes agentic pipelines compose well, and the faster the underlying inference, the more aggressive your pipeline design can be.

If you are running agents on a stack that streams tokens at hundreds or low thousands per second, the patterns in this post are options. On a stack that runs at tens of tokens per second, most of them collapse back to "wait for the model to finish." That is the part worth measuring before you design your agent around streaming.

General Compute serves open models with very high tokens-per-second on an OpenAI-compatible API, including streaming tool call deltas. If you are building an agent and want to test what your pipeline looks like when streaming is fast enough to pipeline against, the API is at generalcompute.com.
