ReAct, Reflexion, and Chain-of-Thought: The Inference Cost of Reasoning Patterns
If you read enough agent papers, you start to notice a pattern. Each one introduces a "method" with a name (ReAct, Reflexion, Tree-of-Thought, Self-Refine, Chain-of-Verification), a clever prompt template, and a benchmark table showing the new method beats the old method by a few points. The methods are described as prompting techniques. The benchmarks rarely report wall-clock latency, and almost never report total tokens generated per task.
In production, those numbers are the whole story. A reasoning pattern is not really a prompt template. It is a recipe for how many model calls a single user request will fan out into, how long the chain of dependencies between those calls runs, and how many tokens get generated on the way to the answer the user actually sees. The differences between Chain-of-Thought, ReAct, and Reflexion are not subtle from an infrastructure perspective. They are 2x, 5x, and 20x.
This post walks through what each of these patterns actually does at the inference layer, where the cost goes, and why the choice of reasoning pattern is one of the highest-leverage architectural decisions in any agent system.
Chain-of-Thought: a single call with extra tokens
Chain-of-Thought is the simplest of the three, and the cheapest. The prompt asks the model to "think step by step" before answering, and the model produces a stretch of intermediate reasoning followed by a final answer. There is one inference call. The user sees one response. The only cost over a non-CoT call is the extra output tokens for the reasoning trace.
That extra cost is real but bounded. A typical CoT trace adds 100 to 500 output tokens to a response that might otherwise have been 50 tokens. On a model that decodes at 100 tokens per second, that translates to one to five extra seconds of latency. On a model that decodes at 1,000 tokens per second, the user barely notices.
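As a back-of-the-envelope check, here is the same arithmetic as a few lines of Python. The decode rates are the illustrative figures above, not measurements of any particular model.

```python
# Extra latency from a CoT trace, assuming generation is decode-bound.
def cot_overhead_seconds(extra_reasoning_tokens: int, decode_tok_per_s: float) -> float:
    return extra_reasoning_tokens / decode_tok_per_s

for rate in (100, 1_000):
    low, high = cot_overhead_seconds(100, rate), cot_overhead_seconds(500, rate)
    print(f"{rate} tok/s: +{low:.1f}s to +{high:.1f}s")
# 100 tok/s: +1.0s to +5.0s
# 1000 tok/s: +0.1s to +0.5s
```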
Two things make CoT work well in production. First, the entire reasoning happens in a single decode pass, which means the KV cache is reused across the whole trace and the model never has to re-read its own intermediate output. Second, you can stream the output to the user, so even though the total response is longer, time-to-first-token is unchanged. If your UI is set up to show "thinking" indicators while the model produces its trace, the perceived latency can actually be better than a terse non-CoT response, because the user sees progress.
The trap with CoT is that it is so cheap that teams stop noticing the extra tokens. A 200-token reasoning trace on every customer-facing call adds up to real money at scale, especially if the trace is mostly boilerplate ("Let me think about this carefully. The user is asking about..."). Periodic audits of how much of your output token spend goes to reasoning versus the answer are worthwhile.
ReAct: interleaving thought, action, and observation
ReAct (Yao et al., 2022) extends Chain-of-Thought by interleaving reasoning with tool use. The model alternates between Thought, Action, and Observation steps. It produces a thought, decides on an action (a tool call), the tool runs, the observation gets fed back into the prompt, and the model produces the next thought. This continues until the model decides it has enough information to answer.
At the inference layer, this pattern is fundamentally different from CoT. Each Thought-Action cycle is a separate model call. The model emits a Thought and an Action, generation halts at the Action token boundary, the tool runs, and then a new prompt is constructed that includes the original prompt plus all prior Thoughts, Actions, and Observations. That new prompt gets fed back into the model for the next step.
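A minimal sketch of that loop makes the structure concrete. The callables below are placeholders for whatever model client, output parser, and tool executor you actually use; this is not any specific framework's API.

```python
from typing import Callable, Optional

def react_loop(
    task: str,
    call_model: Callable[[str], str],              # placeholder: prompt -> Thought + Action text
    parse_action: Callable[[str], Optional[str]],  # placeholder: returns None on a final answer
    run_tool: Callable[[str], str],                # placeholder: executes the Action, returns an Observation
    max_steps: int = 5,
) -> str:
    # The transcript of Thoughts, Actions, and Observations grows every round
    # and is re-sent to the model in full on each call.
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        completion = call_model(transcript)        # a fresh prefill over the whole transcript
        transcript += completion
        action = parse_action(completion)
        if action is None:                         # the model decided it can answer
            return completion
        observation = run_tool(action)             # tool latency stacks on top of model latency
        transcript += f"\nObservation: {observation}\n"
    return transcript
```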
The cost structure has three components.
First, the prompt grows linearly with the number of steps. Each round adds the prior Thought, the Action call, and the Observation (which is often the largest piece, especially if the tool returns search results or document chunks). After five steps, the prompt might be 10,000 tokens longer than it started. By step ten, the input cost dominates the output cost by a wide margin.
Second, every step pays a fresh time-to-first-token. The model has to prefill the entire growing prompt on each round. Prefix caching helps if the inference stack supports it well and the framework constructs prompts in a stable order, but a non-trivial fraction of agent frameworks build prompts in ways that defeat caching (timestamps in system prompts, randomized example orderings, dynamically reordered tool descriptions). When caching fails, each step pays full prefill latency on a longer and longer prompt.
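To illustrate the caching point, here is a toy sketch of prompt construction. The helper names are made up; the ordering principle is the part that matters.

```python
def cache_hostile_prompt(system: str, tools: str, user_msg: str, now: str) -> str:
    # Volatile content (a timestamp) comes first, so no two requests share a
    # prefix and the prefix cache never hits.
    return f"Current time: {now}\n{system}\n{tools}\n{user_msg}"

def cache_friendly_prompt(system: str, tools: str, user_msg: str, now: str) -> str:
    # Stable content first, volatile content last: the prefill of the system
    # prompt and tool descriptions can be reused across requests and across
    # steps of the same ReAct loop.
    return f"{system}\n{tools}\nCurrent time: {now}\n{user_msg}"
```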
Third, the steps are sequential by construction. ReAct does not parallelize. The Action in step N depends on the Observation in step N-1, which depends on the Action in step N-1. The total wall-clock time is the sum of every TTFT and every decode in the chain. A five-step ReAct loop on a model with 800ms TTFT and 100 tokens per second of decode, where each step generates about 80 tokens, takes roughly 5 * (0.8 + 0.8) = 8 seconds of pure model time, plus whatever the tools themselves take.
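Pulling those three components into one rough cost model makes the scaling visible. All the constants below are the illustrative numbers from this section, not benchmarks, and the model treats time-to-first-token as flat even though real prefill time grows with the prompt.

```python
def react_cost(steps: int,
               base_prompt_tokens: int = 1_500,     # system prompt + tool descriptions + task
               tokens_added_per_step: int = 1_000,  # prior Thought + Action + Observation
               output_tokens_per_step: int = 80,
               ttft_s: float = 0.8,
               decode_tok_per_s: float = 100.0):
    total_input_tokens, wall_clock_s = 0, 0.0
    prompt_tokens = base_prompt_tokens
    for _ in range(steps):
        total_input_tokens += prompt_tokens          # each call re-reads everything so far
        wall_clock_s += ttft_s + output_tokens_per_step / decode_tok_per_s
        prompt_tokens += tokens_added_per_step
    return total_input_tokens, steps * output_tokens_per_step, round(wall_clock_s, 2)

print(react_cost(5))   # (17500, 400, 8.0) -- input tokens dwarf output tokens, 8s of model time
```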
That is the gap between a paper's "ReAct improves on CoT by 4 points" and the production experience of "ReAct makes our agent feel sluggish." The benchmarks measure accuracy. The user feels the latency stack.
Reflexion: ReAct with retries
Reflexion (Shinn et al., 2023) adds a self-improvement loop on top of ReAct. After an attempt fails (or scores poorly on some self-evaluation), the agent reflects on what went wrong, produces a written critique of its own behavior, and tries again with the critique loaded into context. Some variants run several attempts and pick the best.
This is where the cost numbers stop being polite. A Reflexion agent that runs three attempts of a five-step ReAct chain, with a self-critique inference call between each attempt, is doing roughly 3 * 5 + 2 = 17 model calls for a single user task, and each ReAct chain inside the loop is paying the linear prompt growth described above. The critique step itself is often expensive because it has to read the entire failed trajectory, which by attempt three can be 20,000 tokens or more.
The latency is even worse than the token count suggests, because the attempts are sequential. You cannot critique a trajectory until it has finished. You cannot start the next attempt until the critique is done. A Reflexion run with three attempts and a five-step ReAct inner loop, on the same hardware as the example above, takes around 30 seconds of pure model time for the ReAct portions (the later attempts run slower than the first because they carry the failed trajectories and critiques in their prompts) and another 5 to 10 seconds for the critique steps. That is roughly forty seconds before the user sees a final answer.
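The call count and the sequencing are easy to see in a few lines; the figures are the same illustrative ones as above.

```python
attempts, steps_per_attempt = 3, 5
critique_calls = attempts - 1                    # one self-critique between attempts

model_calls = attempts * steps_per_attempt + critique_calls
print(model_calls)                               # 17 model calls for one user task

# Nothing overlaps: an attempt must finish before it can be critiqued, and the
# critique must finish before the next attempt starts, so latency is the sum of
# every attempt plus every critique.
```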
Reflexion was developed for benchmark settings where you can afford to run many attempts and pick the best. It is genuinely useful in domains like code generation where you can run a unit test between attempts and bail out as soon as one passes. It is brutal in domains where you have to run all attempts to evaluate them, or where the self-critique is itself unreliable and the agent talks itself out of correct answers.
Tree-of-Thought and friends: branching the cost
Tree-of-Thought (Yao et al., 2023) generalizes Chain-of-Thought into a search over reasoning steps. At each level, the model produces several candidate thoughts, an evaluator scores them, and the search expands the most promising branches. Variants like Graph-of-Thought, Algorithm-of-Thought, and Self-Consistency CoT are similar in shape.
The inference cost for ToT grows with the branching factor raised to the depth, multiplied by the per-step cost. With a branching factor of three, a depth of four, and an evaluator call for each expanded node, you are looking at 3^4 = 81 leaf nodes plus 40 internal evaluator calls (1 + 3 + 9 + 27) in the worst case, even before pruning. Real implementations prune aggressively, but even a moderately aggressive search produces 20 to 30 model calls per task.
The good news for ToT is that the branches are independent within a level. With sufficient inference capacity, you can run all three children of a node in parallel. The bad news is that the evaluator step is a synchronization barrier: every branch at level N has to finish before the evaluator picks which to expand at level N+1. So even with parallelism, the wall-clock latency is the depth times the per-step latency, not just the per-step latency.
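The node counts and the latency floor follow directly from the branching factor and depth; a quick sketch, using the same example numbers:

```python
branching, depth = 3, 4

leaves = branching ** depth                                         # 81 leaf nodes
internal_nodes = sum(branching ** level for level in range(depth))  # 1 + 3 + 9 + 27 = 40
print(leaves, internal_nodes)

# Branches within a level can run in parallel, but the evaluator is a
# synchronization barrier, so wall-clock latency is at least
# depth * (per-step latency + evaluator latency) even with unlimited capacity.
```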
In practice, ToT tends to live in research papers rather than production agent stacks, because the cost structure is hard to justify outside of problems where the search tree genuinely matters (theorem proving, certain planning tasks). For most agent workloads, the gain from search-based reasoning is smaller than the gain from making any one inference call faster.
Self-Consistency: the cheap parallel cousin
Self-Consistency CoT (Wang et al., 2022) takes a different approach to using extra compute. Instead of adding sequential steps, it runs the same Chain-of-Thought prompt N times in parallel with sampling, and picks the answer that appears most often (majority vote). Five samples is typical.
The interesting property of Self-Consistency from an inference perspective is that all five calls are independent. They share the same input prompt (so prefix caching is effective if the serving stack supports it), they run fully in parallel, and the only synchronization is the final vote. Wall-clock latency is roughly the same as a single CoT call, plus a small overhead for the vote. The cost is 5x the tokens, but the user-visible latency is barely affected.
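A minimal sketch of the pattern, assuming an async model client; `sample_completion` and `extract_final_answer` are placeholders for whatever client and answer parser you actually use.

```python
import asyncio
from collections import Counter
from typing import Awaitable, Callable

async def self_consistency(
    prompt: str,
    sample_completion: Callable[[str], Awaitable[str]],  # placeholder: one sampled CoT completion
    extract_final_answer: Callable[[str], str],          # placeholder: pulls the answer out of the trace
    n: int = 5,
) -> str:
    # All n calls share the same input prompt, so a prefix-cache-aware server pays
    # prefill once; the calls run concurrently, so wall-clock time is roughly one
    # CoT call rather than n of them.
    completions = await asyncio.gather(*(sample_completion(prompt) for _ in range(n)))
    answers = [extract_final_answer(c) for c in completions]
    return Counter(answers).most_common(1)[0][0]          # majority vote
```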
This makes Self-Consistency one of the few reasoning patterns that scales cleanly with inference capacity. If your serving infrastructure can handle the parallel calls without queueing, you get the accuracy bump without the latency penalty. If your provider's headroom is tight, the parallel calls serialize and you get full sequential cost.
Choosing a pattern is choosing an inference profile
The choice between these patterns is usually framed as a quality decision: which one gives the best accuracy on your benchmark? The infrastructure framing is at least as important.
Chain-of-Thought is essentially free in latency terms (especially with streaming) and adds bounded token cost. It should be the default for any task where you do not need tools.
ReAct is the standard for tool-using agents, but it pays linear prompt growth and sequential latency. Unless your tools are unusually slow, they are not where the latency comes from; the sequential model calls themselves are. Faster inference, better prefix caching, and shorter system prompts all directly reduce ReAct latency.
Reflexion and other retry-based loops should be used selectively, in domains where the retry actually helps and the self-critique is reliable. The cost is high enough that you want it gated by an external signal (a failing test, a low confidence score, a user-visible error) rather than run by default.
Tree-of-Thought and other search-based patterns are expensive enough that they only make sense for problems where search structure is intrinsic to the task.
Self-Consistency is the cheapest way to spend more compute for more accuracy if you have the parallel inference capacity to support it.
The common thread across all of these is that the reasoning pattern and the underlying inference layer are doing the same thing: trading compute for quality. The pattern decides how the compute is shaped (sequential versus parallel, long context versus many calls, structured output versus free-form generation), and the inference layer decides how fast each unit of compute actually runs.
When teams are unhappy with how their agent feels in production, the conversation tends to start at the prompt level, then move to the framework level, and only later get to the model and the inference layer. In our experience, the inference layer is usually where the largest single improvement is available, especially for ReAct-style patterns where sequential model calls dominate the latency budget. If you want to see how that plays out on your own agent traces, the OpenAI-compatible API is at generalcompute.com.