AI Agents & Agentic AI

Tool calling, multi-agent architectures, reasoning patterns, and the inference requirements behind production agents.

agents, streaming, inference, latency, ux, tool-calling

Streaming for Agents: Why Partial Results Change the UX

Streaming in agentic pipelines is not the same as streaming chat tokens. Partial tool calls, pipelined steps, and early cancellation change what the user experiences.

General Compute·May 16, 2026

agents, tool-calling, parallelism, inference, latency

Parallel Tool Execution: How Fast Inference Enables Concurrent Agent Actions

Why running multiple tool calls in parallel changes the latency math of an agent, and how inference speed determines whether the parallelism is worth doing.

General Compute·May 15, 2026

agents, memory, rag, kv-cache, inference, latency

Agent Memory Systems: Balancing Context Length vs Retrieval Latency

How agents reconstruct memory between turns, and the latency trade-offs between long context, RAG, summarization, and KV cache reuse.

General Compute·May 12, 2026

coding-agents, inference, latency, developer-tools, agents

Building a Code Agent: Why Each Step Needs Sub-Second Inference

A practical breakdown of the latency budget inside a code agent, step by step, and why every link in the chain needs to land under a second to keep the loop usable.

General Compute·May 11, 2026

agents, reasoning, react, reflexion, chain-of-thought, inference, latency

ReAct, Reflexion, and Chain-of-Thought: The Inference Cost of Reasoning Patterns

Popular agent reasoning patterns are described as prompt techniques, but they are inference cost multipliers. Here is how ReAct, Reflexion, and Chain-of-Thought actually shape the bill and the latency.

General Compute·May 10, 2026

agents, multi-agent, inference, latency, cost

Multi-Agent Architectures and the Inference Cost Explosion

Orchestrator and worker patterns make multi-agent systems easy to design and expensive to run. Here is where the inference cost actually goes, and what it means for the infrastructure underneath.

General Compute·May 9, 2026

agents, tool-calling, inference, latency

Tool Calling Latency: The Bottleneck No One Talks About

Function calling looks simple on paper, but the latency budget of a tool-using LLM is dominated by short structured generations that most serving stacks are not optimized for. This is what actually makes tool calls feel slow.

General Compute·May 8, 2026

agents, inference, latency

The Agentic Inference Tax: Why Agents Need 10x Faster Models

Agents make many sequential LLM calls per task, and each one pays the full latency of decoding. This post walks through how that compounds and why fast inference changes which agents are even viable.

General Compute·May 7, 2026