agents · memory · rag · kv-cache · inference · latency

Agent Memory Systems: Balancing Context Length vs Retrieval Latency

General Compute

An agent has no memory between turns. Each call to the model is stateless. Whatever the agent "remembers" about a session, a user, or its own past actions has to be reconstructed and placed into the prompt every time the model runs. How you reconstruct that memory is the agent memory system, and the design space has real latency consequences.

The three dominant approaches are long-context (put everything back into the prompt), retrieval-augmented generation (store memory in an index and pull out the relevant slices), and summarization (compress history into shorter blobs the model can reread). Each one trades a different axis: prefill cost, retrieval latency, information loss, and complexity. The right choice depends on the access pattern of the agent, not on which technique is currently fashionable.

This post walks through the four approaches that show up in real systems (the three above, plus treating the KV cache itself as memory), attaches realistic latency numbers to each, and explains the hybrid pattern most production agents end up using.

What "memory" actually means in an agent

The word memory in this context covers three distinct things, and the literature often blurs them.

Conversation memory is what the agent has said and done in this session: prior tool calls, prior responses, the user's recent messages. It grows monotonically until something compacts it.

User memory is what the agent knows about the person it is helping. Preferences, prior projects, name, tone, things they have asked before across sessions. This is durable and typically lives in a database.

Working memory is what the agent has temporarily loaded for the current task: the file it is editing, the documentation page it pulled, the API spec it needs to follow. This is short-lived and per-task.

A memory system has to handle all three, and each one has a different latency profile. Conversation memory wants fast prefill caching. User memory wants fast vector retrieval. Working memory wants fast tool calls. A single design that ignores the difference will get one of them wrong.

Approach 1: Long context

The simplest approach is to send everything back to the model on every call. The conversation history, the relevant files, the user profile, all of it. Modern models with 128K or 200K context windows can technically hold a lot.

The cost is prefill. A 50,000 token context, processed cold, takes between 1.5 and 4 seconds depending on the backend, the model size, and whether prefill is chunked. If your agent runs 20 model calls in a task, paying full prefill on each is enough to ruin the experience by itself.

This is where prefix caching helps. If the front of the context is stable across calls (system prompt, fixed memory blocks, the conversation up to turn N), a serving stack with prefix caching only pays prefill cost on the new portion. Done well, this turns a 50K token prefill into a 2K token prefill, with the corresponding latency reduction.
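The practical implication is an ordering rule: stable blocks first, volatile blocks last. A minimal sketch of that rule follows; assemble_prompt is a hypothetical helper, not any particular framework's API.

```python
def assemble_prompt(system_prompt, memory_blocks, history, volatile_items):
    """Keep the stable prefix byte-identical across calls.

    Anything that changes every turn (fresh retrieval results, the newest
    user message) goes at the end, so a prefix-caching backend only pays
    prefill on the tail instead of the whole context.
    """
    stable = [system_prompt, *memory_blocks, *history]  # unchanged since the last call
    return "\n\n".join(stable + list(volatile_items))   # volatile tail last
```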

The catch is that not every backend implements prefix caching well. Some only cache the system prompt. Some invalidate the cache on small differences. Some claim to cache but show a small fraction of the theoretical speedup. If your agent strategy depends on prefix caching, test it specifically with your prompt structure rather than trusting the marketing copy.
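One way to run that test, sketched against an OpenAI-compatible endpoint (the base URL, key, and model name below are placeholders): send the same long prefix twice with different suffixes and compare time-to-first-token. If the second call is not clearly faster, the cache is not hitting for your prompt shape.

```python
import time
from openai import OpenAI

client = OpenAI(base_url="https://your-endpoint/v1", api_key="YOUR_KEY")  # placeholders

def ttft(prefix: str, suffix: str, model: str = "your-model") -> float:
    """Seconds from request start to the first streamed chunk."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prefix + suffix}],
        stream=True,
        max_tokens=16,
    )
    for _ in stream:  # first chunk arrives -> measurement done
        break
    return time.perf_counter() - start

prefix = open("stable_prefix.txt").read()     # your real system prompt + memory block
cold = ttft(prefix, "\n\nQuestion A")         # populates the cache, if the backend has one
warm = ttft(prefix, "\n\nQuestion B")         # should be much faster if caching is working
print(f"cold TTFT {cold:.2f}s, warm TTFT {warm:.2f}s")
```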

Long context also has a quality ceiling. Models attend to the middle of long contexts less reliably than to the ends. The "lost in the middle" finding from Liu et al. is well replicated and shows up in agent workloads as silently degraded recall. If your agent's reasoning depends on a fact buried at position 30,000 of a 60,000 token context, it may not reliably use that fact.

Approach 2: Retrieval-augmented memory

The RAG approach treats memory as an external store. You embed pieces of memory (past conversation turns, documents, user facts) into a vector database. At each turn, you query the database for the K most relevant items and inject them into the prompt. The model sees a much shorter context because only the relevant slices are present.
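The per-turn mechanics are simple even if the tuning is not. A toy sketch, assuming an embed() callable you provide; real systems keep the vectors in a database, but the shape is the same.

```python
import numpy as np

class MemoryIndex:
    """Toy in-memory vector index over memory items."""

    def __init__(self, embed):
        self.embed = embed            # callable str -> np.ndarray, assumed provided
        self.items, self.vectors = [], []

    def add(self, text: str):
        self.items.append(text)
        self.vectors.append(self.embed(text))

    def top_k(self, query: str, k: int = 5):
        q = self.embed(query)
        sims = [float(v @ q) / (np.linalg.norm(v) * np.linalg.norm(q) + 1e-9)
                for v in self.vectors]
        best = np.argsort(sims)[-k:][::-1]
        return [self.items[i] for i in best]

# Per turn: pull the k most relevant slices and inject them into the prompt.
# relevant = index.top_k(user_message, k=5)
# prompt   = system_prompt + "\n\nRelevant memory:\n" + "\n".join(relevant) + "\n\n" + user_message
```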

This sounds clean. In practice, two latencies matter.

The first is retrieval itself. A well-tuned vector search over a few million items returns in 20 to 80 milliseconds. A poorly tuned one, or one that uses a heavy cross-encoder reranker, can take 300 to 500 milliseconds. Add network round trips and the overhead of constructing the query embedding (another model call, usually) and you can spend 200 to 800 milliseconds on retrieval before the main model has even started.

The second is the lost prefix cache. Because the retrieved chunks change between calls, the prompt structure changes, which busts the cache. You save on total prefill tokens but pay prefill cost on what is left. If the retrieved memory adds 8K tokens of fresh prefill on every turn, you have not saved as much as you think compared to a long context with prefix caching.
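Both latencies are worth instrumenting separately rather than as one "retrieval" number, since each stage has its own fix. A sketch, with embed(), search(), and rerank() standing in for whatever your own retrieval stack uses:

```python
import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    out = fn(*args)
    return out, (time.perf_counter() - start) * 1000

# embed(), search(), rerank(), and user_message are stand-ins for your own stack
q_vec, t_embed = timed(embed, user_message)           # usually its own model call
hits, t_search = timed(search, q_vec, 50)             # ANN search over the index
top, t_rerank  = timed(rerank, user_message, hits)    # optional cross-encoder pass
print(f"embed {t_embed:.0f} ms, search {t_search:.0f} ms, rerank {t_rerank:.0f} ms")
```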

RAG also has a quality problem that is specific to agents: relevance is measured against the current query, but agents need information that may be relevant to the next query, several steps ahead. A retrieval system tuned for single-turn QA will under-retrieve for an agent doing multi-step reasoning. Tuning retrieval for agent workloads is its own field, and the latency cost grows fast when you add reranking, hybrid retrieval, or multi-query expansion.

Approach 3: Summarization and compaction

The third approach periodically compresses the agent's history into shorter summary blocks. Once the conversation gets past some token threshold, an asynchronous job (or an inline call) summarizes the oldest turns and replaces them with a summary. The agent sees a stable context that grows slowly even when the underlying conversation runs for hours.

The latency profile here is interesting. Each summarization call is not free: it is itself a model call, usually with a long input and a moderate output. Done synchronously, it adds 2 to 4 seconds at the point of compaction. Done asynchronously, it adds nothing to the immediate response but creates a queue you have to manage and a window where the agent has both the raw history and the pending summary in memory.

Summarization is lossy. The summary contains less than the original, by design. The art is in choosing what to preserve. Most production systems preserve tool call results, decisions, and user statements, while compressing reasoning chains. This works most of the time, and when it does not the failure is quiet: if a later step needed a specific fact that was discarded, the agent fails in a confusing way.
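A sketch of threshold-triggered compaction along these lines; count_tokens and summarize are assumed helpers, and summarize is itself a model call whose prompt encodes the preservation policy described above.

```python
COMPACT_THRESHOLD = 60_000   # start compacting past this many tokens of history
KEEP_RECENT = 10             # the most recent turns stay verbatim

def maybe_compact(turns, count_tokens, summarize):
    """Replace the oldest turns with a summary block once the history is too large."""
    if sum(count_tokens(t) for t in turns) < COMPACT_THRESHOLD:
        return turns
    old, recent = turns[:-KEEP_RECENT], turns[-KEEP_RECENT:]
    # summarize() is a model call: keep tool results, decisions, and user
    # statements; compress reasoning chains. Done inline, expect 2 to 4 s here.
    summary = summarize(old)
    return [{"role": "system", "content": f"Summary of earlier turns:\n{summary}"}, *recent]
```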

For agents running over long sessions, summarization is unavoidable. Context windows do not scale faster than user expectations for session length. The question is not whether to compact, but when and how aggressively, and which other approaches to combine it with.

Approach 4: KV cache as memory

A less-discussed option treats the KV cache itself as the memory medium. The prefill cost is the price you pay to load memory into the model's working state. If you can keep the KV cache resident, subsequent calls can skip prefill on the cached portion entirely.

This is what prefix caching does within a single conversation, but the same mechanism can extend further. Some inference stacks (SGLang's RadixAttention, vLLM's prefix caching, certain custom serving paths) maintain cross-request KV caches. If two requests share a prefix, the second reuses the first one's cache. For agent workloads where many sessions share a long system prompt and a memory block, this is a five to ten times speedup on TTFT for steps after the first.
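As one concrete example (not the only one), vLLM exposes this as automatic prefix caching. The sketch below uses the flag name from recent vLLM releases; newer versions may enable it by default, so check your deployment rather than taking this as definitive.

```python
from vllm import LLM, SamplingParams

SYSTEM_PROMPT = open("system_prompt.txt").read()   # identical across sessions
MEMORY_BLOCK = open("memory_block.txt").read()     # identical across sessions

# Automatic prefix caching: requests that share a prefix reuse its KV cache.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_prefix_caching=True)
params = SamplingParams(max_tokens=256)

shared_prefix = SYSTEM_PROMPT + "\n\n" + MEMORY_BLOCK

# The first request pays full prefill on the shared prefix; later requests
# with the same prefix only process their own suffix.
out_a = llm.generate([shared_prefix + "\n\nUser A's question"], params)
out_b = llm.generate([shared_prefix + "\n\nUser B's question"], params)
```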

The cost is memory pressure on the inference backend. KV caches are not small. A single 50K token cache on a 70B-class model can occupy several gigabytes or more of GPU memory, depending on KV precision and how the attention heads are grouped. Keeping many of them resident requires either a lot of headroom or a smart eviction strategy. Most public inference providers do not expose this level of control, so you cannot decide which sessions stay warm. Custom inference stacks can.
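The footprint is easy to estimate for your own model. A back-of-envelope sketch, assuming a grouped-query-attention 70B-class configuration (80 layers, 8 KV heads, head dimension 128) and fp16 KV entries; substitute your model's numbers.

```python
# Per-token KV bytes: 2 tensors (K and V) x layers x kv_heads x head_dim x bytes/element
n_layers, n_kv_heads, head_dim, bytes_per_elem = 80, 8, 128, 2     # 70B-class GQA config, fp16
per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # ~328 KB per token
tokens = 50_000
print(f"{per_token * tokens / 1e9:.1f} GB")   # ~16 GB at fp16; roughly half with an fp8 KV cache
```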

There is also a fragility to KV-cache-as-memory. Any change in the prefix (a single edit to the system prompt, a different memory block, a reordered tool list) invalidates the cache. Production systems that depend on this savings have to be careful about prompt stability, and that constraint propagates into how the memory system can update itself.

The hybrid pattern most agents converge on

Production agent systems usually combine all four approaches.

A short window of recent turns sits in the model's context verbatim. This is the conversation memory layer. Prefix caching handles the repeated work.

Older turns are summarized and stored as compressed blocks. The summary is included in the system prompt up to some token budget. This is the compaction layer.

Durable user memory and project knowledge live in a vector store. The agent retrieves a few relevant chunks per turn. This is the retrieval layer.

The KV cache, where available, holds the stable parts warm across calls. This is the infrastructure layer.

Each piece exists because the others have failure modes. RAG misses things, summaries lose details, long context degrades in the middle, and KV cache costs memory. The combination is more robust than any one piece, and it is also more complex.
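Put together, per-turn assembly might look like the sketch below. Every name here (retrieve, summary_block, recent_turns) is a stand-in for whichever component plays that role in your stack; the property that matters is the ordering, stable layers first and retrieval last, so the prefix cache keeps doing its job.

```python
def build_turn(system_prompt, summary_block, recent_turns, user_message,
               retrieve, k=4, summary_char_budget=8_000):
    """Assemble one agent turn from the four layers.

    system_prompt + summary_block + recent_turns form the stable prefix a
    prefix-caching backend can keep warm; the retrieved chunks and the new
    user message are the volatile tail that pays fresh prefill.
    """
    messages = [
        {"role": "system",
         # character-level budget for brevity; use a token budget in practice
         "content": system_prompt + "\n\n" + summary_block[:summary_char_budget]},
        *recent_turns,                              # verbatim conversation window
    ]
    retrieved = retrieve(user_message, k=k)         # retrieval layer: ~50-200 ms
    context = "Relevant memory:\n" + "\n".join(retrieved)
    messages.append({"role": "user", "content": context + "\n\n" + user_message})
    return messages
```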

The latency budget for a single agent turn under this design looks roughly like this:

| Component | Latency |
|-----------|---------|
| Vector retrieval | 50 to 200 ms |
| Prompt assembly | 5 to 20 ms |
| Prefill on uncached portion | 100 to 400 ms |
| Decode | 1 to 5 s |
| Total | 1.2 to 6 s |

The decode usually dominates the total. If your retrieval system is slow, or your prefix cache is not actually hitting, or your summarization is happening synchronously, the budget breaks at the link that is leaking. Running an agent in production is largely the work of finding and fixing those leaks.

What inference speed does to the design space

Faster inference changes the relative weights of these approaches. With slow inference, summarization is painful because each summarization call is itself slow, so you avoid it and lean on retrieval. With fast inference, you can summarize more aggressively because the cost is lower, and you can also tolerate longer contexts because the prefill is faster.

The same logic applies to the cost side. If a model call costs ten cents in latency-adjusted user value, you avoid extra calls. If it costs one cent, you spend liberally on memory hygiene calls (summarization, reranking, multi-query retrieval) because they make the final answer better. Memory design becomes an optimization problem in dollars per quality point, and the optimum shifts when inference gets cheaper or faster.

This is one of the reasons fast inference matters more for agents than for chat. A chat application has one inference call per user turn, and the user is forgiving. An agent has many calls per task, and the user is watching the wall clock. Speeding up the model by a factor of three does not just speed up the response. It changes what is feasible to do inside the loop.

Picking the right combination for your agent

A few rules of thumb from looking at real systems.

If the agent's sessions are short (under five turns) and the user expects near-instant responses, lean on long context with aggressive prefix caching. Skip RAG until you have evidence you need it. Skip summarization entirely. The complexity is not worth it at that session length.

If the agent runs for many turns per session (a coding agent, a long support conversation, a research assistant), you need compaction. Run summarization asynchronously when you can, synchronously at well-defined breakpoints when you cannot. Budget for two to four seconds of summarization latency every N turns and design the UX around it.

If the agent draws on a large durable knowledge store (documentation, user records, prior conversations across sessions), you need retrieval. Spend the time to tune retrieval for your access pattern, including a reranker if the budget allows. Treat retrieval latency as a first-class metric, not an afterthought.

If the agent serves many users with shared structure (the same system prompt, the same tool definitions, the same boilerplate), invest in a serving stack with cross-request prefix caching. The savings compound across users.

The trap to avoid is adopting all four approaches reflexively because some blog post said agents need them. Each one has a latency cost. Each one adds failure modes. The simplest design that meets your access pattern is the right one, and you can add complexity when you measure that you need it.

Where General Compute fits

We run inference for agent workloads where memory churn is part of the access pattern. Prefix caching, fast prefill on growing contexts, and steady decode throughput are not optional in this setting. They are the things that decide whether the agent's memory layer can be designed for quality or has to be designed around the limits of a slow backend.

If you are building an agent and the memory system is the part that feels expensive or fragile, the inference layer underneath it is doing more to shape the design than people usually credit. Our API is OpenAI-compatible and tuned for agent-shaped workloads. If your latency budget on memory operations is breaking the experience, that is the kind of problem worth bringing to us.
