Multi-Turn Conversations in LLM APIs: Best Practices for Agents

General Compute

Most LLM tutorials show a single-turn request: send a prompt, get a response, done. Production agents don't work that way. A coding assistant, a customer support bot, or an autonomous research agent all maintain conversation state across many turns. That state management is where most production bugs live.

This post covers how multi-turn conversation works at the API level, what goes wrong as conversations grow, and the practical patterns for keeping agents reliable and cost-efficient over long sessions.

How Multi-Turn Works at the API Level

OpenAI-compatible APIs (including General Compute's) represent conversation history as a list of message objects, each with a role and content. You pass this entire list with every request:

    # `client` is assumed to be an OpenAI-compatible client created elsewhere
    messages = [
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "How do I reverse a string in Python?"},
        {"role": "assistant", "content": "You can use slicing: `s[::-1]`."},
        {"role": "user", "content": "What about in JavaScript?"},
    ]

    response = client.chat.completions.create(
        model="qwen3-coder",
        messages=messages
    )

The model has no memory between API calls. You're responsible for assembling and sending the history each time. This is simple when conversations are short, but it means every token in that history list counts against your context limit and shows up in your bill.
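
To make the statelessness concrete, here is a minimal sketch of a single turn. The names `run_turn` and `fake_send` are illustrative, not part of any SDK; `fake_send` stands in for the real API call so the example runs offline. In a real agent, `send` would wrap `client.chat.completions.create` and return the reply text.

```python
from typing import Callable


def run_turn(history: list[dict], user_text: str,
             send: Callable[[list[dict]], str]) -> str:
    """Append the user message, send the full history, record the reply."""
    history.append({"role": "user", "content": user_text})
    reply = send(history)  # the whole list goes over the wire every time
    history.append({"role": "assistant", "content": reply})
    return reply


def fake_send(messages: list[dict]) -> str:
    """Stand-in for the API call: reports how many messages it received."""
    return f"I can see {len(messages)} messages."


history = [{"role": "system", "content": "You are a helpful coding assistant."}]
print(run_turn(history, "How do I reverse a string in Python?", fake_send))
# I can see 2 messages.
print(run_turn(history, "What about in JavaScript?", fake_send))
# I can see 4 messages.
print(len(history))
# 5
```

The second request carries twice as many messages as the first, even though the user typed one new line.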

The token math is straightforward: if you have 40 turns of 200 tokens each, your history alone is 8,000 tokens before you've written a single new prompt. At 100 turns, you're looking at 20,000 tokens of overhead per request, and if that's in a tight loop, costs compound fast.
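
The arithmetic above can be checked in a few lines. This sketch assumes a flat 200 tokens per turn, as in the text, and uses hypothetical helper names. It also shows why costs compound: every request resends the whole history, so the total input tokens billed over a session grow roughly with the square of the number of turns.

```python
def history_tokens(turns: int, tokens_per_turn: int = 200) -> int:
    """Tokens of history resent on a request that follows `turns` turns."""
    return turns * tokens_per_turn


def cumulative_input_tokens(turns: int, tokens_per_turn: int = 200) -> int:
    """Total input tokens billed across a session of `turns` requests.

    Request k resends the k - 1 earlier turns plus the new one, so the
    total is tokens_per_turn * (1 + 2 + ... + turns).
    """
    return tokens_per_turn * turns * (turns + 1) // 2


print(history_tokens(40))            # 8000
print(history_tokens(100))           # 20000
print(cumulative_input_tokens(100))  # 1010000
```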

The Context Window Ceiling

Every model has a maximum context length. Exceed it and the API returns an error. Common limits:

  • 8K--32K tokens: older or smaller models
  • 128K tokens: most current production models
  • 1M+ tokens: extended-context models (Llama 4, Gemini 1.5)

Even with a 128K window, a long-running agent that makes tool calls, collects outputs, and reasons over results can burn through context in a few dozen steps. Large contexts also cost more: input tokens are billed on every request, and some providers charge a higher per-token rate above a length threshold.

The practical ceiling for cost-effective operation is often lower than the technical limit. A 128K-token context window is available, but filling it completely on every request gets expensive fast.

Strategy 1: Sliding Window

The simplest approach is to keep only the N most recent turns:

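    from typing import Optional

    def trim_messages(messages: list, max_turns: int = 20, system_msg: Optional[dict] = None) -> list:
        # Always preserve the system prompt: use the one passed in, else the one in the history
        if system_msg is None:
            system_msg = next((m for m in messages if m["role"] == "system"), None)
        non_system = [m for m in messages if m["role"] != "system"]
        # Keep only the most recent turns
        # A "turn" is a user+assistant pair, so we keep max_turns * 2 messages
        trimmed = non_system[-(max_turns * 2):]
        # If the cut landed mid-turn, drop leading replies so the window starts on a user message
        while trimmed and trimmed[0]["role"] != "user":
            trimmed = trimmed[1:]
        if system_msg:
            return [system_msg] + trimmed
        return trimmed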

This caps your context cost at a known maximum. The downside is that old context gets dropped entirely. For many tasks -- coding help, Q&A, step-by-step workflows -- this is fine. Users rarely need the model to recall something from 30 turns ago.

Where sliding window breaks down is in long research or planning tasks where earlier decisions constrain later ones. Dropping the context for "I decided to use PostgreSQL because the team already has it running" means the model might suggest SQLite three turns later.

A few practical notes on sliding windows:

  • Always keep the system prompt. It should never roll off.
  • Prefer dropping from the oldest end, not the middle.
  • Drop user+assistant pairs together. Dropping just the user message while keeping the assistant response creates a confused context where the model appears to have answered a question that was never asked.
  • Set your window size based on your model's context limit minus your expected response length and tool outputs. Leave headroom.
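
The notes above can be combined into a window sized by tokens rather than by turn count. The following is a minimal sketch, not a definitive implementation: `trim_to_budget` and `rough_token_count` are hypothetical names, and the 4-characters-per-token estimate is an assumption you would replace with your model's tokenizer. `max_history_tokens` is meant to be the context limit minus the expected response length and tool outputs.

```python
from typing import Callable


def rough_token_count(message: dict) -> int:
    """Very rough estimate (assumption): ~4 characters per token, plus overhead."""
    return len(message.get("content") or "") // 4 + 4


def trim_to_budget(
    messages: list[dict],
    max_history_tokens: int,
    count: Callable[[dict], int] = rough_token_count,
) -> list[dict]:
    """Drop the oldest turns until the non-system history fits the budget.

    The system prompt is always kept. A turn starts at each user message,
    so a user message is never dropped without the replies that follow it.
    """
    system = [m for m in messages if m["role"] == "system"]
    history = [m for m in messages if m["role"] != "system"]

    # Group messages into turns: a user message plus everything after it
    # up to the next user message.
    turns: list[list[dict]] = []
    for m in history:
        if m["role"] == "user" or not turns:
            turns.append([m])
        else:
            turns[-1].append(m)

    def total(ts: list[list[dict]]) -> int:
        return sum(count(m) for t in ts for m in t)

    # Drop from the oldest end, but always keep the latest turn.
    while len(turns) > 1 and total(turns) > max_history_tokens:
        turns.pop(0)

    return system + [m for t in turns for m in t]
```

Grouping by turn before trimming is what keeps pairs intact; a plain slice over the message list can cut between a question and its answer.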

Strategy 2: Summarization

Rather than dropping old context, you compress it. When the conversation exceeds a threshold, you call the model to summarize the history so far, then replace those messages with a single summary message:

    async def summarize_older_history(client, messages: list, keep_recent: int = 10) -> list:
        # Nothing to do while the history is still short
        if len(messages) <= keep_recent + 1:  # +1 for system
            return messages

        system_msg = next((m for m in messages if m["role"] == "system"), None)
        non_system = [m for m in messages if m["role"] != "system"]

        # Split into the older part to compress and the recent part to keep verbatim
        to_summarize = non_system[:-keep_recent]
        to_keep = non_system[-keep_recent:]

        summary_prompt = [
            {"role": "user", "content": (
                "Summarize the following conversation history concisely. "
                "Preserve key decisions, facts established, and any important context "
                "that might be needed later.\n\n"
                + "\n".join(f"{m['role'].upper()}: {m['content']}" for m in to_summarize)
            )}
        ]

        summary_response = await client.chat.completions.create(
            model="qwen3-coder",
            messages=summary_prompt,
            max_tokens=500
        )
        summary_text = summary_response.choices[0].message.content

        # Replace the older messages with a single summary message
        summary_message = {
            "role": "user",
            "content": f"[Earlier conversation summary: {summary_text}]"
        }

        result = []
        if system_msg:
            result.append(system_msg)
        result.append(summary_message)
        result.extend(to_keep)
        return result

Summarization preserves semantic content at the cost of an extra API call and some latency. It works well when:

  • The early conversation contains important decisions or constraints
  • You're running an agent over a long session (hours, not minutes)
  • Exact phrasing from earlier turns doesn't matter, only the meaning

The cost of the summarization call itself needs to factor into your budget. On fast inference providers, this adds latency on the order of a few hundred milliseconds, which is usually acceptable as a background operation.

Strategy 3: External Memory

For long-running or stateful agents, storing memory outside the context window entirely is often the right call. Instead of feeding all history to the model, you retrieve only what's relevant to the current step.

The basic pattern:

    # `embedding_client` and `vector_db` are placeholders for whatever embedding
    # API and vector store you use; embed(), upsert(), and query() stand in for
    # their real methods.
    class MemoryStore:
        def __init__(self, embedding_client, vector_db):
            self.embedder = embedding_client
            self.db = vector_db

        async def store(self, turn: dict):
            # Each turn is expected to carry a unique "id" alongside role and content
            text = f"{turn['role']}: {turn['content']}"
            embedding = await self.embedder.embed(text)
            self.db.upsert({"text": text, "embedding": embedding, "turn_id": turn["id"]})

        async def retrieve(self, query: str, top_k: int = 5) -> list[str]:
            query_embedding = await self.embedder.embed(query)
            results = self.db.query(query_embedding, top_k=top_k)
            return [r["text"] for r in results]


    async def build_context_with_memory(store, current_query: str, recent_messages: list) -> list:
        # Retrieve only the memories relevant to the current step
        relevant_memories = await store.retrieve(current_query)
        memory_block = "\n".join(relevant_memories)
        memory_message = {
            "role": "system",
            "content": f"Relevant context from earlier in the conversation:\n{memory_block}"
        }
        return [memory_message] + recent_messages

This approach lets an agent maintain context across sessions that span hours or days without ever hitting a context limit. The tradeoff is complexity: you need an embedding model, a vector store, and retrieval logic, and retrieval quality determines what the agent "remembers."

For most agents, a hybrid approach works well: keep a short sliding window of recent messages for immediate context, and use vector retrieval for older information.
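
A hybrid context builder can be sketched end to end without any external services. Everything here is a toy stand-in and labeled as such: `toy_embed` uses word counts instead of a real embedding model, `ToyMemory` is an in-memory list instead of a vector database, and all names are hypothetical. The point is the shape of the flow: older messages live in the store, recent ones stay verbatim.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model (assumption): lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ToyMemory:
    """In-memory stand-in for a vector store."""

    def __init__(self) -> None:
        self.items: list[tuple[str, Counter]] = []

    def store(self, message: dict) -> None:
        text = f"{message['role']}: {message['content']}"
        self.items.append((text, toy_embed(text)))

    def retrieve(self, query: str, top_k: int = 3) -> list[str]:
        q = toy_embed(query)
        scored = [(cosine(q, vec), text) for text, vec in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        # Skip entries with no overlap at all
        return [text for score, text in scored[:top_k] if score > 0]


def build_hybrid_context(system_prompt: str, memory: ToyMemory,
                         recent: list[dict], query: str, top_k: int = 3) -> list[dict]:
    """System prompt first, then retrieved memories, then the recent window."""
    context = [{"role": "system", "content": system_prompt}]
    memories = memory.retrieve(query, top_k=top_k)
    if memories:
        context.append({
            "role": "system",
            "content": "Relevant context from earlier in the conversation:\n"
                       + "\n".join(memories),
        })
    return context + recent


memory = ToyMemory()
memory.store({"role": "user", "content": "We decided to use PostgreSQL because the team already runs it."})
memory.store({"role": "user", "content": "The frontend is written in TypeScript."})
recent = [{"role": "user", "content": "Which database did we decide to use?"}]
context = build_hybrid_context("You are a helpful assistant.", memory, recent,
                               recent[-1]["content"], top_k=1)
```

With a real embedding model, retrieval would match on meaning rather than shared words, but the context assembly stays the same.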

Managing Tool Call History

Agents that make tool calls accumulate large message sequences. Each tool call produces a tool_calls message from the assistant and a tool role message with the result. These can be long -- a web search result, a file read, a database query.

A few approaches for keeping tool call history manageable:

Summarize large tool results before storing them. If a tool returns 10,000 tokens of raw data, have the model extract the relevant parts before adding it to the messages list:

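    def count_tokens(text: str) -> int:
        # Rough stand-in (assumption): about 4 characters per token.
        # Swap in your model's tokenizer for accurate counts.
        return len(text) // 4


    async def store_tool_result(client, tool_name: str, raw_result: str, max_tokens: int = 300) -> str:
        # Small results go into the history unchanged
        if count_tokens(raw_result) <= max_tokens:
            return raw_result

        # Large results are condensed by the model before they are stored
        response = await client.chat.completions.create(
            model="qwen3-coder",
            messages=[{
                "role": "user",
                "content": f"Extract only the key facts from this {tool_name} result:\n\n{raw_result}"
            }],
            max_tokens=max_tokens
        )
        return response.choices[0].message.content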

Trim tool messages first, before conversation turns. Tool results are often single-use: they mattered when the model used them, but future turns rarely need to see the raw output. You can drop old tool result messages more aggressively than you drop user/assistant turns.

Keep tool call structure even when dropping content. The model needs to know what tools were called to avoid repeating work. If you drop a tool result, replace it with a placeholder: "[result from search_web call #3 -- dropped to save context]". This tells the model what happened without paying for the full token cost.
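
A minimal sketch of that placeholder approach follows. It assumes OpenAI-style tool messages (`role` set to `"tool"`, with a `tool_call_id`); the function name and the placeholder wording are illustrative. Only the content of older results is replaced, so each tool message stays paired with the assistant call that produced it.

```python
def compact_tool_results(messages: list[dict], keep_last: int = 2) -> list[dict]:
    """Replace the content of all but the newest `keep_last` tool results.

    Message order, roles, and tool_call_id fields are left untouched, and
    the input list is not modified.
    """
    tool_indexes = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    if keep_last > 0:
        to_compact = set(tool_indexes[:-keep_last])
    else:
        to_compact = set(tool_indexes)

    compacted = []
    for i, m in enumerate(messages):
        if i in to_compact:
            call_id = m.get("tool_call_id", "unknown")
            # Copy the message so the caller's history stays intact
            m = {**m, "content": f"[result for tool call {call_id} dropped to save context]"}
        compacted.append(m)
    return compacted
```

Calling this on the history just before each request keeps the most recent results available in full while older ones shrink to a single line each.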

Cost Optimization at Scale

If you're running many concurrent agents or high-turn sessions, the token costs compound. A few specific levers:

Count tokens before sending. Don't wait for an API error to discover you've exceeded the context limit. Most tokenizers have a fast local token count. Budget 20--30% headroom for the response:

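    import tiktoken


    def estimate_tokens(messages: list, model: str = "gpt-4") -> int:
        try:
            enc = tiktoken.encoding_for_model(model)
        except KeyError:
            # Model unknown to tiktoken (e.g. a non-OpenAI model): fall back to a
            # general-purpose encoding and treat the result as an approximation
            enc = tiktoken.get_encoding("cl100k_base")
        total = 0
        for message in messages:
            total += 4  # approximate overhead per message
            for value in message.values():
                if isinstance(value, str):
                    total += len(enc.encode(value))
        return total + 2  # reply priming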

Use prefix caching if your provider supports it. Prefix caching reuses KV cache across requests for shared prompt prefixes -- like a system prompt or a long document you reference on every turn. General Compute supports prefix caching, which means you only pay for the new tokens added to a cached prefix, not the full context on each call.

Route by conversation length. Short conversations with a few turns are fine on larger, more capable models. Very long conversations where you've already summarized most of the context into a compact form can be handled by a smaller, faster, cheaper model. Building a routing layer that chooses model based on current context length pays off at scale.

Set token budgets per agent. In production, an agent that enters a bad loop can run up a large bill before a human notices. Set a hard limit on the total tokens consumed per session and terminate or escalate when it's hit.
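
A per-session budget can be as small as a counter with a hard stop. This is a sketch under assumptions: `TokenBudget` and `TokenBudgetExceeded` are hypothetical names, and what happens when the limit is hit (terminate, escalate, alert) is left to the caller.

```python
class TokenBudgetExceeded(RuntimeError):
    """Raised when a session has used up its token allowance."""


class TokenBudget:
    def __init__(self, max_total_tokens: int) -> None:
        self.max_total_tokens = max_total_tokens
        self.used = 0

    def record(self, tokens: int) -> None:
        """Add the tokens consumed by one request and enforce the limit."""
        self.used += tokens
        if self.used > self.max_total_tokens:
            raise TokenBudgetExceeded(
                f"session used {self.used} tokens, limit is {self.max_total_tokens}"
            )

    @property
    def remaining(self) -> int:
        """Tokens left before the limit, never negative."""
        return max(self.max_total_tokens - self.used, 0)
```

With an OpenAI-compatible response, the value to record after each call would typically be `response.usage.total_tokens`, assuming your provider returns usage data. Catching `TokenBudgetExceeded` in the agent loop is where you would stop the session or hand it to a human.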

Choosing the Right Pattern

The choice between sliding window, summarization, and external memory comes down to your latency and fidelity requirements:

| Pattern | Latency added | Memory fidelity | Complexity |
|---|---|---|---|
| Sliding window | None | Low (drops context) | Very low |
| Summarization | 200--500ms per summary | Medium (semantic) | Low |
| External memory | 50--150ms per retrieval | High (retrieved) | Medium |
| Hybrid | 50--500ms | High | Medium |

For most stateless or short-session agents, a sliding window with a reasonable turn limit (10--20 turns) is all you need. For long-running research or planning agents where earlier context matters, add summarization or external memory.

The pattern you choose also affects which models you can use. Sliding window works with any model. Summarization benefits from a fast model to keep the extra call from adding perceptible latency. External memory requires an embedding model alongside your generation model.

Putting It Together

A production-grade conversation manager might look like this:

    # Relies on summarize_older_history() from the summarization section
    class ConversationManager:
        def __init__(self, client, max_recent_turns: int = 15, summary_threshold: int = 30):
            self.client = client
            self.max_recent = max_recent_turns          # sliding window size, in turns
            self.summary_threshold = summary_threshold  # summarize above this many messages
            self.messages = []                          # non-system history only
            self.system_prompt = None

        def set_system_prompt(self, content: str):
            self.system_prompt = {"role": "system", "content": content}

        def add_turn(self, role: str, content: str):
            self.messages.append({"role": role, "content": content})

        async def get_context(self) -> list:
            msgs = self.messages

            # Summarize if the history has grown past the threshold.
            # Note: summary_threshold and keep_recent both count messages, not turns.
            if len(msgs) > self.summary_threshold:
                msgs = await summarize_older_history(
                    self.client, msgs, keep_recent=self.max_recent
                )
                self.messages = [m for m in msgs if m["role"] != "system"]

            # Trim to sliding window (max_recent turns = max_recent * 2 messages)
            non_system = [m for m in msgs if m["role"] != "system"]
            trimmed = non_system[-(self.max_recent * 2):]

            # Always prepend the system prompt explicitly
            result = []
            if self.system_prompt:
                result.append(self.system_prompt)
            result.extend(trimmed)
            return result

        async def chat(self, user_message: str) -> str:
            self.add_turn("user", user_message)
            context = await self.get_context()
            response = await self.client.chat.completions.create(
                model="qwen3-coder",
                messages=context
            )
            assistant_message = response.choices[0].message.content
            self.add_turn("assistant", assistant_message)
            return assistant_message

This isn't a complete production system -- you'd want persistence, error handling, token counting, and logging on top of this. But the structure shows how the pieces fit: the manager owns the message list, applies the appropriate trimming strategy, and keeps the context window under control.

What Actually Goes Wrong in Production

A few failure modes that come up repeatedly:

Forgetting to carry the system prompt. When you build a trimmed context, it's easy to accidentally omit the system prompt. The agent loses its persona, constraints, and instructions. Always explicitly prepend the system prompt to your trimmed message list.

Dropping half a turn. If you drop the user message but keep the assistant response, the model sees it answered a question that wasn't asked. Drop complete turns (user + assistant pair) together.

Summarizing too infrequently. If you wait until you're near the context limit to summarize, you're summarizing a huge chunk of history in one go. Summarize earlier and more frequently to keep the summary task manageable.

Not testing at the edges. Most conversation bugs only appear at turn 40 or turn 80, not in your unit tests. Include integration tests that run realistic long sessions to catch context management bugs before they hit production.
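
A long-session test does not need a live model. The sketch below simulates 80 turns against a fake model and checks a few invariants on every context that would have been sent. The helper names are hypothetical, and `sliding_window` is a stand-in for whatever context builder you actually ship.

```python
def sliding_window(system_prompt: dict, history: list[dict], max_turns: int) -> list[dict]:
    """Minimal context builder under test (stand-in for your own)."""
    window = history[-(max_turns * 2):]
    # If the cut landed mid-turn, drop the orphaned reply at the front
    while window and window[0]["role"] != "user":
        window = window[1:]
    return [system_prompt] + window


def fake_model(context: list[dict]) -> str:
    """Stand-in for the API call, so the test is fast and deterministic."""
    return f"reply to: {context[-1]['content']}"


def run_long_session(turns: int, max_turns: int) -> list[list[dict]]:
    """Simulate a session and return the context sent on every request."""
    system_prompt = {"role": "system", "content": "You are a test agent."}
    history: list[dict] = []
    sent: list[list[dict]] = []
    for i in range(turns):
        history.append({"role": "user", "content": f"question {i}"})
        context = sliding_window(system_prompt, history, max_turns)
        sent.append(context)
        history.append({"role": "assistant", "content": fake_model(context)})
    return sent


def check_invariants(sent: list[list[dict]], max_turns: int) -> None:
    for context in sent:
        assert context[0]["role"] == "system"      # system prompt never rolls off
        assert len(context) <= 1 + max_turns * 2   # window stays bounded
        assert context[1]["role"] == "user"        # no orphaned reply at the front
        assert context[-1]["role"] == "user"       # request ends on the new question


check_invariants(run_long_session(turns=80, max_turns=10), max_turns=10)
```

Without the alignment loop in `sliding_window`, the third invariant fails as soon as the history outgrows the window, which is the kind of bug that never shows up in a five-turn unit test.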


General Compute's API is OpenAI-compatible, so these patterns work directly with our endpoint by swapping the base URL. Our inference speeds make strategies like in-context summarization faster -- when the extra summarization call takes 200ms instead of 2 seconds, you can afford to run it more frequently and keep history tighter. Check out the General Compute docs to get started.
