Agent Readout
What Are Agentic AI Systems? How to Build Them With Fast Inference
Agentic AI systems chain LLM calls into autonomous loops that plan, act, and observe. This guide covers the core components, the main reasoning patterns (ReAct, Plan-and-Execute, Tree of Thoughts), and how inference speed shapes what you can actually build.
- Author: General Compute
- Published: 2026-06-27
- Tags: agents, agentic ai systems, inference, langchain, langgraph
A chatbot answers a question. An agentic AI system accomplishes a goal. The difference is that the system does not wait for the next user message between each step. It plans, uses tools, checks the result, and keeps going until the task is done or it determines it cannot continue.
This sounds simple. In practice, building an agent that behaves reliably in production requires understanding the components that make up the loop, the reasoning patterns that govern how the loop runs, and the infrastructure constraints that determine which designs are actually feasible. Fast inference sits at the center of most of those constraints.
## What an agentic system actually is
An agentic AI system is a program that runs an LLM inside a loop. Each iteration of the loop consists of three phases: observe, think, and act.
**Observe** means taking in the current state. That might be a user message, the output of the last tool call, an error from a failed action, or a retrieved document.
**Think** means passing that state to the model and generating a decision. The model might decide to call a tool, generate a piece of code, ask a clarifying question, or declare the task complete.
**Act** means executing the decision and capturing the output to feed back into the next observation.
The loop continues until a stopping condition is met: the goal is achieved, a maximum step count is reached, the model outputs a stop signal, or an error is unrecoverable.
What distinguishes an agent from a simple chain of prompts is that the next step is not predetermined. The model decides at each iteration what to do based on what it just learned. This is what makes agents powerful: they can handle tasks that require variable-length reasoning where you cannot know in advance how many steps the task will take.
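The loop and its stopping conditions can be sketched in a few lines. This is a minimal illustration, not a production runtime: `run_agent`, the `think` callable, and the decision format (`{"action": ..., "input": ...}`) are assumptions made for the example, and `scripted_think` stands in for a real LLM call.

```python
from typing import Callable

def run_agent(goal: str,
              think: Callable[[list[str]], dict],
              tools: dict[str, Callable[[str], str]],
              max_steps: int = 10) -> str:
    """Run observe -> think -> act until a stopping condition is met."""
    observations = [f"goal: {goal}"]            # observe: the initial state
    for _ in range(max_steps):                  # stop: maximum step count
        decision = think(observations)          # think: the model picks the next action
        if decision["action"] == "finish":      # stop: the model signals completion
            return decision["input"]
        tool = tools.get(decision["action"])
        if tool is None:
            result = f"error: unknown tool {decision['action']!r}"
        else:
            try:
                result = tool(decision["input"])    # act: execute the decision
            except Exception as exc:
                result = f"error: {exc}"            # errors become observations too
        observations.append(result)             # feed the result into the next observation
    return "stopped: step limit reached"

# Stand-in for an LLM call: look something up once, then finish with the result.
def scripted_think(observations: list[str]) -> dict:
    if len(observations) == 1:
        return {"action": "lookup", "input": "capital of France"}
    return {"action": "finish", "input": observations[-1]}

answer = run_agent("Find the capital of France", scripted_think,
                   {"lookup": lambda query: "Paris"})
print(answer)  # Paris
```

Note that the number of iterations is not fixed in advance: the `think` callable decides at each step whether to call a tool or finish.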
## Core components
Every agentic system, regardless of framework, has the same basic parts.
**The model.** The LLM that drives decisions. In most systems this is a single model used for all steps, though some architectures use a larger model for planning and a smaller one for execution.
**The context window.** Everything the model knows at any given step. This includes the system prompt, the conversation history, recent tool outputs, and any retrieved documents. Managing what goes into the context window is one of the harder engineering problems in agents, because context windows are finite and most agent tasks generate more content than fits.
**The tool registry.** The set of actions available to the model. Tools are functions the agent can call: searching the web, reading a file, querying a database, writing code, sending an email. The model generates a structured call (usually JSON), the runtime executes it, and the result comes back as an observation.
**The memory system.** How state persists across steps and across sessions. Short-term memory is the context window. Long-term memory is typically a vector database, a key-value store, or a structured log that the agent can query. The design of the memory system determines what the agent can remember and how fast retrieval is.
**The executor.** The runtime that actually runs the loop, dispatches tool calls, manages retries, and handles errors. This is where frameworks like LangGraph, CrewAI, and AutoGen live.
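To make the tool registry and the executor's dispatch step concrete, here is a minimal sketch. The names `TOOLS`, `tool`, and `dispatch`, the two example tools, and the `{"name": ..., "arguments": ...}` call format are all assumptions for illustration; real frameworks define their own schemas.

```python
import json
from typing import Callable

TOOLS: dict[str, Callable[..., str]] = {}

def tool(fn: Callable[..., str]) -> Callable[..., str]:
    """Register a function in the tool registry under its own name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def read_file(path: str) -> str:
    return f"(contents of {path})"      # placeholder, does not touch the disk

@tool
def add(a: float, b: float) -> str:
    return str(a + b)

def dispatch(raw_call: str) -> str:
    """Execute one model-generated JSON call and return the observation text."""
    try:
        call = json.loads(raw_call)
        name, args = call["name"], call.get("arguments", {})
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        return f"error: malformed tool call ({exc})"
    if name not in TOOLS:
        return f"error: unknown tool {name!r}"
    try:
        return TOOLS[name](**args)
    except TypeError as exc:            # wrong or missing arguments
        return f"error: bad arguments for {name}: {exc}"

print(dispatch('{"name": "add", "arguments": {"a": 2, "b": 3}}'))  # 5
```

Returning errors as strings rather than raising lets the model see what went wrong and correct its next call.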
## Reasoning patterns
The pattern you choose for structuring the agent's thinking determines how the model moves from observation to action. Three patterns come up most often.
### ReAct
ReAct (Reasoning and Acting) is the most widely deployed pattern. The model is prompted to emit a reasoning trace before each action:
```
Thought: I need to find the current price of AAPL stock.
Action: search_web("AAPL stock price today")
Observation: Apple Inc. (AAPL) is trading at $212.40 as of market close.
Thought: I have the price. I can now answer the question.
Action: finish("AAPL is trading at $212.40.")
```
The interleaved reasoning makes the model's decisions more interpretable and often more reliable. When the model has to explain what it is doing before doing it, it tends to catch mistakes in its own plan before they turn into failed tool calls.
ReAct works well for tasks with clear structure: fetch data, transform it, report it. It is less effective for tasks that require genuine multi-step planning, because the model only looks one step ahead.
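On the runtime side, a ReAct executor has to pull the action out of the model's text before it can run anything. The sketch below assumes the exact `Action: tool("argument")` format from the trace above, with a single quoted string argument; `parse_react_step` is a hypothetical helper, and real deployments more often rely on native function calling than on regex parsing.

```python
import re

# Matches a line such as: Action: search_web("AAPL stock price today")
ACTION_RE = re.compile(r'^Action:\s*(\w+)\((.*)\)\s*$', re.MULTILINE)

def parse_react_step(text: str) -> tuple[str, str]:
    """Return (tool_name, argument) from the first Action line in the model output."""
    match = ACTION_RE.search(text)
    if match is None:
        raise ValueError("no Action line found")
    name, raw_arg = match.groups()
    return name, raw_arg.strip().strip('"')

step = '''Thought: I need to find the current price of AAPL stock.
Action: search_web("AAPL stock price today")'''

print(parse_react_step(step))  # ('search_web', 'AAPL stock price today')
```

The executor would run the named tool, append an `Observation:` line with the result, and call the model again.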
### Plan-and-Execute
Plan-and-Execute separates planning from acting. In the first phase, the model receives the goal and produces a complete plan as a list of steps. In the second phase, a separate agent (or the same model in a different mode) executes each step, optionally revising the plan when something unexpected happens.
```python
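# planner_llm and executor_llm are placeholder clients; generate() is assumed
# to return a list of step strings (planner) or a result string (executor).

# Phase 1: planning
plan = planner_llm.generate(
    system="You are a planning agent. Break this task into steps.",
    user=goal,
)
# returns: ["Step 1: ...", "Step 2: ...", ...]

# Phase 2: execution
accumulated_results = []  # results of earlier steps, passed along as context
for step in plan:
    result = executor_llm.generate(
        system="You are an execution agent. Complete this step.",
        user=step,
        context=accumulated_results,
    )
    accumulated_results.append(result)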
```
This pattern is useful when the task has enough structure that you can plan it upfront, and when the execution of each step is relatively independent. It handles long-horizon tasks better than ReAct because the full plan is explicit and can be inspected or edited.
The downside is rigidity. If step 3 produces a result that invalidates step 4, a pure Plan-and-Execute system either fails or requires a replanning pass, which adds another LLM call.
### Tree of Thoughts
Tree of Thoughts extends the idea of chain-of-thought by exploring multiple reasoning paths simultaneously and using search (BFS, DFS, or best-first) to find the best solution.
Each node in the tree is a partial solution. The model generates multiple candidate next steps from each node, evaluates them, and expands the most promising ones. This is computationally expensive because each node requires at least one LLM call, and the tree can grow large.
```
Goal
├── Plan A
│ ├── A.1 (evaluated: good)
│ │ ├── A.1.1 (evaluated: good)
│ │ └── A.1.2 (evaluated: poor, pruned)
│ └── A.2 (evaluated: poor, pruned)
└── Plan B
└── B.1 (evaluated: good)
└── B.1.1 (evaluated: best, selected)
```
Tree of Thoughts is rarely used in production today because every node costs at least one LLM call, which is too slow at current inference speeds. It is more of a research result than a practical pattern. That said, as inference gets faster, the wall-clock cost of tree search drops proportionally, and some teams are starting to use lightweight versions (two or three candidates per step rather than full search) for high-value tasks.
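A lightweight version of this idea can be sketched as a beam search: expand every node on the current level, score the candidates, and keep only the best few. This is a simplified stand-in for full Tree of Thoughts, not the original algorithm. The function `tree_search` and its `propose` and `score` callables are assumptions for the example; in a real agent both would be LLM calls, and here they are replaced by a toy arithmetic problem.

```python
from typing import Callable, TypeVar

S = TypeVar("S")

def tree_search(root: S,
                propose: Callable[[S], list[S]],
                score: Callable[[S], float],
                beam_width: int = 2,
                depth: int = 3) -> S:
    """Level-by-level search that keeps only the best `beam_width` nodes per level."""
    frontier = [root]
    for _ in range(depth):
        # Generate candidate next steps from every node kept so far.
        candidates = [child for node in frontier for child in propose(node)]
        if not candidates:
            break
        # Evaluate the candidates and prune everything outside the beam.
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
    return max(frontier, key=score)

# Toy problem: pick three numbers from {1, 2, 3} whose sum is as close to 7 as possible.
target = 7
best = tree_search(
    root=(),
    propose=lambda path: [path + (n,) for n in (1, 2, 3)],
    score=lambda path: -abs(target - sum(path)),
)
print(best, sum(best))
```

With a beam width of 2 and three candidates per node, each level costs at most 2 proposal calls and 6 scoring calls, which is the kind of bounded budget that makes small-branching search practical.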
## Frameworks: LangGraph, CrewAI, AutoGen
You do not need a framework to build an agent, but most teams use one because they handle the boilerplate: state management, tool dispatch, retry logic, streaming, and observability.
### LangGraph
LangGraph models agents as directed graphs where nodes are functions (often LLM calls or tool calls) and edges define control flow. The graph state is a typed dictionary that each node reads from and writes to.
```python
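from typing import Annotated, TypedDict

from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages

# llm (a chat model with tools bound) and execute_tools (returns a list of
# tool-result messages) are assumed to be defined elsewhere.

class AgentState(TypedDict):
    # add_messages appends returned messages instead of overwriting the list
    messages: Annotated[list, add_messages]
    tool_calls: list
    final_answer: str | None

def call_model(state: AgentState):
    response = llm.invoke(state["messages"])
    # Store the requested tool calls so call_tool can read them from state
    return {"messages": [response], "tool_calls": response.tool_calls}

def call_tool(state: AgentState):
    results = execute_tools(state["tool_calls"])
    return {"messages": results}

def should_continue(state: AgentState):
    last = state["messages"][-1]
    if last.tool_calls:
        return "call_tool"
    return END

graph = StateGraph(AgentState)
graph.add_node("call_model", call_model)
graph.add_node("call_tool", call_tool)
graph.set_entry_point("call_model")  # the graph needs a starting node
graph.add_conditional_edges("call_model", should_continue)
graph.add_edge("call_tool", "call_model")
app = graph.compile()  # compile before invoking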
```
LangGraph is a good fit when the agent's control flow is nontrivial: human-in-the-loop steps, branching on tool results, parallel sub-tasks. The graph abstraction makes the flow inspectable and testable.
### CrewAI
CrewAI organizes agents into crews: groups of specialized agents that collaborate to complete a task. Each agent has a role, a goal, and a backstory that shapes how it behaves. A crew has a process (sequential or hierarchical) that determines how agents hand off work.
```python
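from crewai import Agent, Task, Crew, Process

# search_tool and browser_tool are assumed to be tool instances defined elsewhere.

researcher = Agent(
    role="Research Analyst",
    goal="Find accurate information about the given topic",
    backstory="You are a careful analyst who verifies claims against sources.",
    tools=[search_tool, browser_tool],
)

writer = Agent(
    role="Content Writer",
    goal="Write a clear summary based on research findings",
    backstory="You turn technical research into concise, readable prose.",
    tools=[],
)

research_task = Task(
    description="Research the current state of LLM inference hardware.",
    expected_output="A list of key findings with sources.",
    agent=researcher,
)

write_task = Task(
    description="Write a 500-word summary of the research findings.",
    expected_output="A 500-word summary in plain prose.",
    agent=writer,
)

crew = Crew(
    agents=[researcher, writer],
    tasks=[research_task, write_task],
    process=Process.sequential,
)

result = crew.kickoff()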
```
CrewAI is useful for tasks that map naturally onto multiple specialized roles. The role-based framing helps when you want different agents to behave differently within the same pipeline.
### AutoGen
AutoGen from Microsoft takes a conversation-centric view. Agents communicate with each other through a shared message protocol. The runtime coordinates who speaks next and handles things like code execution in sandboxed environments.
```python
from autogen import AssistantAgent, UserProxyAgent

# AutoGen 0.2-style API: endpoints go in config_list, and the endpoint key
# is base_url (older releases called it api_base).
assistant = AssistantAgent(
    name="assistant",
    llm_config={
        "config_list": [{
            "model": "your-model",
            "base_url": "https://api.generalcompute.com/v1",
            "api_key": "your-key",
        }]
    },
)

user_proxy = UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",  # run without asking a human for input
    code_execution_config={"work_dir": "coding", "use_docker": False},
)

user_proxy.initiate_chat(
    assistant,
    message="Write and test a Python function that sorts a list of dicts by a given key.",
)
```
AutoGen handles code generation and execution well. The UserProxyAgent can run code locally or in a container and feed the result back to the assistant, which is the core loop for code agents.
## Why inference speed determines what you can build
The patterns above all share a property: each step requires at least one LLM call, and steps are sequential because each step depends on the output of the previous one. This is where inference speed becomes a hard constraint, not a preference.
Consider a ReAct agent doing a task that takes 10 steps. If each step pays 600ms in time-to-first-token and generates about 80 tokens at 80 tokens per second (1 second of decode), each call costs roughly 1.6 seconds. The 10-step task takes at minimum 16 seconds, before tool execution.
Swap in a model with 150ms TTFT and 200 tokens per second decode, and the same 80-token step costs 0.55 seconds (0.15 seconds of TTFT plus 0.4 seconds of decode). The 10-step task now takes about 5.5 seconds. Same framework, same tools, same prompts, and roughly 3x faster end to end.
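The arithmetic is simple enough to keep as a helper for estimating budgets on your own workloads. The function names below are illustrative, and the model deliberately ignores tool execution time, network overhead, and prompt length effects on TTFT.

```python
def step_latency(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Latency of one LLM call: time to first token plus decode time."""
    return ttft_s + output_tokens / tokens_per_s

def task_latency(steps: int, ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Lower bound for a sequential task: steps cannot overlap."""
    return steps * step_latency(ttft_s, output_tokens, tokens_per_s)

slow = task_latency(10, ttft_s=0.600, output_tokens=80, tokens_per_s=80)
fast = task_latency(10, ttft_s=0.150, output_tokens=80, tokens_per_s=200)
print(f"slow: {slow:.1f}s  fast: {fast:.1f}s  speedup: {slow / fast:.1f}x")
# slow: 16.0s  fast: 5.5s  speedup: 2.9x
```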
The less obvious effect is on architecture. When steps are cheap, you can afford patterns that would be too slow otherwise:
- Running two candidate plans in parallel and selecting the better one
- Adding a validation step after every tool call
- Using a smaller, faster model for simple steps and a larger model only for the hard ones
- Attempting Tree of Thoughts search at small branching factors
When steps are expensive, you are forced to minimize the number of LLM calls. You write prompts that try to do more in one shot. You skip validation. You accept that failed steps are costly and design around them instead of catching them early.
The agents that handle 20-step tasks today are often constrained by what fits in the latency budget, not by what would produce the best result. As inference gets faster, those constraints loosen.
## Practical considerations for production
**Context management.** Agent tasks accumulate a lot of text. Tool results can be long. Conversation history grows with each step. Left unmanaged, context overflow will truncate earlier observations and break the agent's reasoning. The standard approaches are summarization (periodically compress earlier history into a shorter summary) and retrieval (store old observations externally and pull them back when relevant).
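The summarization approach can be sketched as follows. This is an illustration under loud assumptions: `rough_tokens` counts whitespace-separated words instead of using a real tokenizer, `compact_history` and its `reserve` parameter are hypothetical, and `summarize` would be an LLM call in practice rather than the placeholder lambda used here.

```python
from typing import Callable

def rough_tokens(text: str) -> int:
    """Crude token estimate (assumption): one token per whitespace-separated word."""
    return len(text.split())

def compact_history(system: str, history: list[str], budget: int,
                    summarize: Callable[[list[str]], str],
                    reserve: int = 20) -> list[str]:
    """Keep the newest messages that fit in the budget; summarize everything older."""
    # Hold back `reserve` tokens so the summary itself has room.
    remaining = budget - rough_tokens(system) - reserve
    kept: list[str] = []
    for message in reversed(history):           # walk from newest to oldest
        cost = rough_tokens(message)
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost
    kept.reverse()                              # restore chronological order
    older = history[: len(history) - len(kept)]
    summary = [summarize(older)] if older else []
    return [system, *summary, *kept]

system = "You are an agent."
history = [f"observation {i}: " + "data " * 20 for i in range(10)]
context = compact_history(
    system, history, budget=100,
    summarize=lambda msgs: f"[summary of {len(msgs)} earlier messages]",
)
print(len(context), context[1])  # 5 [summary of 7 earlier messages]
```

Keeping the newest observations verbatim and compressing the oldest matches how most agent steps depend on recent results more than on early ones.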
**Structured output reliability.** Agents depend on the model generating valid tool call JSON. Parsing failures cause retries, which compound latency. Use a serving stack that supports grammar-constrained decoding or native function calling to minimize parse failures.
**Observability.** Debugging a 15-step agent from logs is painful. You want traces that show each step's input, output, token count, latency, and whether it was a retry. Most frameworks have integrations with tracing tools (LangSmith, Langfuse, Arize). Set these up before you go to production.
**Failure modes.** Agents fail in ways that chat doesn't. The model can get stuck in a loop, calling the same tool repeatedly without making progress. It can hallucinate tool arguments. It can lose track of the original goal after many steps. You need timeouts, step count limits, and sanity checks. Set an explicit budget as well: decide how many LLM calls you are willing to spend before forcing a stop.
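Two of these guards, a call budget and detection of repeated identical tool calls, fit in a small helper. `RunGuard` and its thresholds are hypothetical names and values for this sketch; timeouts and argument sanity checks are left out and would need to be handled separately.

```python
from collections import Counter
from typing import Optional

class RunGuard:
    """Stops an agent run on a call budget or on a repeated identical tool call."""

    def __init__(self, max_llm_calls: int = 20, max_repeats: int = 3):
        self.max_llm_calls = max_llm_calls
        self.max_repeats = max_repeats
        self.llm_calls = 0
        self.seen = Counter()   # (tool_name, tool_args) -> times requested

    def check(self, tool_name: str, tool_args: str) -> Optional[str]:
        """Record one step; return a stop reason, or None to keep going."""
        self.llm_calls += 1
        if self.llm_calls > self.max_llm_calls:
            return "budget exhausted"
        self.seen[(tool_name, tool_args)] += 1
        if self.seen[(tool_name, tool_args)] >= self.max_repeats:
            return "stuck: same tool call repeated"
        return None

guard = RunGuard(max_llm_calls=10, max_repeats=3)
reasons = [guard.check("search_web", "AAPL price") for _ in range(3)]
print(reasons)  # [None, None, 'stuck: same tool call repeated']
```

Call `check` once per loop iteration, after the model has chosen a tool and before the tool runs, and stop the run when it returns a reason.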
**Testing.** Unit testing individual tools is straightforward. End-to-end testing agents is harder because the path through the agent is non-deterministic. Build a small suite of tasks with known correct outcomes, run the agent against them, and measure success rate and step count. Regression testing becomes important when you change the model or the prompts.
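A small evaluation harness along these lines might look like the sketch below. `Case`, `evaluate`, and the substring-match success check are assumptions for illustration, and `fake_agent` stands in for a real agent that returns its answer and step count. Because agent runs are non-deterministic, a real suite would run each case several times.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    task: str
    expected: str   # substring that a correct answer must contain

def evaluate(agent: Callable[[str], tuple[str, int]], cases: list[Case]) -> dict:
    """Run each case once and report success rate and mean step count."""
    passed, total_steps = 0, 0
    for case in cases:
        answer, steps = agent(case.task)
        passed += int(case.expected.lower() in answer.lower())
        total_steps += steps
    return {"success_rate": passed / len(cases),
            "mean_steps": total_steps / len(cases)}

# Stand-in agent with canned (answer, step_count) results.
def fake_agent(task: str) -> tuple[str, int]:
    answers = {"capital of France?": ("Paris", 2), "2 + 2?": ("5", 3)}
    return answers[task]

report = evaluate(fake_agent, [Case("capital of France?", "Paris"),
                               Case("2 + 2?", "4")])
print(report)  # {'success_rate': 0.5, 'mean_steps': 2.5}
```

Tracking step count alongside success rate catches regressions where a prompt or model change still gets the right answer but takes noticeably longer to reach it.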
## Getting started
If you are new to building agents, start with ReAct and LangGraph. ReAct is well-studied and its failure modes are understood. LangGraph gives you enough structure to build something maintainable without too much abstraction overhead.
Wire it up to a fast inference endpoint from the start. The biggest mistake teams make is building against a slow model and then trying to optimize later. Latency shapes architectural decisions early in development. If your first prototype runs in 3 seconds per step, you will design around that budget. If it runs in 300ms per step, you will build differently.
General Compute's API is OpenAI-compatible, so pointing an existing LangGraph or AutoGen setup at our endpoint is a one-line change:
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-key",
    base_url="https://api.generalcompute.com/v1",
)
```
From there, the rest of your agent code stays the same. The latency difference shows up in the wall clock time on your first test run.
Agentic systems are still maturing. The patterns are stabilizing, the frameworks are still changing, and best practices around memory and evaluation are not fully settled. But the fundamentals (loop, observe, think, act) are not going anywhere. Building fluency with those fundamentals and the constraints that shape them is the most durable investment you can make as the space evolves.