What Is AI Inference? A Developer's Complete Guide

A practical, end-to-end explanation of AI inference: what it is, how the pipeline works, the metrics that matter, the hardware that runs it, and the trade-offs you face when you put a model in production.

Author: General Compute
Published: 2026-05-29
Tags: inference, fundamentals, llm, production, latency

If you have spent any time around machine learning, you have heard the word inference thrown around as if everyone already agrees on what it means. In practice, the definition people carry in their heads is fuzzy, and that fuzziness causes real problems when it comes time to estimate costs, pick hardware, or debug why a model is slow. This guide answers the question "what is AI inference" directly, then walks through the full pipeline, the metrics worth tracking, the hardware options, and the production trade-offs that actually determine whether your application feels fast or feels broken.

## What is AI inference?

AI inference is the process of using a trained model to produce an output from new input. Training is when a model learns its weights from data. Inference is everything that happens afterward, every time you actually run the model to get an answer. When you send a prompt to a language model and it streams back a response, that is inference. When a vision model labels an image, that is inference. When a recommendation system scores a list of items, that is inference.

The distinction matters because the two phases have completely different economics. Training is a one-time (or occasional) cost: you spend a lot of compute up front, and then you have a model. Inference is a recurring cost that scales with usage. A model you train once might serve billions of inference requests over its lifetime. For most companies running models in production, inference is where the bulk of the compute bill goes, and it is the part of the system that users actually feel.

For large language models specifically, inference has a structure worth understanding in detail, because that structure explains nearly everything about why some setups are fast and others crawl.

## The two phases of LLM inference

When a language model processes a request, the work splits into two distinct phases: prefill and decode.

**Prefill** is when the model reads your input prompt. All of the input tokens are processed in parallel in a single forward pass. This is compute-heavy and uses the hardware efficiently, because the GPU can work on many tokens at once. The output of prefill is the first generated token, plus a populated KV cache (more on that below). The time it takes to finish prefill, plus any time the request spends waiting in a queue, is what determines time-to-first-token.

**Decode** is when the model generates the response, one token at a time. Each new token depends on all the tokens before it, so this phase is inherently sequential. You generate a token, append it, run another forward pass, generate the next token, and so on. At small batch sizes, decode is typically memory-bandwidth-bound rather than compute-bound, because every forward pass has to read the model's weights out of memory (all of them, for a dense model) to produce just one token per sequence. This is the slow, expensive part of generation, and it is why a long response costs more than a short one even when the prompt is identical.

The reason this split matters: a request with a huge prompt and a tiny output is dominated by prefill, while a request with a small prompt and a long output is dominated by decode. These two regimes stress different parts of the hardware, and a serving stack tuned for one can be poorly suited to the other.
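To make the decode bottleneck concrete, here is a back-of-the-envelope sketch. It assumes a dense model whose weights are all read once per decode step, and it ignores KV cache reads, compute time, and kernel overhead, so it gives a ceiling rather than a prediction. The 70-billion-parameter model, 2 bytes per weight, and 3 TB/s of memory bandwidth are illustrative numbers, not measurements of any particular system.

```python
def decode_ceiling_tokens_per_s(num_params, bytes_per_param, bandwidth_bytes_per_s):
    """Upper bound on per-sequence decode speed when weight reads dominate.

    Assumption: every decode step streams all weights from memory once
    and produces one token for the sequence.
    """
    weight_bytes = num_params * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes


# Illustrative numbers only: 70e9 parameters, 16-bit weights, 3e12 bytes/s.
ceiling = decode_ceiling_tokens_per_s(70e9, 2, 3e12)
print(f"at most {ceiling:.1f} tokens/s per sequence")
```

Under these assumptions the ceiling is a little over 21 tokens per second no matter how much raw compute the chip has, which is why memory bandwidth, not FLOPs, is the number to look at for decode.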

## The KV cache

The single most important data structure in LLM inference is the KV cache. During the attention computation, the model produces key and value vectors for every token. Without caching, generating each new token would require recomputing the keys and values for the entire sequence so far, which would make generation quadratically expensive as the sequence grows.

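The KV cache stores those key and value vectors so they only have to be computed once. When you generate token number 500, you reuse the cached keys and values for tokens 1 through 499 and only compute the new ones. The key and value computation, which would otherwise be repeated for the whole sequence at every step (quadratic total work), now happens once per token (linear total work). Attention for each new token still has to read every cached key and value, so the cost per token keeps growing with sequence length, just far more slowly than it would without the cache.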
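The idea is easier to see in a toy. The sketch below is a single attention head in plain Python with tiny, made-up weight matrices (`W_Q`, `W_K`, `W_V` are hypothetical, as are the input vectors). It checks that the cached path produces the same output as recomputing everything, and counts how many key/value projections each path performs.

```python
import math

# Hypothetical 2x2 projection matrices for one attention head.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [0.2, 0.3]]
W_V = [[1.0, 2.0], [0.0, 1.0]]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def attend(query, keys, values):
    """Softmax attention of one query over all keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) * scale for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


class KVCache:
    def __init__(self):
        self.keys, self.values = [], []
        self.projections = 0  # key/value projections computed so far

    def step(self, x):
        # Only the newest token is projected; older entries are reused.
        self.keys.append(matvec(W_K, x))
        self.values.append(matvec(W_V, x))
        self.projections += 1
        return attend(matvec(W_Q, x), self.keys, self.values)


def step_without_cache(xs):
    # Recompute keys and values for the entire sequence so far.
    keys = [matvec(W_K, x) for x in xs]
    values = [matvec(W_V, x) for x in xs]
    return attend(matvec(W_Q, xs[-1]), keys, values), len(xs)


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
cache = KVCache()
uncached_projections = 0
for t in range(1, len(tokens) + 1):
    cached_out = cache.step(tokens[t - 1])
    plain_out, work = step_without_cache(tokens[:t])
    uncached_projections += work
    assert all(math.isclose(a, b) for a, b in zip(cached_out, plain_out))

print(cache.projections, uncached_projections)
```

For four tokens the cached path does 4 projections and the uncached path does 1 + 2 + 3 + 4 = 10. The gap widens quickly as the sequence grows.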

The catch is that the KV cache consumes memory, and it grows with both the sequence length and the batch size. For a large model serving long contexts to many users at once, the KV cache can easily consume more memory than the model weights themselves. A great deal of inference optimization work, things like paged attention, multi-query attention, and KV cache quantization, exists specifically to make the KV cache cheaper to store and faster to access.
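A rough size estimate follows directly from what the cache stores: one key vector and one value vector per token, per layer, per KV head. The sketch below assumes a standard transformer layout. The model shape (32 layers, head dimension 128, 16-bit values) is hypothetical and chosen only to make the arithmetic easy to follow.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2):
    """Approximate KV cache size for a standard transformer decoder."""
    # The leading 2 accounts for storing both a key and a value vector.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


GIB = 2 ** 30

# Hypothetical model: 32 layers, head_dim 128, 16-bit cache entries,
# serving 16 sequences of 4096 tokens each.
full_heads = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=16)
fewer_heads = kv_cache_bytes(32, 8, 128, seq_len=4096, batch_size=16)

print(f"32 KV heads: {full_heads / GIB:.0f} GiB")
print(f" 8 KV heads: {fewer_heads / GIB:.0f} GiB")
```

With these numbers the cache alone is 32 GiB, and cutting the number of KV heads from 32 to 8 brings it down to 8 GiB. That is the lever that multi-query and related attention variants pull; quantizing the cache shrinks `bytes_per_value` instead.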

## The inference pipeline, end to end

Putting the pieces together, here is what happens when a request hits an inference server:

1. **Tokenization.** The input text is converted into token IDs using the model's tokenizer. This is fast but not free, and it typically happens on the CPU.
2. **Scheduling and batching.** The server groups your request with others to use the hardware efficiently. Modern servers use continuous batching, which lets new requests join an in-flight batch rather than waiting for the whole batch to finish.
3. **Prefill.** The batched prompts are processed, the KV cache is populated, and the first token is produced.
4. **Decode loop.** Tokens are generated one at a time until the model emits a stop token or hits the maximum length. Tokens are typically streamed back to the client as they are produced.
5. **Detokenization.** Token IDs are converted back to text before being sent to the user.

Most of the latency a user perceives comes from steps 3 and 4. Most of the cost comes from step 4 running across many requests. Everything else is overhead that good serving stacks keep small.
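The steps above can be sketched as a toy pipeline. Everything in it is a stand-in: the vocabulary, the whitespace tokenizer, and the lookup table that plays the role of the model are all hypothetical, and scheduling and batching are left out entirely. The point is only the control flow: tokenize, prefill once, loop in decode until a stop token or the length limit, and detokenize each piece as it is produced.

```python
from typing import Iterator, List, Tuple

VOCAB = ["<eos>", "what", "is", "inference", "running", "a", "trained", "model"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
EOS_ID = TOKEN_TO_ID["<eos>"]

# Stand-in for a model: the next token depends only on the last token.
_NEXT_WORD = {"what": "is", "inference": "is", "is": "running", "running": "a",
              "a": "trained", "trained": "model", "model": "<eos>"}
NEXT_ID = {TOKEN_TO_ID[k]: TOKEN_TO_ID[v] for k, v in _NEXT_WORD.items()}


def tokenize(text: str) -> List[int]:
    return [TOKEN_TO_ID[word] for word in text.lower().split()]


def detokenize(ids: List[int]) -> str:
    return " ".join(VOCAB[i] for i in ids)


def prefill(prompt_ids: List[int]) -> Tuple[int, List[int]]:
    """Process the whole prompt at once; return first token and the cache."""
    cache = list(prompt_ids)  # stand-in for the KV cache
    return NEXT_ID[cache[-1]], cache


def decode(first_id: int, cache: List[int], max_new_tokens: int) -> Iterator[int]:
    """Yield one token per step until a stop token or the length limit."""
    token = first_id
    for _ in range(max_new_tokens):
        if token == EOS_ID:
            return
        yield token
        cache.append(token)
        token = NEXT_ID[cache[-1]]


def generate(prompt: str, max_new_tokens: int = 16) -> str:
    first_id, cache = prefill(tokenize(prompt))
    # Each piece is detokenized as it arrives, as a streaming server would.
    pieces = [detokenize([tok]) for tok in decode(first_id, cache, max_new_tokens)]
    return " ".join(pieces)


print(generate("what is inference"))
```

A real server differs in every detail, but the shape is the same: one prefill call per request, then a decode loop whose iteration count is set by the output length.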

## The metrics that actually matter

If you only track one number, you will optimize the wrong thing. Inference performance is multi-dimensional, and these are the metrics worth watching:

- **Time-to-first-token (TTFT).** How long from sending the request until the first output token arrives. Dominated by prefill and by queueing delay. This is what makes an application feel responsive or sluggish, because it is the dead air before anything happens.
- **Inter-token latency (ITL), or time-per-output-token.** How long between each generated token during decode. This determines how fast the response appears to stream. Anything faster than human reading speed feels smooth.
- **Tokens per second (TPS).** The throughput of generation, sometimes measured per request and sometimes aggregated across the whole server. Per-request TPS and total-server TPS are different numbers and people conflate them constantly.
- **Throughput.** Total tokens or requests the system handles per unit time across all users. This is what determines your cost per token at scale.
- **Latency percentiles (p50, p95, p99).** Averages hide the tail. A system with a great median and a terrible p99 will produce a steady stream of frustrated users even if it looks fine on a dashboard.
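Two of these points are easy to demonstrate with a few lines of code: how an average hides the tail, and how per-request TPS differs from server TPS. The sketch below uses a nearest-rank percentile and made-up measurements (nine fast requests and one slow one, then four concurrent requests). The function names are illustrative, not from any library.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value covering pct percent of samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def tps_summary(requests):
    """requests: list of (start_s, end_s, output_tokens) tuples."""
    per_request = [tokens / (end - start) for start, end, tokens in requests]
    window = max(end for _, end, _ in requests) - min(start for start, _, _ in requests)
    server = sum(tokens for _, _, tokens in requests) / window
    return per_request, server


# Made-up TTFT samples in seconds: nine fast requests, one slow one.
ttfts = [0.2] * 9 + [2.0]
mean_ttft = sum(ttfts) / len(ttfts)
print(f"mean={mean_ttft:.2f}s p50={percentile(ttfts, 50)}s p99={percentile(ttfts, 99)}s")

# Four made-up concurrent requests, each producing 200 tokens in 10 seconds.
per_request, server = tps_summary([(0.0, 10.0, 200)] * 4)
print(f"per-request TPS={per_request[0]:.0f}, server TPS={server:.0f}")
```

The mean TTFT here is 0.38 seconds, which looks fine, while one user in ten waited two full seconds. Likewise, each user sees 20 tokens per second while the server as a whole produces 80; quoting one number when you mean the other is the conflation described above.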

There is a fundamental tension between latency and throughput. Larger batches improve throughput because the hardware does more useful work per memory read, but they can increase latency for any individual request because of queueing and contention. Tuning a serving system is largely the work of finding the right balance for your workload.
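A toy model shows the shape of this trade-off. It assumes each decode step takes the longer of two things: a fixed time to read the weights, and a compute time that grows with batch size. The two constants are invented for illustration, and the model ignores queueing, KV cache reads, and scheduling overhead, all of which matter in a real system.

```python
def decode_step(batch_size, weight_read_s=0.020, compute_per_seq_s=0.001):
    """Return (seconds per step, tokens per second across the batch).

    Toy assumption: a step is limited either by reading the weights once
    or by the compute needed for every sequence in the batch.
    """
    step_s = max(weight_read_s, batch_size * compute_per_seq_s)
    return step_s, batch_size / step_s


for batch in (1, 8, 32, 128):
    step_s, throughput = decode_step(batch)
    print(f"batch={batch:>3}  step={step_s * 1000:6.1f} ms  throughput={throughput:7.1f} tok/s")
```

In this model, small batches get extra throughput almost for free because the weight read is shared. Past the crossover point, throughput flattens and every additional sequence makes each token arrive later for everyone in the batch.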

## The hardware that runs inference

Inference can run on several kinds of hardware, and the right choice depends on the model size, the latency target, and the budget.

**GPUs** are the default for most serious LLM inference. Their high memory bandwidth is well matched to the memory-bound decode phase, and their parallelism suits prefill. Data-center GPUs with large, fast memory are what most providers run.

**CPUs** can handle inference for small models or low-throughput workloads, and they are everywhere, which makes them convenient for prototyping or edge deployment. They struggle with large models because they lack the memory bandwidth and parallel throughput that decode demands.

**Specialized accelerators**, including custom ASICs built specifically for inference, target the parts of the workload that general-purpose GPUs handle less efficiently. Because they are designed for one job rather than the full range of machine learning workloads, they can push latency and throughput well beyond what comparable general-purpose hardware achieves. This is the approach we take at General Compute: purpose-built silicon aimed squarely at fast inference rather than training.

The practical point is that hardware choice is downstream of your requirements. A batch job that processes documents overnight has very different needs than a voice agent that must respond within a few hundred milliseconds.

## Production trade-offs

Once you move past a demo and into production, a set of recurring trade-offs shows up.

**Latency versus cost.** You can almost always make inference faster by throwing more hardware at it or running smaller batches, and you can almost always make it cheaper by batching aggressively and accepting higher latency. Where you land depends on what your application needs. A real-time assistant lives or dies on latency. An offline summarization pipeline can favor throughput.

**Model size versus quality.** A bigger model is usually more capable and always more expensive to serve. Quantization (running the model at lower numerical precision, such as INT8 or FP8) and distillation (training a smaller model to mimic a larger one) are the standard tools for getting most of the quality at a fraction of the cost. Whether the quality loss is acceptable is something you have to measure on your own task, not assume.
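To show what lower precision means mechanically, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is one simple scheme among many; production stacks typically use finer-grained scales and calibration, and the weights below are made up.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127 if largest > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    return [q * scale for q in quantized]


weights = [0.123, -0.5, 0.337, 1.27, -1.004]  # made-up values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(quantized)
print(f"scale={scale:.4f}  max abs error={max_error:.4f}")
```

Each weight now fits in one byte instead of two or four, at the price of a rounding error of at most half the scale. Whether that error matters for your task is exactly the thing you have to measure.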

**Context length versus memory.** Longer contexts let the model see more, but the KV cache grows with context length, which limits how many requests you can serve at once. Long-context features are not free, and serving them at scale requires either more memory or techniques that compress the cache.
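The effect on concurrency is simple arithmetic. The sketch below assumes a fixed memory budget, a fixed weight footprint, and a constant KV cost per token, and it ignores activations and fragmentation. The 80 GiB device, 16 GiB of weights, and 0.5 MiB per token are hypothetical figures.

```python
GIB = 2 ** 30
MIB = 2 ** 20


def max_concurrent_sequences(total_memory_bytes, weight_bytes,
                             kv_bytes_per_token, context_len):
    """How many full-length sequences fit in the memory left after weights."""
    free = total_memory_bytes - weight_bytes
    if free <= 0:
        return 0
    return free // (kv_bytes_per_token * context_len)


# Hypothetical setup: 80 GiB device, 16 GiB of weights, 0.5 MiB of KV per token.
for context_len in (4096, 32768):
    fits = max_concurrent_sequences(80 * GIB, 16 * GIB, MIB // 2, context_len)
    print(f"context={context_len:>6} tokens -> {fits} concurrent sequences")
```

With these numbers, going from a 4K to a 32K context drops the number of sequences that fit from 32 to 4. Nothing else about the system changed.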

**Prefix caching.** If your requests share a common prefix, for example a long system prompt reused across every call, caching that prefix's KV state means you do not recompute it on every request. This can dramatically cut TTFT and cost for agent and chat workloads, but only if your request formatting is stable enough to actually hit the cache.
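A toy cache makes the "stable formatting" condition concrete. The sketch below keys on the exact token sequence of the prefix, which is a simplification: real serving stacks generally match at finer granularity. `PrefixCache`, `fake_prefill`, and the token IDs are all hypothetical.

```python
class PrefixCache:
    """Toy cache: maps an exact token prefix to its precomputed KV state."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, prefix_tokens, compute_kv):
        key = tuple(prefix_tokens)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = compute_kv(prefix_tokens)
        return self._store[key]


def fake_prefill(tokens):
    # Stand-in for the expensive prefill pass over the prefix.
    fake_prefill.calls += 1
    return [f"kv({t})" for t in tokens]


fake_prefill.calls = 0

SYSTEM_PROMPT = [101, 102, 103, 104]  # made-up token IDs
cache = PrefixCache()

# Three requests that start with the identical system prompt.
for _ in range(3):
    cache.get_or_compute(SYSTEM_PROMPT, fake_prefill)

# One request where something (say, a timestamp) was put in front of it.
cache.get_or_compute([999] + SYSTEM_PROMPT, fake_prefill)

print(f"hits={cache.hits} misses={cache.misses} prefill calls={fake_prefill.calls}")
```

The first three requests cost one prefill between them. The fourth misses even though it contains the same system prompt, because the tokens in front of it changed. Anything that varies per request belongs after the shared prefix, not before it.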

None of these trade-offs have a universally correct answer. They are dials you set based on what your specific application values, and the act of setting them well is most of what production inference engineering is.

## A minimal example

Because General Compute exposes an OpenAI-compatible API, running inference against a hosted model looks like this:

```python
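from openai import OpenAI

# Point the OpenAI SDK at the OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.generalcompute.com/v1",
    api_key="your-api-key",
)

# stream=True returns an iterator of chunks instead of one final response.
stream = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "Explain inference in one sentence."}],
    stream=True,
)

for chunk in stream:
    # Defensive: some chunks may carry no choices or no text, so skip them.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)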
```

The `stream=True` flag matters here. Streaming sends tokens to the client as they are decoded, which means the user starts reading after the first token rather than waiting for the entire response. It does not make the underlying inference faster, but it changes how fast the result feels, and that perceived latency is what users judge you on.
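If you want to put a number on that perceived latency, you can time the stream yourself. The helper below is a sketch, not part of any SDK: `measure_stream` is a hypothetical name, and it takes a function that opens the stream so the clock starts before the request is sent. It counts chunks rather than tokens, since a chunk is not guaranteed to hold exactly one token. The demo uses a fake stream and a fake clock so the result is deterministic.

```python
import time
from typing import Callable, Dict, Iterable


def measure_stream(open_stream: Callable[[], Iterable[str]],
                   clock: Callable[[], float] = time.perf_counter) -> Dict[str, float]:
    """Time a stream of text chunks: first-chunk latency and total duration."""
    start = clock()  # started before the stream is opened
    first_at = None
    last_at = start
    chunks = 0
    for _piece in open_stream():
        last_at = clock()
        if first_at is None:
            first_at = last_at
        chunks += 1
    if first_at is None:
        raise ValueError("stream produced no chunks")
    return {"ttft_s": first_at - start, "total_s": last_at - start, "chunks": chunks}


# Fake clock readings: request sent at 0.0, chunks arrive at 0.5, 0.6, 0.7.
fake_times = iter([0.0, 0.5, 0.6, 0.7])
result = measure_stream(
    lambda: iter(["Inference ", "is ", "fast."]),
    clock=lambda: next(fake_times),
)
print(result)
```

To use it against a real endpoint, pass a function that makes the streaming call and yields each non-empty `delta.content`, and leave `clock` at its default. Without streaming, the user would wait the full `total_s` before seeing anything; with it, they wait only `ttft_s`.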

## Where to go from here

If you take one thing from this guide, let it be that inference is not a single operation with a single cost. It is a pipeline with two distinct phases, a memory structure that drives most of the optimization work, and a handful of metrics that pull against each other. Understanding that structure is what lets you reason about why your application is slow, what it will cost at scale, and which knob to turn first.

The fastest way to build intuition is to run real workloads and watch the numbers. If you want to see what low-latency inference feels like in practice, you can point the OpenAI SDK at the General Compute API and measure TTFT and tokens per second on your own prompts. The [documentation](https://generalcompute.com) covers the supported endpoints and models, and the API is a drop-in swap for most existing code.