# AI Inference Latency Explained: TTFT, TPS, and How to Optimize Them
What time to first token and tokens per second actually measure, how to measure them correctly, and a layer-by-layer guide to reducing AI inference latency in production.
- Author: General Compute
- Published: 2026-06-12
- Tags: latency, ttft, tokens per second, inference speed, optimization
AI inference latency is one of those topics where everyone uses the same words to mean different things. One team says their model "responds in 200ms" and means time to first token. Another says the same thing and means total request time for a 50-token completion. A third is quoting a p50 from a load test that never exceeded one concurrent request. All three numbers are useful, but if you don't know which one you're looking at, you can't compare providers, set SLOs, or figure out where your time is actually going.
This post defines the metrics that matter, shows how to measure them correctly, and then walks through where latency comes from at each layer of the stack and what you can do about it. The goal is that by the end, you can look at a latency number from any provider or benchmark and know exactly what it does and doesn't tell you.
## The two metrics that matter: TTFT and TPS
LLM inference has an unusual shape compared to most API workloads. A request doesn't return one response after a fixed amount of work. It returns a stream of tokens, and the first token costs a different amount than every subsequent one. That's why a single "latency" number is almost never enough, and why the industry has settled on two primary metrics.
### Time to First Token (TTFT)
TTFT is the time from sending the request to receiving the first generated token. It covers network transit, queueing, and the prefill phase, where the model processes your entire prompt in one large parallel pass to build the KV cache before it can generate anything.
TTFT is what users perceive as responsiveness. In a chat UI, it's the gap before text starts appearing. In a voice agent, it's dead air. Prefill is compute-bound and scales with prompt length, so TTFT for a 200-token prompt and a 20,000-token prompt can differ by an order of magnitude on the same hardware. Any TTFT figure quoted without a prompt length attached is close to meaningless.
### Tokens Per Second (TPS)
TPS measures how fast tokens arrive after the first one. It's sometimes reported as its inverse, inter-token latency (ITL) or time per output token (TPOT). 50 TPS means a new token every 20ms.
Decode, unlike prefill, is memory-bandwidth-bound. Each new token requires loading the model weights from memory, so TPS is largely a function of model size, quantization, hardware memory bandwidth, and how many other requests are sharing the GPU through batching.
TPS determines how long the full response takes and whether streaming text keeps up with reading speed (roughly 5 to 15 TPS covers human reading; agents and pipelines benefit from far more, since nobody is reading intermediate output).
### Putting them together
Total request latency decomposes cleanly:
```
total_latency = TTFT + (output_tokens - 1) / TPS
```
This decomposition tells you where to focus. A summarization endpoint with 8,000-token inputs and 100-token outputs is dominated by TTFT, so prefill optimizations pay off. A code generation endpoint with short prompts and 2,000-token outputs is dominated by decode, so TPS is what matters. Optimizing the wrong side is a common way to spend a month and move the user-facing number by 5%.
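As a quick illustration of the decomposition, the sketch below computes total latency and the share of it spent waiting for the first token, for the two endpoint shapes just described. The TTFT and TPS inputs are invented for illustration, not measurements, and the function names are hypothetical.

```python
def total_latency(ttft_s: float, output_tokens: int, tps: float) -> float:
    """Total request time in seconds: TTFT plus decode time for the remaining tokens."""
    return ttft_s + (output_tokens - 1) / tps


def ttft_share(ttft_s: float, output_tokens: int, tps: float) -> float:
    """Fraction of total request time spent waiting for the first token."""
    return ttft_s / total_latency(ttft_s, output_tokens, tps)


# Hypothetical inputs, chosen only to show the two shapes.
summarization = dict(ttft_s=2.5, output_tokens=100, tps=60.0)     # long prompt, short output
code_generation = dict(ttft_s=0.3, output_tokens=2000, tps=60.0)  # short prompt, long output

for name, req in [("summarization", summarization), ("code generation", code_generation)]:
    print(f"{name}: total {total_latency(**req):.2f}s, TTFT share {ttft_share(**req):.0%}")
```

With these inputs, most of the summarization request is spent before the first token, while the code generation request is almost entirely decode.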
A third metric worth tracking is end-to-end latency at the application level, including retrieval, tool calls, and any pre/post-processing. For agentic workloads that chain multiple model calls, per-call latency compounds: a 10-step agent at 800ms per step is an 8-second task, which is why agent builders care about inference speed more than almost anyone.
## How to measure AI inference latency correctly
Bad measurement produces confident wrong answers. A few rules that prevent most of them:
**Use streaming and timestamp the first token client-side.** If you measure a non-streaming request, you only get total latency and can't separate TTFT from decode. Most OpenAI-compatible APIs support streaming, so the harness is simple:
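```python
import time

from openai import OpenAI

# Create the client once and reuse it so connections are pooled across requests.
client = OpenAI(base_url="https://api.generalcompute.com/v1", api_key="...")

# Placeholder; use a fixed, representative prompt from your own workload.
prompt = "Explain what time to first token measures."

start = time.perf_counter()
first_token_time = None
token_count = 0

stream = client.chat.completions.create(
    model="llama-4-maverick",
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)

for chunk in stream:
    # Skip chunks that carry no generated text (for example, role-only chunks).
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_time is None:
            first_token_time = time.perf_counter()
        # Approximation: counts content chunks, which may not map 1:1 to tokens.
        token_count += 1

end = time.perf_counter()

if first_token_time is None:
    raise RuntimeError("Stream ended without any generated content")

ttft = first_token_time - start
# The first token is covered by TTFT, so decode speed uses the remaining tokens.
tps = (token_count - 1) / (end - first_token_time)
print(f"TTFT: {ttft*1000:.0f}ms, TPS: {tps:.1f}")
```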
**Report percentiles, not averages.** Latency distributions for LLM serving have long tails. Queueing, batch composition, and preemption mean p99 can be 5x p50. Your users experience the tail, and your SLOs should be written against p95 or p99.
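A minimal sketch of turning collected TTFT samples into percentiles, using the nearest-rank method. The sample values are invented to show a long tail, and `percentile` is a hypothetical helper rather than part of any SDK.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Invented TTFT samples in milliseconds: mostly ~200ms with a few slow outliers.
ttft_ms = [210, 190, 205, 220, 198, 202, 215, 230, 195, 208,
           212, 201, 199, 225, 207, 240, 260, 310, 450, 1100]

mean_ms = sum(ttft_ms) / len(ttft_ms)
print(f"mean {mean_ms:.0f}ms, p50 {percentile(ttft_ms, 50)}ms, "
      f"p95 {percentile(ttft_ms, 95)}ms, p99 {percentile(ttft_ms, 99)}ms")
```

In this sample the mean sits above the median and well below the tail, so it describes neither the typical request nor the slow ones.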
**Control prompt and output length.** TTFT scales with input tokens, and total time scales with output tokens. Comparing two providers with different prompts, or letting the model decide output length, makes the numbers incomparable. Fix the prompt and set `max_tokens`.
**Measure under realistic concurrency.** Single-request benchmarks measure the best case. Production systems batch requests, and batching trades per-request TPS for aggregate throughput. A provider showing 200 TPS at concurrency 1 might deliver 60 TPS to each user at concurrency 64. Run your load test at the concurrency you actually expect.
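One way to hold concurrency fixed during a load test is a semaphore around the request function. The sketch below assumes you have an async function that performs one streamed request and returns its TTFT. `simulated_request` is a stand-in that only sleeps, and `run_load` is a hypothetical helper name.

```python
import asyncio
import time


async def simulated_request(request_id: int) -> float:
    """Stand-in for one streamed request; returns seconds until the 'first token'.

    Replace the sleep with a real streaming call and a first-token timestamp.
    """
    start = time.perf_counter()
    await asyncio.sleep(0.01)  # pretend network + queueing + prefill
    return time.perf_counter() - start


async def run_load(request_fn, total_requests: int, concurrency: int) -> list:
    """Issue total_requests calls with at most `concurrency` in flight at once."""
    limiter = asyncio.Semaphore(concurrency)

    async def one(i: int) -> float:
        # Time spent waiting for the limiter is client-side and is not
        # counted, because request_fn starts its own timer.
        async with limiter:
            return await request_fn(i)

    return list(await asyncio.gather(*(one(i) for i in range(total_requests))))


ttfts = asyncio.run(run_load(simulated_request, total_requests=32, concurrency=8))
print(f"{len(ttfts)} requests, worst TTFT {max(ttfts) * 1000:.0f}ms")
```

Running the same harness at several concurrency levels and comparing the TTFT and TPS distributions shows how much per-request speed the provider trades away as load rises.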
**Measure from where your users are.** A benchmark run from a VM in the same region as the inference endpoint hides 50 to 150ms of real-world network time. That can be a third of your TTFT budget.
## Where the time goes: a layer-by-layer breakdown
Once you can measure, the next question is where the milliseconds are hiding. Working from the outside in:
### Network and connection setup
Before any inference happens, you pay for DNS, TCP, and TLS handshakes, plus transit time. Cross-continent round trips alone cost 100 to 250ms. Fixes are standard web engineering: reuse connections (every serious SDK does this if you reuse the client object; creating a new client per request is a surprisingly common mistake), pick endpoints close to your users or your backend, and keep request payloads lean.
### Queueing
When a serving system is saturated, requests wait before any GPU touches them. Queue time shows up entirely in TTFT and is invisible in single-request tests, which is exactly why p99 TTFT explodes under load while p50 looks fine. On the provider side this is fixed with capacity and smarter scheduling. On your side, watch for it: if TTFT degrades as your traffic grows while TPS holds steady, you're queueing.
### Prefill
Prefill cost grows with prompt length, and at very long contexts it grows roughly quadratically because of attention. FlashAttention and similar kernels cut the memory traffic and the constant factor considerably, though not the quadratic scaling itself. The levers:
- **Shorten the prompt.** The cheapest optimization available. Audit your system prompts and few-shot examples; most have accumulated cruft that no eval would miss.
- **Prefix caching.** If many requests share a prefix (a system prompt, a document, conversation history), the serving layer can reuse the KV cache for the shared portion instead of recomputing it. For chat and agent workloads with a 2,000-token system prompt, this routinely cuts TTFT by 50% or more. GeneralCompute applies this automatically for repeated prefixes.
- **Chunked prefill.** Serving frameworks like vLLM and Sarathi-Serve split long prefills into chunks interleaved with decode steps, which stops one user's giant prompt from stalling everyone else's token stream. This is a serving-layer concern, but it's worth knowing whether your provider does it, because it's the difference between stable and spiky inter-token latency under mixed load.
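A back-of-the-envelope way to size the prefix-caching win, assuming prefill time is roughly linear in the number of uncached prompt tokens. That is a simplification: real TTFT includes fixed network and queueing time (modeled here as an optional `fixed_overhead_s`), and attention cost grows faster than linearly at long contexts. The function name is hypothetical, and the inputs are the figures from the worked example later in this post.

```python
def estimated_ttft_with_cache(baseline_ttft_s, prompt_tokens, cached_prefix_tokens,
                              fixed_overhead_s=0.0):
    """Estimate TTFT when the first cached_prefix_tokens of the prompt hit the KV cache.

    Assumes prefill time scales linearly with the tokens that still need to be
    processed; fixed_overhead_s is time that caching does not remove.
    """
    if not 0 <= cached_prefix_tokens <= prompt_tokens:
        raise ValueError("cached prefix must be between 0 and prompt_tokens")
    prefill_s = baseline_ttft_s - fixed_overhead_s
    uncached_fraction = (prompt_tokens - cached_prefix_tokens) / prompt_tokens
    return fixed_overhead_s + prefill_s * uncached_fraction


# 3,000-token prompt, shared 1,200-token system prompt, 1.4s baseline TTFT.
print(f"{estimated_ttft_with_cache(1.4, 3000, 1200):.2f}s")
```

Under the linear assumption, the saving is proportional to the share of the prompt that is cached, so the estimate is most useful for deciding whether restructuring prompts to share a longer prefix is worth the effort.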
### Decode
Decode speed is bounded by how fast the hardware can stream weights and KV cache from memory. The levers:
- **Smaller or sparser models.** An 8B model decodes several times faster than a 70B on the same hardware. MoE models get you large-model quality with a fraction of active parameters per token. The right question is rarely "what's the best model" but "what's the fastest model that passes my evals."
- **Quantization.** FP8 or INT4 weights mean fewer bytes per forward pass, which directly raises TPS, usually with negligible quality loss at 8-bit and modest loss at 4-bit. Quantizing the KV cache helps long-context decode for the same reason.
- **Speculative decoding.** A small draft model proposes several tokens that the large model verifies in one pass. When the draft is accepted, you get multiple tokens for one large-model step. Speedups of 2 to 3x are realistic on predictable text like code.
- **Better hardware.** Decode is bandwidth-bound, so an H200 (4.8 TB/s) outpaces an H100 (3.35 TB/s) even though the two have essentially the same compute, and purpose-built inference silicon goes further. This is the layer where GeneralCompute spends most of its effort: custom ASICs designed around memory bandwidth and scheduling for token generation rather than training-oriented FLOPs.
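The bandwidth argument gives a rough ceiling on single-request decode speed: if every generated token has to read all active weights once, TPS cannot exceed memory bandwidth divided by weight bytes. The sketch below is an idealization. It ignores KV-cache reads, batching, kernel overheads, and the fact that larger models are sharded across several devices, so real numbers will be lower. The model sizes are illustrative and the function name is hypothetical.

```python
def decode_tps_ceiling(active_params_billions: float, bytes_per_param: float,
                       bandwidth_tb_s: float) -> float:
    """Upper bound on single-request TPS when decode is purely bandwidth-bound."""
    weight_bytes = active_params_billions * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / weight_bytes


H100_TB_S = 3.35  # memory bandwidth figures quoted in the text
H200_TB_S = 4.8

for label, params_b, bytes_pp in [
    ("70B at 16-bit", 70, 2.0),
    ("70B at 8-bit", 70, 1.0),
    ("70B at 4-bit", 70, 0.5),
    ("8B at 16-bit", 8, 2.0),
]:
    h100 = decode_tps_ceiling(params_b, bytes_pp, H100_TB_S)
    h200 = decode_tps_ceiling(params_b, bytes_pp, H200_TB_S)
    print(f"{label}: H100 ceiling {h100:.0f} TPS, H200 ceiling {h200:.0f} TPS")
```

The ceiling makes the levers comparable: halving bytes per parameter, halving active parameters, and doubling bandwidth all raise the bound by the same factor.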
### Batching and scheduling
Batching is the throughput-versus-latency dial. Larger batches amortize weight loads across more requests, raising total system throughput while lowering each request's TPS. Continuous batching (admitting and retiring requests at every iteration rather than batch boundaries) gives most of the throughput benefit while keeping latency reasonable, and is table stakes for modern serving. If you self-host, expect to tune max batch size and watch the p99; if you use a managed API, this tuning is the provider's problem, which is a real and underrated part of what you're paying for.
### The application layer
Finally, the latency your users feel includes everything around the model call. The fixes here are often the largest wins available:
- **Stream to the user.** Streaming doesn't reduce latency, but it moves perceived latency from total time to TTFT, which is usually a 5 to 10x improvement in how fast the product feels.
- **Parallelize independent calls.** Agents and RAG pipelines frequently make sequential calls that have no data dependency. Run them concurrently.
- **Cap output length.** Tokens you don't generate are the fastest tokens. Tight `max_tokens` and prompts that ask for concise output cut total latency directly.
- **Route by difficulty.** Send easy requests to a small fast model and hard ones to a large model. Even a simple heuristic router shifts most traffic to the fast path.
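To make the parallelization point concrete, here is a small sketch using `asyncio.gather`. `fake_call` is a hypothetical stand-in that only sleeps; in a real pipeline each entry would be a retrieval, tool, or model call with no dependency on the others.

```python
import asyncio
import time


async def fake_call(name: str, seconds: float) -> str:
    """Hypothetical stand-in for a retrieval, tool, or model call."""
    await asyncio.sleep(seconds)
    return name


CALLS = [("retrieve_docs", 0.1), ("classify_intent", 0.1), ("fetch_profile", 0.1)]


async def run_sequentially() -> list:
    # Each call waits for the previous one, so the durations add up.
    return [await fake_call(name, s) for name, s in CALLS]


async def run_concurrently() -> list:
    # gather starts all calls at once and returns results in argument order.
    return list(await asyncio.gather(*(fake_call(name, s) for name, s in CALLS)))


def timed(coro_fn):
    """Run an async function to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn())
    return result, time.perf_counter() - start


sequential_result, sequential_s = timed(run_sequentially)
concurrent_result, concurrent_s = timed(run_concurrently)
print(f"sequential {sequential_s:.2f}s, concurrent {concurrent_s:.2f}s")
```

The concurrent version takes about as long as the slowest call rather than the sum of all of them, which is the whole benefit when the calls are independent.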
## A worked example
Say you run a RAG-backed support chat: 3,000-token prompts (system prompt plus retrieved context), 300-token answers, and users complain it feels slow. You measure: TTFT p50 is 1.4s, TPS is 45. Total time is 1.4 + 299/45 ≈ 8.0 seconds.
The decomposition says TTFT is about 17% of total time and decode is about 83%, but once the response is streamed, perceived latency is just the TTFT. So you do three things: enable prefix caching on the shared 1,200-token system prompt (TTFT drops to roughly 800ms), trim retrieved context from five chunks to three after checking your evals don't regress (TTFT now ~600ms), and turn on streaming in the UI. Perceived latency goes from 8 seconds to about 0.6, and you haven't touched the model, the hardware, or the provider.
That's the general pattern: measure, decompose, then fix the layer that actually dominates.
## Reference numbers
Rough targets for production systems in 2026, measured at realistic concurrency from the user's region:
| Use case | TTFT target | TPS target |
|---|---|---|
| Voice agents | < 200ms | 100+ |
| Interactive chat | < 500ms | 30 to 50 |
| Coding assistants | < 400ms | 80+ |
| Agentic pipelines | < 300ms per step | 100+ |
| Batch/offline | irrelevant | maximize throughput per dollar |
These are achievable today on open models. If your current provider can't hit them, the bottleneck is the serving stack, not the model.
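If you want to check measurements against these targets automatically, the table can be encoded as thresholds. This is a sketch: the dictionary keys and the `meets_target` helper are hypothetical names, the chat TPS target uses the lower end of the 30 to 50 range, and which percentile you pass in is your choice.

```python
# Targets from the table above: (maximum TTFT in ms, minimum TPS).
TARGETS = {
    "voice_agent": (200, 100),
    "interactive_chat": (500, 30),
    "coding_assistant": (400, 80),
    "agentic_step": (300, 100),
}


def meets_target(use_case: str, ttft_ms: float, tps: float) -> bool:
    """Return True if the measured TTFT and TPS satisfy the target for use_case."""
    max_ttft_ms, min_tps = TARGETS[use_case]
    return ttft_ms < max_ttft_ms and tps >= min_tps


# Example: the same measurement passes for chat but not for a voice agent.
print(meets_target("interactive_chat", ttft_ms=450, tps=45))
print(meets_target("voice_agent", ttft_ms=450, tps=45))
```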
## Try it yourself
The measurement harness above works against any OpenAI-compatible endpoint, so the fastest way to ground this post in your own numbers is to run it against your current provider and compare. GeneralCompute's API is OpenAI-compatible and built on custom inference hardware specifically to push TTFT and TPS past what general-purpose GPUs deliver. Point the script at [generalcompute.com](https://generalcompute.com), use your real prompts, and check the percentiles, not the averages.