Real-Time AI Inference: How to Achieve <100ms Latency in Production

General Compute

Getting AI inference under 100ms in production is not magic. It is the result of making good decisions at each layer of the stack, from the model you choose down to where you physically place your servers. This guide walks through every layer, explains the trade-offs, and gives you concrete numbers to work with.

Why 100ms Is the Right Target

Human perception research has established that users start noticing latency around 100ms. Below that threshold, an interaction feels immediate. Above it, there is a perceptible delay -- and for voice AI, coding assistants, and interactive applications, a perceptible delay breaks the experience.

For voice specifically, the constraint is tighter. A voice agent needs to receive audio, run transcription, call an LLM, and synthesize speech, all within a window of roughly 500-800ms before the response feels natural. Each component in that pipeline has a budget, and the LLM call typically gets 100-200ms of it.

For code completion, the threshold matters too. Tools like GitHub Copilot have shown that if suggestions take more than 300ms to appear, acceptance rates drop significantly. At sub-100ms, the suggestion arrives while the developer is still thinking, which is when it actually helps.

So 100ms is not an arbitrary target. It is the latency boundary between a feature that feels fast and one that feels slow.

Where Latency Actually Comes From

Before optimizing anything, it helps to know where time is spent. A production LLM inference request has several distinct phases:

Network latency: The trip from your user to your inference server. Speed of light is non-negotiable, but server placement is not.

Prefill: The model processes the entire input prompt and generates the first output token. This scales with prompt length, is compute-bound, and is the main server-side contributor to time-to-first-token (TTFT).

Decode (token generation time, TGT): Each subsequent token is generated one at a time, reading from the KV cache. This is memory-bandwidth-bound.

Network back: The first token (and subsequent streaming tokens) travel back to the client.

For a 100ms total budget, a realistic breakdown for a short-prompt completion might be: 10ms network one-way, 40ms prefill, 10ms to decode and emit the first token, 10ms network return. That adds up to 70ms, and the remaining 30ms has to absorb queuing, tokenization, TLS, and jitter. There is very little slack. Every component matters.
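Summing the example figures makes the remaining headroom explicit. The sketch below is only arithmetic on the numbers quoted above; the phase names are labels chosen for illustration, and anything not listed (queuing, tokenization, TLS) has to fit into whatever is left.

```python
# Example per-phase figures from the breakdown above, in milliseconds.
BUDGET_MS = 100
phases_ms = {
    "network_out": 10,
    "prefill": 40,
    "first_token_decode": 10,
    "network_back": 10,
}

def remaining_budget(budget_ms, phases):
    """Return (time spent, headroom left) for a set of phase latencies."""
    spent = sum(phases.values())
    return spent, budget_ms - spent

spent, headroom = remaining_budget(BUDGET_MS, phases_ms)
print(f"spent={spent}ms headroom={headroom}ms")
```

Swapping in your own measured values for each phase shows quickly which one is eating the budget.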

Model Selection: The Biggest Lever

The single biggest factor in inference latency is model size. A 7B parameter model runs roughly 10x faster than a 70B model on the same hardware. If you need sub-100ms responses, you almost certainly need to be running a model in the 1B to 14B range for most requests.
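A back-of-envelope way to see where the roughly 10x figure comes from: during decode, every weight is read from memory once per token, so the time per token cannot be lower than weight bytes divided by memory bandwidth. The sketch below assumes BF16 weights (2 bytes per parameter) and a bandwidth of about 3,350 GB/s, which is roughly an H100 SXM's peak. Both are assumptions for illustration, and the estimate ignores KV cache reads, kernel overhead, and the fact that a 70B BF16 model would not fit on a single such GPU.

```python
def decode_floor_ms(params_billion, bytes_per_param, bandwidth_gb_s):
    """Lower bound on per-token decode time when every weight is read once per token."""
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes/param = GB
    return weight_gb / bandwidth_gb_s * 1000

HBM_GB_S = 3350  # assumed peak memory bandwidth, roughly an H100 SXM

small = decode_floor_ms(7, 2, HBM_GB_S)   # 7B model at BF16
large = decode_floor_ms(70, 2, HBM_GB_S)  # 70B model at BF16
print(f"7B: {small:.1f} ms/token, 70B: {large:.1f} ms/token, ratio {large / small:.0f}x")
```

Measured numbers will be higher than this floor, but the ratio between model sizes tends to hold because both are limited by the same memory system.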

This does not mean sacrificing quality across the board. It means routing intelligently:

  • Use a small, fast model (1-7B) for simple completions, classification, and short-context tasks.
  • Reserve a larger model (30-70B) for complex reasoning tasks where you have more latency budget.
  • Use cascade routing to try the small model first and escalate only when confidence is low.
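Cascade routing can be sketched in a few lines. This is a minimal illustration, not a production router: it assumes each model is a callable returning a `(text, confidence)` pair, and the `fake_small` and `fake_large` functions and the 0.8 threshold are hypothetical stand-ins. How you derive a confidence score (token log-probabilities, a separate classifier) is left open.

```python
from typing import Callable, Tuple

# A model here is any callable returning (text, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]

def cascade(prompt: str, small: Model, large: Model,
            threshold: float = 0.8) -> Tuple[str, str]:
    """Try the small model first; escalate only if its confidence is below threshold."""
    text, confidence = small(prompt)
    if confidence >= threshold:
        return text, "small"
    text, _ = large(prompt)
    return text, "large"

# Stand-in models for illustration only.
def fake_small(prompt):
    return ("short answer", 0.95 if len(prompt) < 40 else 0.3)

def fake_large(prompt):
    return ("detailed answer", 0.99)

print(cascade("What is 2+2?", fake_small, fake_large))
print(cascade("Explain the trade-offs of this design in detail, please.", fake_small, fake_large))
```

Note that an escalated request pays for both calls, so the cascade only helps overall latency when most traffic stays on the small model.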

Model architecture also matters beyond parameter count. Models with grouped-query attention (GQA) have smaller KV caches, which means faster decode. Models using sliding window attention handle long contexts without quadratic prefill scaling. When selecting a model, check whether it uses GQA -- most modern models do, but it is worth confirming.
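To make the GQA point concrete, the KV cache for one sequence holds a key and a value vector per layer, per KV head, per token. The sketch below computes that size for an illustrative configuration (32 layers, head dimension 128, 4,096 tokens, 2 bytes per value); the numbers are assumptions chosen for round results, not the spec of any particular model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """Bytes of KV cache for one sequence: keys and values for every layer and token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value

# Full multi-head attention: one KV head per query head (32 here).
mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, tokens=4096)
# Grouped-query attention: 8 KV heads shared across the 32 query heads.
gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, tokens=4096)

print(f"MHA: {mha / 2**20:.0f} MiB, GQA: {gqa / 2**20:.0f} MiB")
```

Since decode reads the cache on every step, a 4x smaller cache means proportionally less memory traffic per token for that part of the work.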

Speculative decoding is another architecture-level technique worth considering. A small draft model generates several candidate tokens, and the large model verifies them in parallel. If the candidates match what the large model would have generated (which happens often for predictable text), you get multiple tokens for roughly the cost of one. Practical speedups are typically 2-3x for conversational tasks.
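The draft-then-verify loop can be illustrated with toy "models" that map a token list to the next token. This is a sketch of the greedy variant only (production systems with sampling use a rejection-sampling acceptance rule instead), and `target_next` and `draft_next` are hypothetical stand-ins. The verification step is written as a Python loop here; on a GPU it would be a single batched forward pass of the large model.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding; output matches plain greedy decoding with the target."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2. The target model scores all k+1 positions (one batched pass in practice).
        target_passes += 1
        verified = [target_next(out + draft[:i]) for i in range(k + 1)]
        # 3. Keep the longest agreeing prefix, then the target's own token at the
        #    first disagreement (or the bonus token if all k drafts matched).
        n_ok = 0
        while n_ok < k and draft[n_ok] == verified[n_ok]:
            n_ok += 1
        out.extend(draft[:n_ok] + [verified[n_ok]])
    return out[:len(prompt) + max_new_tokens], target_passes

def target_next(seq):
    return (seq[-1] + 1) % 10

def draft_next(seq):
    # Agrees with the target except after a 7, to force an occasional rejection.
    return 0 if seq[-1] == 7 else (seq[-1] + 1) % 10

tokens, passes = speculative_decode(target_next, draft_next, [0], max_new_tokens=12)
print(tokens, "target passes:", passes)
```

The output is identical to decoding with the target alone, but the target runs far fewer times than the number of tokens produced. The speedup depends entirely on how often the draft agrees with the target.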

Quantization: Faster Inference, Smaller Footprint

Quantization reduces the precision of model weights from 16-bit (bfloat16) down to 8-bit (INT8) or 4-bit (INT4). The effect on latency is significant because lower precision means smaller memory footprint and faster memory reads during decode.
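The core idea can be shown with the simplest possible scheme: symmetric per-tensor INT8 quantization, where one scale factor maps the largest weight to 127. This is a sketch for intuition only. Real methods such as GPTQ and AWQ use per-group scales and calibration data, and the weight values below are made up.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 if all zeros
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integers and the shared scale."""
    return [v * scale for v in q]

weights = [0.12, -0.5, 0.33, 0.9, -0.07]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)
print(f"max round-trip error: {max_error:.5f} (bound: {scale / 2:.5f})")
```

Each weight now occupies one byte instead of two, and the rounding error per weight is bounded by half the scale. That is why a single outlier weight, which inflates the scale, hurts quality under naive schemes.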

Here is what the trade-offs look like in practice:

| Format | Memory reduction | Quality impact | Recommended use |
|--------|-----------------|----------------|-----------------|
| BF16 | baseline | none | highest-quality production |
| INT8 | ~50% | minimal (<1%) | standard production |
| INT4 (GPTQ/AWQ) | ~75% | noticeable (2-5%) | latency-critical, lower stakes |
| FP8 | ~50% | minimal | new hardware (H100/H200) |

For latency-critical applications, INT8 is usually the right choice. You get meaningful memory savings and the quality difference is negligible for most tasks. INT4 makes sense when you need to fit a larger model on available VRAM, or when you need maximum throughput and can tolerate slightly lower quality.

AWQ (Activation-Aware Weight Quantization) is generally better than GPTQ for quality at INT4. If you are going to quantize aggressively, use AWQ.

The KV Cache and Batching

The KV cache stores computed key-value pairs for each token in the input, so the model does not recompute them during each decode step. Managing it well is critical for both latency and throughput.

Prefix caching is one of the highest-ROI optimizations if your workload includes repeated system prompts. If 100 requests all start with the same 500-token system prompt, the KV cache for those 500 tokens can be shared. Prefill cost drops to near-zero for subsequent requests with the same prefix. Most production serving frameworks (vLLM, TensorRT-LLM) support this.
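A toy model shows why the savings are so large. The `PrefixCache` class below is hypothetical and only counts how many prompt tokens would still need prefill. Real serving frameworks cache KV blocks of a fixed size rather than every individual prefix, and they evict entries under memory pressure.

```python
class PrefixCache:
    """Toy model of prefix caching: counts how many prompt tokens still need prefill."""

    def __init__(self):
        self._cached = set()  # token prefixes whose KV entries are assumed resident

    def tokens_to_prefill(self, tokens):
        tokens = tuple(tokens)
        # Find the longest already-cached prefix of this prompt.
        hit = 0
        for n in range(len(tokens), 0, -1):
            if tokens[:n] in self._cached:
                hit = n
                break
        # Record every prefix of this prompt as cached for later requests.
        for n in range(1, len(tokens) + 1):
            self._cached.add(tokens[:n])
        return len(tokens) - hit

system = list(range(500))  # stand-in for a 500-token system prompt
cache = PrefixCache()
first = cache.tokens_to_prefill(system + [1001, 1002])
second = cache.tokens_to_prefill(system + [2001, 2002, 2003])
print(first, second)
```

The first request pays for the full prompt; the second only pays for its three new tokens because the shared 500-token prefix is already cached.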

Continuous batching (also called iteration-level scheduling) allows the inference server to add new requests to a running batch between decode steps, rather than waiting for an entire batch to finish. This dramatically improves GPU utilization and reduces queuing latency for incoming requests.
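The effect on queuing is easy to see in a small simulation. The sketch below compares static and continuous batching under simplifying assumptions: every request needs at least one decode step, all requests are queued at time zero, and a decode step costs the same regardless of how many slots are occupied. The request lengths are made up.

```python
from collections import deque

def static_batching(lengths, slots):
    """Each batch runs until its longest request finishes; results return together."""
    finish, clock = [], 0
    for i in range(0, len(lengths), slots):
        batch = lengths[i:i + slots]
        clock += max(batch)
        finish.extend([clock] * len(batch))
    return finish

def continuous_batching(lengths, slots):
    """Free slots are refilled from the queue between decode steps."""
    queue = deque(enumerate(lengths))
    active = {}                      # request index -> decode steps remaining
    finish = [0] * len(lengths)
    clock = 0
    while queue or active:
        while queue and len(active) < slots:
            idx, n = queue.popleft()
            active[idx] = n
        clock += 1                   # one decode step for every active request
        for idx in list(active):
            active[idx] -= 1
            if active[idx] == 0:
                finish[idx] = clock
                del active[idx]
    return finish

lengths = [2, 8, 2, 2, 8, 2]         # decode steps needed per request
static = static_batching(lengths, slots=2)
continuous = continuous_batching(lengths, slots=2)
print("static:    ", static)
print("continuous:", continuous)
```

Short requests no longer wait behind long ones in the same batch, and the whole workload finishes in fewer steps because slots never sit idle.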

For latency-critical workloads, you want batch sizes to be small (sometimes 1). Large batches maximize throughput but increase per-request latency: each decode step takes longer as more sequences share it, and under static batching a request also waits for the rest of its batch to finish. Tune batch size to your latency target, not to GPU utilization.

Chunked prefill separates the prefill phase into smaller chunks interleaved with decode steps. For a long prompt, this prevents the prefill from blocking decode for other in-flight requests. The result is more predictable latency across the board.

Parallelism and Hardware Configuration

How you spread a model across GPUs matters for latency.

Tensor parallelism splits the model's weight matrices across multiple GPUs, so each forward pass happens across all GPUs simultaneously. This reduces per-token latency because more compute is applied in parallel. For latency-critical inference, tensor parallelism on 2-4 GPUs often gives better per-request latency than running one full model per GPU.
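The column-wise split behind tensor parallelism can be checked with plain lists. This sketch simulates the "devices" sequentially in one process and assumes the column count divides evenly by the number of shards. On real hardware the shards compute concurrently and the output slices are gathered over the GPU interconnect, which is where the communication cost comes from.

```python
def matvec(x, w):
    """Multiply row vector x (length n) by matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, parts):
    """Give each simulated device a contiguous slice of the weight matrix's columns."""
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]

def tensor_parallel_matvec(x, w, parts):
    """Each shard computes its slice of the output; slices are concatenated at the end."""
    out = []
    for shard in split_columns(w, parts):
        out.extend(matvec(x, shard))
    return out

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

print(matvec(x, w))
print(tensor_parallel_matvec(x, w, parts=2))
```

Both calls produce the same vector, which is the property that lets each GPU hold and read only its share of the weights per token.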

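Pipeline parallelism splits the model into stages across GPUs, with each GPU running a subset of layers. This improves throughput when many requests are in flight, but a single request still passes through the stages one after another, so per-token latency does not shrink and inter-stage communication adds overhead. It is better suited for batch-heavy workloads than latency-critical ones.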

Hardware choice: For sub-100ms inference, you want hardware with high memory bandwidth, not just raw FLOPS. Decode is memory-bandwidth-bound, and a GPU with faster HBM will decode tokens faster than one with more raw compute but slower memory. The H100 and H200 are the current standard for production inference.

Infrastructure Placement

Network latency from your inference server to your users is often overlooked during optimization, but it can easily account for 30-50ms of your total budget.

The speed of light imposes a floor. New York and London are about 5,600 km apart, and light in fiber travels at roughly 200,000 km/s, so the round trip takes at least ~56ms with no processing at all; real routes typically measure closer to 70ms. For sub-100ms total latency, that leaves very little for inference.
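A rough way to estimate this floor for any pair of locations is to take the great-circle distance and divide by the speed of light in fiber. The sketch below assumes about 200 km per millisecond (roughly two-thirds of the vacuum speed) and uses approximate city coordinates. Real cable routes are longer than the great circle and add switching delay, so measured round trips come out higher than this idealized figure.

```python
import math

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points on a sphere of Earth's mean radius."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def min_rtt_ms(distance_km, km_per_ms=200.0):
    """Round-trip floor, assuming light in fiber covers roughly 200 km per millisecond."""
    return 2 * distance_km / km_per_ms

# Approximate coordinates for New York and London.
nyc_london_km = great_circle_km(40.71, -74.01, 51.51, -0.13)
floor_ms = min_rtt_ms(nyc_london_km)
print(f"{nyc_london_km:.0f} km, round-trip floor {floor_ms:.0f} ms")
```

Running the same calculation for your own user regions and candidate data centers gives a quick lower bound before you measure anything.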

Practical approaches:

Colocation: Place your inference server in a data center close to your largest user concentration. For US-only products, a US-East server covers ~60% of users with under 20ms network latency.

Regional deployment: For global products, run inference servers in multiple regions and route each request to the nearest one. Even two regions (US and EU) cover most users in North America and Europe with reasonable latency; users elsewhere will need a region closer to them.

Edge inference: For ultra-low latency requirements, running smaller quantized models on edge nodes can get network latency under 5ms. This makes sense for applications where inference quality on a small model is sufficient (e.g., intent classification, keyword spotting before routing to a larger model).

Using a managed inference API rather than self-hosting gives you the option to select serving regions without building out your own multi-region infrastructure. Check whether your provider offers regional endpoints and what latency each region adds for your user base.

Measuring Latency Correctly

Optimization requires measurement. The right metrics:

TTFT (Time to First Token): The time from sending the request to receiving the first response token. This is what users perceive as the initial delay and is the most important metric for interactive applications.

Inter-token latency (ITL): The average time between tokens during the decode phase. For streaming applications, this determines how smooth the output feels.

P99 latency: The 99th percentile latency, not the median. Median latency tells you about the typical request. P99 tells you what the slowest 1% of requests look like, and the users who hit them are often the ones who churn.
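Computing these percentiles from raw samples takes only a few lines. The sketch below uses the nearest-rank definition, which is one of several common conventions, and the TTFT samples are hypothetical values chosen to show a fast majority with a slow tail.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical TTFT samples in ms: mostly fast, with a slow tail.
ttft_ms = [42] * 95 + [60, 75, 90, 180, 400]
p50 = percentile(ttft_ms, 50)
p99 = percentile(ttft_ms, 99)
print(f"P50={p50}ms P99={p99}ms")
```

Here the median looks comfortably under budget while the P99 is well over it, which is exactly the situation a median-only dashboard hides.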

A reasonable measurement setup:

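```python
import time
import httpx

def measure_ttft(client: httpx.Client, prompt: str):
    """Return time to first streamed chunk in ms, or None if nothing arrived.

    `client` should be an httpx.Client created with your API's base_url
    and any auth headers it needs.
    """
    start = time.perf_counter()
    first_token_time = None
    with client.stream("POST", "/v1/chat/completions", json={
        "model": "your-model",
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
        "max_tokens": 100,
    }) as response:
        response.raise_for_status()
        # The first non-empty line of the stream is treated as the first token.
        for chunk in response.iter_lines():
            if chunk:
                first_token_time = time.perf_counter()
                break
    if first_token_time is None:
        return None  # stream ended without any data
    return (first_token_time - start) * 1000  # ms
```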

Run this from the same network location as your users, not from your own development machine or CI environment. Latency measured from a data center will look much better than what your users actually experience.

A Practical Optimization Sequence

When starting from scratch, work through these steps in order:

  1. Measure your baseline from a realistic network location. Know your current P50 and P99 TTFT.

  2. Select a smaller model if you are over budget. Try the smallest model that meets your quality bar.

  3. Enable prefix caching if your prompts share a common system prompt. This is often a free 30-50ms improvement.

  4. Apply INT8 quantization if you are not already quantized. This reduces memory pressure and speeds decode.

  5. Check your batch size configuration. For interactive workloads, smaller batches reduce queuing latency.

  6. Measure network latency from your users' regions. If more than 30ms, consider a regional deployment.

  7. Profile your P99 latency, not just median. Tail latency often reveals specific prompt patterns or model loading issues.

  8. Consider speculative decoding if you are close to your target but not quite there. It requires a matching draft model but can deliver 2-3x decode speedup.

What 100ms Actually Requires

Getting to sub-100ms in production for a general-purpose LLM task is achievable, but it requires making deliberate choices. It means using a model in the 7-14B range with INT8 or INT4 quantization, running on well-placed hardware close to your users, with prefix caching enabled and batch sizes tuned for latency.

If you are running a 70B model in a single US region and routing all global traffic through it, 100ms is not achievable. If you select the right model for your task and build around it with proper infrastructure, it is.

The optimization levers in order of impact: model size, network placement, quantization, batching strategy, and hardware selection. Fix them in that order and you will get there.


General Compute's inference API is optimized for TTFT and runs on custom ASIC infrastructure designed for low latency. If you are hitting a latency wall on your current provider, try the General Compute API -- the same OpenAI-compatible endpoints, with response times that make the 100ms target realistic.
