LLM Token Generation Speed: How Providers Compare in 2025

Token generation speed is one of the most quoted numbers in LLM inference, and also one of the most frequently misused. Providers advertise their peak tokens per second, benchmarkers publish numbers from single-request tests at 3 AM, and developers try to reconcile all of it with what they observe in production. The result is a lot of confusion about what any particular number actually means.

This post establishes a methodology for comparing LLM throughput across providers, presents measured comparisons for a common model size, and walks through the fundamental trade-off between per-request speed and aggregate system throughput.

Two different things called throughput

Before any comparison is useful, you need to agree on what you're measuring. "LLM throughput" gets used to mean two distinct things.

Per-request token generation speed (tokens per second, or TPS) is the rate at which a single user receives tokens during a streaming response. This is what determines how long a user waits for a complete answer. A 300-token response at 100 TPS finishes in 3 seconds; the same response at 30 TPS takes 10 seconds.

System-level throughput is the total tokens generated per second across all concurrent requests. This is what determines the capacity of an inference endpoint: how many users it can serve simultaneously, and at what cost per token.

These numbers have an inverse relationship in practice. When a serving system is under light load and has slack capacity, it can prioritize individual request speed. Under heavy load, it batches requests together to amortize the cost of loading model weights, which raises aggregate throughput while reducing per-request TPS.

A benchmark that runs one request at a time and reports TPS is measuring latency-optimized behavior. A benchmark that hammers an endpoint with hundreds of concurrent requests is measuring throughput-optimized behavior. Both numbers are real; they describe the same system in different operating conditions.

Methodology

Any provider comparison needs to pin down several variables or the results are incomparable.

Model. Use the same model across every provider. For this comparison, we use Llama 3.1 70B Instruct (BF16 or equivalent quality-preserving quantization). This model is widely available, large enough to stress serving infrastructure, and produces consistent output lengths.

Prompt and output length. Prefill cost grows with prompt length, and decode time grows with the number of output tokens. We use a fixed 256-token system prompt plus a user message, for 512 input tokens in total, and set max_tokens=512 so that every response is capped at the same output length. Both values are fixed across providers and runs.

Concurrency levels. We test at concurrency 1 (one request at a time), concurrency 8, and concurrency 32. Single-request numbers tell you about latency; higher-concurrency numbers tell you how the system degrades under realistic load.

Measurement location. All requests originate from a single US East region VM. Results include network transit time, which is intentional: what matters is latency from a realistic deployment location, not from a machine co-located with the inference hardware.

What we measure. For each request, we capture TTFT (time from request start to first token received, using a streaming response), per-request TPS (tokens generated divided by streaming duration, excluding TTFT), and total request latency. We run 50 samples per condition and report p50 and p95.

The test harness is standard Python:

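import time
from openai import OpenAI

def benchmark_request(client, model, prompt, max_tokens=512):
    """Stream one chat completion and return TTFT, decode TPS, and token count."""
    start = time.perf_counter()
    first_token_at = None
    token_count = 0

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )

    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            # Counts content chunks as a proxy for tokens; a provider that
            # packs several tokens into one chunk will be undercounted.
            token_count += 1

    end = time.perf_counter()
    if first_token_at is None:
        # No content arrived (empty completion or filtered response).
        raise RuntimeError("stream ended without any content")
    ttft = first_token_at - start
    # The first token marks the start of the decode window, so it is excluded.
    tps = (token_count - 1) / (end - first_token_at) if token_count > 1 else 0.0
    return {"ttft_ms": ttft * 1000, "tps": tps, "tokens": token_count}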

For concurrent load, we use asyncio with the AsyncOpenAI client (which is built on httpx) to fire multiple requests simultaneously and record per-request metrics.
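The p50 and p95 figures reported per condition can be computed in several ways. The sketch below uses a nearest-rank percentile over a list of per-request results shaped like the dictionary that benchmark_request returns. The helper names percentile and summarize are illustrative, and treating "TPS p95" as the slow tail (the 5th percentile of TPS) is an assumption based on the tables reporting a TPS p95 below the TPS p50.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    # Multiply before dividing to keep the rank computation exact for integer pct.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(results):
    """Summarize dicts that carry 'ttft_ms' and 'tps' keys."""
    ttfts = [r["ttft_ms"] for r in results]
    tps = [r["tps"] for r in results]
    return {
        "ttft_p50_ms": percentile(ttfts, 50),
        "ttft_p95_ms": percentile(ttfts, 95),
        "tps_p50": percentile(tps, 50),
        # Assumption: for TPS the bad tail is the low end, so "p95" is the 5th percentile.
        "tps_p95": percentile(tps, 5),
    }
```

With only 50 samples per condition, the p95 rests on the two or three slowest requests, so it is worth rerunning a condition before reading much into a single tail value.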

Provider comparison: Llama 3.1 70B

The tables below show p50 and p95 results from our test run in June 2025. These numbers are representative of provider capabilities at the tested conditions and will change as providers update their infrastructure.

Single request (concurrency 1)

| Provider | TTFT p50 (ms) | TPS p50 | TTFT p95 (ms) | TPS p95 |
|---|---|---|---|---|
| GeneralCompute | 180 | 310 | 220 | 295 |
| Groq | 230 | 215 | 290 | 200 |
| Fireworks AI | 410 | 110 | 580 | 95 |
| Together AI | 480 | 85 | 720 | 75 |
| Replicate | 620 | 65 | 1,100 | 55 |

GeneralCompute leads on both TTFT and TPS at single-request load, which reflects the custom ASIC hardware purpose-built for token generation throughput. Groq is competitive on TTFT (fast prefill) but falls behind on decode speed. GPU-based providers cluster in the 65-110 TPS range, which is what you'd expect from H100 hardware serving a 70B model.

At concurrency 8

| Provider | TTFT p50 (ms) | TPS p50 | TTFT p95 (ms) | TPS p95 |
|---|---|---|---|---|
| GeneralCompute | 210 | 285 | 310 | 255 |
| Groq | 290 | 180 | 820 | 145 |
| Fireworks AI | 520 | 95 | 960 | 78 |
| Together AI | 610 | 70 | 1,350 | 55 |
| Replicate | 890 | 48 | 2,200 | 35 |

Concurrency reveals how providers handle queue pressure. GeneralCompute degrades gracefully: TTFT rises about 17% and TPS drops about 8% going from concurrency 1 to 8. Groq's TTFT p95 jumps substantially (290ms to 820ms), suggesting that queue time variability becomes a factor under modest load. GPU-based providers see larger TPS drops as batching increases.

At concurrency 32

| Provider | TTFT p50 (ms) | TPS p50 | TTFT p95 (ms) | TPS p95 |
|---|---|---|---|---|
| GeneralCompute | 280 | 245 | 450 | 210 |
| Groq | 610 | 140 | 2,400 | 105 |
| Fireworks AI | 780 | 72 | 1,800 | 55 |
| Together AI | 950 | 52 | 2,900 | 38 |
| Replicate | 1,400 | 30 | 4,100 | 22 |

At higher concurrency, the differences widen. Groq's p95 TTFT at concurrency 32 is 2.4 seconds, which would be noticeable in any interactive application. GPU-based providers drop to 30-72 TPS per request, which is serviceable for many use cases but limits what's possible in latency-sensitive pipelines.
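To connect these per-request figures back to system-level throughput, a rough estimate is per-request TPS multiplied by the number of concurrent streams. The sketch below applies that to the p50 TPS values from the tables above. It is an approximation only: the p50 is not the mean, and our test traffic shares the endpoint with other tenants, so the result is the throughput our requests received rather than the provider's total capacity. The function names are illustrative.

```python
# p50 per-request TPS from the tables above, keyed by concurrency level.
P50_TPS = {
    "GeneralCompute": {1: 310, 8: 285, 32: 245},
    "Groq": {1: 215, 8: 180, 32: 140},
    "Fireworks AI": {1: 110, 8: 95, 32: 72},
    "Together AI": {1: 85, 8: 70, 32: 52},
    "Replicate": {1: 65, 8: 48, 32: 30},
}

def estimated_aggregate_tps(per_request_tps, concurrency):
    """Approximate tokens/s delivered across all of our concurrent streams."""
    return per_request_tps * concurrency

def retained_fraction(tps_by_concurrency, level, baseline=1):
    """Share of single-request TPS that each stream keeps at the given concurrency."""
    return tps_by_concurrency[level] / tps_by_concurrency[baseline]

for provider, tps in P50_TPS.items():
    agg = estimated_aggregate_tps(tps[32], 32)
    kept = retained_fraction(tps, 32)
    print(f"{provider:15s} ~{agg:>5d} tok/s across 32 streams, "
          f"{kept:.0%} of single-request TPS")
```

Reading both columns together shows the trade-off: every provider delivers far more total tokens per second at concurrency 32 than at concurrency 1, while each individual stream slows down.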

The throughput vs latency trade-off in practice

These numbers illustrate a pattern worth understanding before you decide what to optimize.

Batching raises system throughput at the expense of per-request speed. When a serving system processes 32 concurrent requests, it can read the model weights from memory once per decode step and compute that step for all 32 requests together. This dramatically improves tokens per dollar for the operator, but each individual request gets a smaller share of compute per iteration. Per-request TPS drops as concurrency rises for this reason.

The shape of that drop matters. A well-designed serving system should degrade smoothly: while the hardware still has headroom, per-request TPS should fall only gradually as concurrency rises, and only once the system is saturated should doubling concurrency roughly halve per-request TPS (because at that point you're splitting a fixed resource). If p95 TTFT explodes at moderate concurrency while p50 stays low, that's a sign of queueing problems, where most requests are served quickly but a tail of unlucky requests sits in queue for a long time.
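One simple way to put a number on that queueing signal is the ratio of p95 TTFT to p50 TTFT at a given concurrency. The sketch below is a heuristic, not an established metric; the function names and the 3.0 cutoff are assumptions chosen for illustration and should be tuned to your own latency budget.

```python
def tail_ratio(ttft_p50_ms, ttft_p95_ms):
    """How many times slower the p95 first token is than the median."""
    return ttft_p95_ms / ttft_p50_ms

def looks_queue_bound(ttft_p50_ms, ttft_p95_ms, threshold=3.0):
    """Flag a condition whose TTFT tail is far above its median.

    threshold is an arbitrary illustrative cutoff, not a standard value.
    """
    return tail_ratio(ttft_p50_ms, ttft_p95_ms) >= threshold

# TTFT (p50, p95) in ms at concurrency 32, taken from the table above.
for name, p50, p95 in [("GeneralCompute", 280, 450), ("Groq", 610, 2400)]:
    print(f"{name}: p95/p50 = {tail_ratio(p50, p95):.1f}, "
          f"queue-bound: {looks_queue_bound(p50, p95)}")
```

Tracking this ratio across concurrency levels is more informative than a single reading: a ratio that stays flat as load rises suggests smooth degradation, while one that climbs suggests requests are waiting in a queue.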

The right metric depends on your application type.

For voice agents and real-time conversation, TTFT under concurrency is what matters most. The user is talking to the model in real time, and a 2-second TTFT is a 2-second silence. TPS matters too (the response has to arrive faster than it can be spoken), but high TPS at the cost of unpredictable TTFT is worse than moderate TPS with consistent low latency.

For coding assistants and IDE autocomplete, TPS at low concurrency is the primary variable. The user triggered one request; it needs to finish quickly, and there's no other user to worry about. Single-request TPS is a near-perfect proxy for how the tool feels.

For agentic pipelines, both metrics compound. An agent making 10 sequential tool calls at 400ms TTFT each accumulates 4 seconds of dead time before any generation starts. Fast TTFT plus high TPS per request makes multi-step workflows significantly more responsive, since both the wait time and the generation time add up across steps.
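The compounding can be made concrete with a small model in which each sequential step costs its TTFT plus its generation time. This is a simplified sketch: it ignores tool execution time and the fact that prompts usually grow from step to step, and the 150 output tokens per step is a hypothetical figure, not a measurement.

```python
def step_latency_s(ttft_ms, tps, output_tokens):
    """Seconds for one model call: wait for the first token, then stream the rest."""
    return ttft_ms / 1000 + output_tokens / tps

def pipeline_latency_s(steps, ttft_ms, tps, output_tokens):
    """Seconds for a chain of identical sequential model calls."""
    return steps * step_latency_s(ttft_ms, tps, output_tokens)

# Hypothetical agent: 10 sequential calls, 150 output tokens each.
slow = pipeline_latency_s(steps=10, ttft_ms=400, tps=100, output_tokens=150)
fast = pipeline_latency_s(steps=10, ttft_ms=180, tps=310, output_tokens=150)
print(f"400 ms TTFT at 100 TPS: {slow:.1f} s total")
print(f"180 ms TTFT at 310 TPS: {fast:.1f} s total")
```

In the first case, 4 of the total seconds are pure TTFT, which is the dead time described above; the remainder is generation time that scales inversely with TPS.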

For batch processing or offline analysis, throughput per dollar is the right metric. TTFT is irrelevant; what matters is how many tokens you can generate per dollar across many concurrent requests. GPU-based providers running at high utilization often win here, since they can batch deeply and amortize costs.

How to run this comparison yourself

Provider performance changes. New hardware ships, inference stacks get optimized, capacity gets added or constrained. The methodology above is straightforward to run against any OpenAI-compatible endpoint:

import asyncio
import math
import time
from openai import AsyncOpenAI

async def run_concurrent_benchmark(base_url, api_key, model, concurrency, n_samples=20):
    client = AsyncOpenAI(base_url=base_url, api_key=api_key)
    prompt = "Explain the tradeoffs between synchronous and asynchronous I/O in production web services, covering thread pools, event loops, and when each model is preferred."

    async def single_request():
        start = time.perf_counter()
        first_token_at = None
        tokens = 0
        # create(stream=True) yields raw chunks that expose choices[0].delta.
        stream = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=512,
            stream=True,
        )
        async for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                if first_token_at is None:
                    first_token_at = time.perf_counter()
                tokens += 1  # content chunks, used as a proxy for tokens
        end = time.perf_counter()
        if first_token_at is None:
            raise RuntimeError("stream ended without any content")
        return {
            "ttft_ms": (first_token_at - start) * 1000,
            "tps": (tokens - 1) / (end - first_token_at) if tokens > 1 else 0.0,
        }

    results = []
    # Round up so that concurrency > n_samples still runs one full wave.
    for _ in range(math.ceil(n_samples / concurrency)):
        batch = await asyncio.gather(*[single_request() for _ in range(concurrency)])
        results.extend(batch)

    ttfts = sorted(r["ttft_ms"] for r in results)
    # Sort TPS descending so that p95 is the slow tail, as in the tables above.
    tps_vals = sorted((r["tps"] for r in results), reverse=True)
    p50 = len(results) // 2
    p95 = min(int(len(results) * 0.95), len(results) - 1)
    print(f"TTFT p50: {ttfts[p50]:.0f}ms, p95: {ttfts[p95]:.0f}ms")
    print(f"TPS  p50: {tps_vals[p50]:.0f},   p95: {tps_vals[p95]:.0f}")

Run this against each provider at the concurrency levels that match your expected production load. The numbers will matter more than any published benchmark because they reflect your actual prompts, your actual deployment topology, and the provider's actual state of capacity.
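One limitation of firing requests in waves with asyncio.gather is that each wave waits for its slowest request, so the number of requests in flight dips below the target as a wave drains. A possible alternative, sketched below, keeps a constant number in flight with asyncio.Semaphore. The names run_steady and request_fn are hypothetical; request_fn stands in for any zero-argument coroutine function, such as the single_request helper above.

```python
import asyncio

async def run_steady(request_fn, concurrency, n_samples):
    """Run n_samples calls of request_fn with at most `concurrency` in flight.

    A new call starts as soon as any running call finishes, so the load
    stays at the target level until the final few samples.
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded():
        async with semaphore:
            return await request_fn()

    # gather preserves submission order in the returned list.
    return await asyncio.gather(*(guarded() for _ in range(n_samples)))
```

Inside run_concurrent_benchmark, this could replace the wave loop with something like results = await run_steady(single_request, concurrency, n_samples), which also yields exactly n_samples results regardless of the concurrency level.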

What to do with this information

A few practical takeaways:

Use single-request TPS as a quick filter for latency-sensitive use cases. If a provider can't break 100 TPS at concurrency 1 on a 70B model, it's going to struggle with real-time applications. Use concurrency-8 or concurrency-32 TPS, the two loaded levels tested above, as the filter for batch or moderate-load use cases.

Watch p95 TTFT, not p50. A p50 of 300ms looks fine. A p95 of 2,400ms means one in twenty user requests is waiting more than two seconds before seeing any text, which is a worse UX outcome than a consistent 600ms p50 with a stable p95.

Consider the model size you actually need. A 7B or 8B model at 600+ TPS on capable hardware is meaningfully faster than a 70B model at 80 TPS, and for many tasks the quality gap is smaller than the speed gap is large. The fastest 70B inference is still substantially slower than a purpose-built 8B deployment.

Try it against GeneralCompute

GeneralCompute's inference API is OpenAI-compatible and runs on custom ASIC hardware optimized specifically for token generation throughput. The benchmark harness above works directly against it by swapping the base URL:

client = AsyncOpenAI(
    base_url="https://api.generalcompute.com/v1",
    api_key="your-api-key",
)

Run the benchmarks with your own prompts and your own concurrency targets. The numbers above are a starting point; what matters is how the system performs on your specific workload from your specific location. You can get an API key at generalcompute.com.