GeneralCompute vs vLLM: Throughput, Latency, and Cost Benchmarks

A head-to-head comparison of vLLM self-hosted on H100s versus GeneralCompute's managed inference API: full methodology, throughput and latency numbers, and a total cost of operations breakdown.

Author: General Compute
Published: 2026-06-11
Tags: vllm, benchmarks, inference speed, throughput, latency

If you search "vLLM vs" followed by almost any inference provider, you will find plenty of opinions and very few numbers with a methodology attached. That is a problem, because the comparison people actually want to make is not framework versus framework. It is "should I run vLLM on GPUs I rent or own, or should I pay a managed API per token?" Those are different products with different cost structures, and comparing them honestly takes more care than a single tokens-per-second screenshot.

This post is our attempt to do that comparison properly. We benchmarked vLLM on rented H100s against the GeneralCompute API across throughput, time to first token, per-request generation speed, and cost per million tokens. We are publishing the full setup so you can reproduce the numbers yourself, and we will be upfront about where vLLM wins, because it does win in some configurations.

## What We Are Actually Comparing

First, the framing matters. vLLM is an open-source serving framework. It is excellent software: PagedAttention, continuous batching, prefix caching, and speculative decoding support have made it the default choice for self-hosting, and the project moves fast. When we benchmark "vLLM," we are really benchmarking vLLM plus the GPUs you run it on, plus the engineering time you spend operating it.

GeneralCompute is a managed inference API running on custom ASIC infrastructure. You do not pick hardware, tune batch sizes, or manage autoscaling. You send OpenAI-compatible requests and pay per token.

So the comparison has three layers:

1. **Raw performance**: tokens per second and latency for the same model and the same workload.
2. **Cost**: dollars per million tokens, including the utilization problem that self-hosting cannot escape.
3. **Operations**: the engineering cost that never shows up in a benchmark table.

## Methodology

Everything below uses the same workload definition so the numbers are comparable.

- **Models**: Llama 3.1 8B Instruct and Llama 3.1 70B Instruct, both in FP8. We chose these because they are among the most commonly self-hosted open models and both are available on GeneralCompute.
- **vLLM setup**: vLLM 0.8.x with default continuous batching, FP8 quantization, and prefix caching enabled. The 8B model ran on a single H100 SXM (80GB). The 70B model ran on 4x H100 with tensor parallelism. Instances were rented at on-demand cloud GPU rates of $2.99 per H100-hour, which is a representative mid-market price as of mid-2026.
- **Workload**: requests with 1,024 input tokens and 256 output tokens, which is close to the median shape we see in production chat and agent traffic. Load was generated with an open-source benchmark client (the exact commands are in Reproducing These Numbers below), sweeping concurrency from 1 to 256 simultaneous requests.
- **Metrics**: time to first token (TTFT) at p50 and p99, per-request output tokens per second (TPS), and aggregate output throughput across the whole server.
- **GeneralCompute setup**: the public API, no reserved capacity, measured from a client in the same cloud region as our nearest endpoint.
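
For readers who log their own request timings, the metrics in the list above can be computed roughly as follows. This is a minimal sketch, not the benchmark client's implementation: the `RequestTiming` record, the nearest-rank percentile, and the choice to measure per-request TPS over the decode window only are all assumptions made for illustration.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class RequestTiming:
    """Client-side timestamps in seconds for one streamed request (hypothetical shape)."""
    sent_at: float          # request left the client
    first_token_at: float   # first output token arrived
    finished_at: float      # last output token arrived
    output_tokens: int


def percentile(values, pct):
    """Nearest-rank percentile for pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(timings):
    """Return TTFT percentiles, median per-request TPS, and aggregate throughput."""
    ttft_ms = [(t.first_token_at - t.sent_at) * 1000 for t in timings]
    # Assumption: per-request TPS counts output tokens over the decode window only.
    tps = [t.output_tokens / (t.finished_at - t.first_token_at) for t in timings]
    # Aggregate throughput: all output tokens divided by the wall-clock span of the run.
    wall_clock = max(t.finished_at for t in timings) - min(t.sent_at for t in timings)
    return {
        "ttft_p50_ms": percentile(ttft_ms, 50),
        "ttft_p99_ms": percentile(ttft_ms, 99),
        "per_request_tps_p50": percentile(tps, 50),
        "aggregate_tps": sum(t.output_tokens for t in timings) / wall_clock,
    }
```

Note that per-request TPS and aggregate throughput answer different questions: the first is what one user sees, the second is what the whole server delivers.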

One caveat to state plainly: we make the hardware GeneralCompute runs on, so we are not a neutral party. That is exactly why the methodology is spelled out. If your numbers come out different, we want to hear about it.

## Throughput: Where vLLM Holds Its Own

Aggregate throughput is vLLM's home turf. Continuous batching exists to keep GPUs saturated, and at high concurrency it does its job well.

| Configuration | Concurrency | Aggregate output throughput |
|---|---|---|
| vLLM, Llama 8B, 1x H100 | 256 | ~11,200 tok/s |
| vLLM, Llama 70B, 4x H100 | 256 | ~3,400 tok/s |
| GeneralCompute, Llama 8B | 256 | ~38,000 tok/s |
| GeneralCompute, Llama 70B | 256 | ~14,500 tok/s |

The GeneralCompute numbers are higher, but read them carefully: an API has effectively elastic capacity behind it, so "throughput at 256 concurrent requests" measures whether the service degrades under your load, not the limit of a fixed hardware footprint. The fair takeaway is that a well-tuned vLLM deployment achieves strong aggregate throughput for its hardware, and an H100 running vLLM at full batch is a genuinely efficient machine.

The catch is the word "full." Aggregate throughput numbers assume you have enough traffic to keep the batch full around the clock. Most teams do not, and that assumption is where the cost section below gets interesting.

## Latency: TTFT and Per-Request Speed

Throughput is what the GPU experiences. Latency is what your users experience, and here the two stacks diverge sharply.

| Configuration | TTFT p50 | TTFT p99 | Per-request TPS (at 64 concurrent) |
|---|---|---|---|
| vLLM, Llama 8B, 1x H100 | 142 ms | 890 ms | 48 tok/s |
| vLLM, Llama 70B, 4x H100 | 310 ms | 2,400 ms | 22 tok/s |
| GeneralCompute, Llama 8B | 91 ms | 180 ms | 740 tok/s |
| GeneralCompute, Llama 70B | 118 ms | 230 ms | 285 tok/s |

Two things stand out.

The first is per-request generation speed. Continuous batching trades individual request speed for aggregate throughput: the more requests share the GPU, the slower each one decodes. At 64 concurrent requests, a single user on the vLLM 70B deployment sees about 22 tokens per second, which means a 256-token response takes roughly 12 seconds to finish streaming. The same request on GeneralCompute finishes in about a second. For chat UIs this is a comfort difference. For agent loops that run 10 to 15 sequential model calls per task, it is the difference between a 15 second task and a 3 minute task.
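
The arithmetic behind those figures can be sketched as below, using the 70B rows of the latency table. The sketch assumes every call in the agent loop emits 256 output tokens, pays the p50 TTFT, and spends no time on tool execution or network hops between calls; the function names are illustrative.

```python
def response_seconds(ttft_ms, output_tokens, tokens_per_second):
    """Time until the last token of one streamed response arrives."""
    return ttft_ms / 1000 + output_tokens / tokens_per_second


def task_seconds(calls, ttft_ms, output_tokens, tokens_per_second):
    """Sequential agent loop: each call waits for the previous one to finish."""
    return calls * response_seconds(ttft_ms, output_tokens, tokens_per_second)


# Llama 70B rows from the latency table: (TTFT p50 in ms, per-request tok/s at 64 concurrent).
stacks = {
    "vLLM 4x H100": (310, 22),
    "GeneralCompute": (118, 285),
}

for name, (ttft_ms, tps) in stacks.items():
    one = response_seconds(ttft_ms, 256, tps)
    loop = task_seconds(15, ttft_ms, 256, tps)
    print(f"{name}: {one:.1f} s per response, {loop:.1f} s for a 15-call task")
# vLLM 4x H100: 11.9 s per response, 179.2 s for a 15-call task
# GeneralCompute: 1.0 s per response, 15.2 s for a 15-call task
```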

The second is p99 TTFT. vLLM's tail latency degrades under load because new requests queue behind prefill work for the batch already in flight. Chunked prefill helps, and tuning helps more, but a fixed pool of GPUs has a queue when traffic spikes. This is not a vLLM flaw. It is the physics of fixed capacity, and when teams who migrate tell us what pushed them to switch, the p99 comes up far more often than the median.

If your traffic is batch-shaped (overnight ETL, document processing, evals), none of this matters and you should weight the throughput section heavily. If your traffic is interactive, the latency table is the one that predicts what your users feel.

## Cost: The Utilization Problem

Here is the naive math that makes self-hosting look cheap. One H100 at $2.99/hour running Llama 8B at 11,200 tok/s aggregate produces about 40 million output tokens per hour, which works out to roughly $0.07 per million output tokens. That undercuts every API on the market, ours included.

The naive math assumes 100% utilization, 24 hours a day. Real interactive traffic is diurnal and spiky. Across the deployments we have seen migrate to us, sustained utilization on self-hosted inference clusters typically lands between 20% and 40%, because you must provision for peak and the peak-to-trough ratio for a typical product is 3x to 5x. Autoscaling helps less than you would hope: model weights take minutes to load, so scaling reactively means eating cold starts during your highest-traffic moments, and most teams keep a buffer warm instead.
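
The GPU-rental part of that math is simple enough to sketch. The function below is an illustration rather than the post's cost model: it covers rental cost only (no engineering time), and it treats utilization as the fraction of peak aggregate throughput actually served, averaged over time.

```python
def cost_per_million_tokens(gpu_hourly_usd, gpus, peak_tokens_per_second, utilization):
    """GPU rental cost per 1M output tokens at a given sustained utilization (0 to 1]."""
    millions_per_hour = peak_tokens_per_second * 3600 / 1_000_000 * utilization
    return gpu_hourly_usd * gpus / millions_per_hour


# Llama 8B on one H100 at $2.99/hour and 11,200 tok/s peak aggregate throughput.
for utilization in (1.0, 0.4, 0.2):
    cost = cost_per_million_tokens(2.99, 1, 11_200, utilization)
    print(f"{utilization:.0%} utilization: ${cost:.2f} per 1M output tokens")
# 100% utilization: $0.07 per 1M output tokens
# 40% utilization: $0.19 per 1M output tokens
# 20% utilization: $0.37 per 1M output tokens
```

The bill is fixed while the token count shrinks, so cost per token scales with the inverse of utilization.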

Here is the same math at realistic utilization, including a modest allowance for the engineering time to operate the cluster (we used 15% of one engineer at a fully-loaded $200k/year, which is conservative for a production deployment with on-call):

| Scenario | Effective cost per 1M output tokens (Llama 8B) |
|---|---|
| vLLM, 100% utilization, no ops cost | $0.07 |
| vLLM, 40% utilization | $0.19 |
| vLLM, 40% utilization + ops allowance at 50M tok/day | $0.26 |
| vLLM, 20% utilization + ops allowance at 50M tok/day | $0.45 |
| GeneralCompute API, pay per token | $0.10 |

For the 70B model the spread is wider, because the 4x H100 footprint costs about $12 per hour (4 x $2.99) whether it is busy or not, and a quiet 48-hour weekend burns roughly $576 of idle GPU.
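
A short sketch of the fixed-footprint arithmetic for the 70B deployment, using the unrounded $2.99 rate (which is why the weekend figure lands a couple of dollars under the rounded one). The per-token value is derived here from the throughput table as an illustration, not a measured cost from the benchmark, and it covers GPU rental only.

```python
GPU_HOURLY_USD = 2.99
GPUS = 4
PEAK_TOKENS_PER_SECOND = 3_400  # vLLM, Llama 70B, 4x H100 at 256 concurrent


def footprint_cost(hours):
    """Rental cost of the 4x H100 footprint for a window, busy or idle."""
    return GPU_HOURLY_USD * GPUS * hours


def cost_per_million_at_full_load():
    """GPU-only cost per 1M output tokens if the batch stays full the whole time."""
    millions_per_hour = PEAK_TOKENS_PER_SECOND * 3600 / 1_000_000
    return footprint_cost(1) / millions_per_hour


print(f"hourly footprint: ${footprint_cost(1):.2f}")    # hourly footprint: $11.96
print(f"idle weekend:     ${footprint_cost(48):.2f}")   # idle weekend:     $574.08
print(f"per 1M tokens at full load: ${cost_per_million_at_full_load():.2f}")
```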

The honest summary of the cost picture:

- **At very high, steady utilization, vLLM on rented GPUs is cheaper per token.** If you run saturated batch workloads around the clock, self-hosting wins on unit cost and it is not close.
- **At typical interactive utilization, the managed API is cheaper**, before you even count latency or engineering time.
- **The crossover point in our modeling sits around 60 to 70% sustained utilization**, which very few interactive products achieve.

## When You Should Run vLLM

We sell the alternative, so take this section as the steelman it is meant to be. vLLM is the right choice when:

- **Your workload is offline and batch-shaped.** Saturated GPUs running evals, synthetic data generation, or document pipelines hit the utilization numbers where self-hosting wins.
- **You have hard data residency or air-gap requirements.** If the tokens cannot leave your VPC or your building, a managed API is off the table regardless of price.
- **You run custom or heavily fine-tuned models that no provider hosts.** vLLM will serve almost any architecture on Hugging Face. A managed API serves its catalog. (GeneralCompute does host custom fine-tunes, but if you are iterating on exotic architectures daily, local control is worth a lot.)
- **You already own the GPUs.** Sunk hardware changes the math completely. The marginal cost of using idle owned H100s is power, and power is cheap.

## When the Managed API Wins

The API is the right choice when:

- **Per-request speed matters.** Voice agents, coding agents, and interactive products live and die on TTFT and streaming speed, and a batched GPU server fundamentally trades those away.
- **Your traffic is spiky or growing.** Paying per token means peak capacity is someone else's problem, and a launch-day traffic spike is a billing event rather than an outage.
- **Your team is small.** Operating vLLM in production means owning quantization choices, batch tuning, GPU driver issues, autoscaling, and on-call. That work is real even when it is going well.
- **Your utilization is below roughly 60%.** Per the math above, idle GPUs are the most expensive GPUs.

## Reproducing These Numbers

Benchmarks you cannot reproduce are advertising. The vLLM side of this post can be reproduced with vLLM's own benchmark script:

```bash
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 256 \
  --max-concurrency 64 \
  --num-prompts 1000
```

The GeneralCompute side works with the same script pointed at our OpenAI-compatible endpoint:

```bash
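# benchmark_serving.py builds the request URL as --base-url + --endpoint,
# so the /v1 prefix goes in --endpoint rather than in --base-url.
# --tokenizer is set because the API model name is not a Hugging Face repo id.
python benchmarks/benchmark_serving.py \
  --backend openai \
  --base-url https://api.generalcompute.com \
  --endpoint /v1/completions \
  --model llama-3.1-8b-instruct \
  --tokenizer meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name random \
  --random-input-len 1024 \
  --random-output-len 256 \
  --max-concurrency 64 \
  --num-prompts 1000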
```

Sweep `--max-concurrency` from 1 to 256 to build the full curves. If your results disagree with ours in either direction, we would genuinely like to see them.
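
One way to script the sweep is a small Python driver around the same command. This is a sketch under assumptions: the powers-of-two concurrency levels are an illustrative choice (the post only states the 1 to 256 range), and `run_sweep` expects to be called from a vLLM checkout with a server already running.

```python
import subprocess

# Assumption: powers of two between the 1 and 256 bounds given in the methodology.
CONCURRENCY_LEVELS = [1, 2, 4, 8, 16, 32, 64, 128, 256]


def build_command(concurrency, num_prompts=1000):
    """Return the argv list for one benchmark run at a given concurrency."""
    return [
        "python", "benchmarks/benchmark_serving.py",
        "--backend", "vllm",
        "--model", "meta-llama/Llama-3.1-8B-Instruct",
        "--dataset-name", "random",
        "--random-input-len", "1024",
        "--random-output-len", "256",
        "--max-concurrency", str(concurrency),
        "--num-prompts", str(num_prompts),
    ]


def run_sweep(levels=CONCURRENCY_LEVELS):
    """Run the benchmark once per concurrency level, stopping on the first failure."""
    for concurrency in levels:
        subprocess.run(build_command(concurrency), check=True)


# run_sweep()  # uncomment inside a vLLM checkout with the server up
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` stops the sweep on a failed run rather than silently producing a partial curve.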

## Bottom Line

vLLM is the best open-source serving stack available, and for saturated batch workloads on hardware you already have, it is the most cost-effective way to run open models. For interactive products, the combination of per-request latency, tail behavior under load, and the utilization tax on fixed GPU capacity means a fast managed API usually delivers a better product at a lower all-in cost.

If you want to check the second half of that claim, the [GeneralCompute API](https://generalcompute.com) is OpenAI-compatible, so pointing your existing benchmark (or your existing app) at it is a one-line change. Run the numbers on your own traffic shape. That is the only benchmark that actually matters for your decision.