H100 vs H200 vs B200: Which GPU Is Best for LLM Inference?
If you're provisioning GPU infrastructure for LLM inference, the choice between an H100, H200, or B200 comes down to a few variables that the spec sheets don't make obvious. All three are high-end NVIDIA data center GPUs. But for inference specifically, what matters most is memory bandwidth, memory capacity, and cost -- not raw FLOPs. This post breaks down the differences, explains why they matter for token generation, and walks through how to think about cost per token at different model sizes.
What actually limits token generation speed
LLM inference has two phases: prefill and decode. Understanding which hardware resource each phase needs changes how you interpret any benchmark.
Prefill is the processing of your input prompt. The model runs a forward pass over all input tokens at once, which is highly parallelizable. This phase is compute-bound: the main bottleneck is floating-point operations per second (FLOPs). A 1,000-token prompt takes roughly 10x the prefill compute of a 100-token prompt, and slightly more once the quadratic attention term is counted.
Decode is the token-by-token generation that follows. For each output token, the model reads the full set of model weights and the KV cache from memory, performs one forward pass, and produces one token. This phase is memory bandwidth-bound. The GPU spends most of its time reading weights, not computing. Compute headroom sits largely idle during decode.
This means: for interactive chat and real-time applications, where most time is spent in decode, memory bandwidth is the variable that directly translates to tokens per second. FLOPs matter mostly for throughput workloads where you batch many requests together and spend a meaningful fraction of time in prefill.
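To make the compute-bound versus bandwidth-bound split concrete, here is a rough sketch of the batch size at which a decode step stops being limited by weight reads. It assumes about 2 FLOPs per parameter per generated token, 16-bit weights paired with the FP16 dense peak from the spec table, and it ignores KV cache reads and kernel overheads. The function name is illustrative, not from any serving stack.

```python
def break_even_batch(bytes_per_param: float, bandwidth_bytes_s: float, peak_flops: float) -> float:
    """Batch size where compute time equals weight-read time for one decode step.

    Weight-read time: params * bytes_per_param / bandwidth (weights are read once per step).
    Compute time:     2 * params * batch / peak_flops (assumed ~2 FLOPs per param per token).
    Setting the two equal, the parameter count cancels out.
    """
    return peak_flops * bytes_per_param / (2 * bandwidth_bytes_s)


# Peak numbers taken from the spec table; 16-bit weights are an assumption.
gpus = {
    "H100": (3.35e12, 989e12),
    "H200": (4.8e12, 989e12),
    "B200": (8.0e12, 2.25e15),
}

for name, (bandwidth, flops) in gpus.items():
    batch = break_even_batch(2, bandwidth, flops)
    print(f"{name}: decode stays bandwidth-bound below a batch of about {batch:.0f}")
```

Under these assumptions the break-even batch is in the hundreds on all three GPUs, which is why single-request decode leaves compute mostly idle.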
Spec comparison
Here are the relevant numbers for inference workloads across the three current-generation NVIDIA options.
| Spec | H100 SXM5 | H200 SXM5 | B200 SXM6 |
|---|---|---|---|
| Memory | 80 GB HBM3 | 141 GB HBM3e | 192 GB HBM3e |
| Memory bandwidth | 3.35 TB/s | 4.8 TB/s | 8.0 TB/s |
| FP16 peak (dense) | 989 TFLOPS | 989 TFLOPS | 2.25 PFLOPS |
| FP8 peak (with sparsity) | 3.96 PFLOPS | 3.96 PFLOPS | 9.0 PFLOPS |
| FP4 peak (with sparsity) | -- | -- | 18.0 PFLOPS |
| NVLink bandwidth | 900 GB/s | 900 GB/s | 1,800 GB/s |
| TDP | 700 W | 700 W | 1,000 W |
A few things stand out here.
The H200 and H100 have identical compute (FP16 and FP8 FLOPs are the same). The H200 is a memory upgrade to the H100: 61 GB more capacity and 43% more bandwidth. For inference, especially decode, the H200 is straightforwardly faster than the H100 at the same model size. The upgrade is essentially all memory subsystem.
The B200 is a full architecture change (Blackwell vs Hopper). It has significantly more compute across all precisions, including new FP4 support, and doubles NVLink bandwidth vs the H100/H200. For inference, the 8 TB/s memory bandwidth (2.4x the H100) is the headline number.
Tokens per second: real workload estimates
These numbers are based on measured throughput for Llama-class models on each GPU, using FP8 quantization (which preserves quality well and is supported on all three GPUs natively).
Single-request decode speed, Llama 3.1 70B FP8
| GPU | Tokens/s (single request) | Approx. occupancy per GPU |
|---|---|---|
| H100 SXM5 | ~115-135 | ~70 GB (weights + KV) |
| H200 SXM5 | ~160-185 | ~70 GB on 141 GB GPU |
| B200 SXM6 | ~300-360 | ~70 GB on 192 GB GPU |
The B200's advantage is roughly proportional to its bandwidth advantage over the H100 (2.4x bandwidth, roughly 2.6x tokens/s at the midpoints of the ranges above). The H200's advantage over the H100 is roughly 40%, which tracks its 43% bandwidth increase. These numbers assume the workload is bandwidth-bound, which single-request decode at 70B is.
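The proportionality claim can be checked with a few lines of arithmetic. This sketch compares each GPU's bandwidth ratio against the H100 with the ratio of the midpoints of the tokens/s ranges in the table above. Taking the midpoint of each range is an assumption made for illustration.

```python
# Values copied from the spec table and the 70B single-request table.
bandwidth_tb_s = {"H100": 3.35, "H200": 4.8, "B200": 8.0}
tokens_s_range = {"H100": (115, 135), "H200": (160, 185), "B200": (300, 360)}


def ratios_vs_baseline(values: dict, baseline: str = "H100") -> dict:
    """Express every entry as a multiple of the baseline entry."""
    return {name: value / values[baseline] for name, value in values.items()}


# Midpoint of each reported range (an illustrative simplification).
midpoints = {name: (low + high) / 2 for name, (low, high) in tokens_s_range.items()}

bandwidth_ratio = ratios_vs_baseline(bandwidth_tb_s)
throughput_ratio = ratios_vs_baseline(midpoints)

for name in bandwidth_tb_s:
    print(f"{name}: bandwidth x{bandwidth_ratio[name]:.2f}, tokens/s x{throughput_ratio[name]:.2f}")
```

If decode were purely bandwidth-bound, the two ratios would match exactly. A gap between them suggests that something other than weight reads, such as kernel efficiency or KV cache traffic, also differs between the GPUs.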
Llama 3.1 405B FP8 -- single DGX node (8 GPUs)
| Node | FP8 weights size | Bandwidth per GPU | Tokens/s (8-way tensor parallel) |
|---|---|---|---|
| 8x H100 SXM5 | ~405 GB | 3.35 TB/s | ~35-45 |
| 8x H200 SXM5 | ~405 GB | 4.8 TB/s | ~50-65 |
| 8x B200 SXM6 | ~405 GB | 8.0 TB/s | ~100-130 |
At 405B with 8-way tensor parallelism, NVLink bandwidth also becomes relevant. H100 and H200 share the same 900 GB/s NVLink bandwidth. The B200 doubles this to 1,800 GB/s, which helps sustain higher per-token throughput when the all-reduce operations between GPUs become the bottleneck.
Memory capacity and model fit
Memory capacity determines which models you can run at all on a given configuration, and with how much room left for KV cache (which determines your maximum batch size and sequence length).
H100 SXM5 (80 GB):
- Llama 3.1 70B FP8: ~70 GB weights, ~10 GB remaining for KV cache
- Llama 3.1 70B BF16: ~140 GB -- needs 2 GPUs
- Llama 3.1 405B FP8: ~405 GB -- needs at least 6 GPUs by capacity, in practice a full 8-GPU node
- Qwen 2.5 72B FP8: ~72 GB, fits on one GPU with little room left for KV cache
H200 SXM5 (141 GB):
- Llama 3.1 70B BF16: ~140 GB -- fits on a single GPU, barely; leaves little KV cache room
- Llama 3.1 70B FP8: fits easily with ~71 GB for KV cache
- Llama 3.1 405B FP8: ~405 GB -- needs at least 3 GPUs by capacity, 4 in practice
- Long-context workloads benefit substantially since the extra capacity goes directly to KV cache
B200 SXM6 (192 GB):
- Llama 3.1 405B FP8: ~405 GB -- needs at least 3 GPUs by capacity (4 in practice), so 8-way tensor parallel is no longer required
- The larger memory capacity reduces the number of GPUs required for large models, which cuts NVLink all-reduce overhead and can improve effective tokens/s
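The fit arithmetic in the lists above can be sketched as follows. Weight size is approximated as parameter count times bytes per parameter. The `kv_reserve_gb` parameter is a hypothetical knob for holding back memory per GPU, and the result is a capacity lower bound only: real tensor-parallel deployments usually round up to a GPU count that divides the attention heads evenly.

```python
import math

# Capacities from the spec table; bytes per parameter by precision.
GPU_MEMORY_GB = {"H100": 80, "H200": 141, "B200": 192}
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weight_size_gb(params_billions: float, precision: str) -> float:
    """Approximate weight footprint: 1B parameters at 1 byte each is about 1 GB."""
    return params_billions * BYTES_PER_PARAM[precision]


def min_gpus_for_weights(params_billions: float, precision: str, gpu: str,
                         kv_reserve_gb: float = 0.0) -> int:
    """Smallest GPU count whose combined memory holds the weights.

    kv_reserve_gb is memory held back on each GPU for KV cache and activations.
    """
    usable_gb = GPU_MEMORY_GB[gpu] - kv_reserve_gb
    if usable_gb <= 0:
        raise ValueError("kv_reserve_gb leaves no memory for weights")
    return math.ceil(weight_size_gb(params_billions, precision) / usable_gb)


for gpu in GPU_MEMORY_GB:
    print(gpu, "405B FP8 needs at least", min_gpus_for_weights(405, "fp8", gpu), "GPUs")
```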
Memory capacity matters more than it sounds for production workloads. KV cache scales with batch size times sequence length. At 8k context with a batch of 32, Llama 3.1 70B needs roughly 40 GB of KV cache with 8-bit keys and values, and about twice that at 16-bit. Tight memory means you either cap batch size (hurting throughput) or cap sequence length (hurting capability).
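A minimal sketch of the KV cache scaling described above. The layer count, KV head count, and head dimension are those of the published Llama 3.1 70B configuration (80 layers, 8 KV heads under grouped-query attention, head dimension 128). The function itself is illustrative and ignores paging and allocator overhead.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   batch_size: int, seq_len: int, bytes_per_value: int) -> int:
    """Total KV cache size for a batch of sequences at a given length."""
    # Factor of 2: each token stores one key and one value vector per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * batch_size * seq_len


# Llama 3.1 70B shape, batch of 32 at 8k context.
fp8_cache = kv_cache_bytes(80, 8, 128, batch_size=32, seq_len=8192, bytes_per_value=1)
fp16_cache = kv_cache_bytes(80, 8, 128, batch_size=32, seq_len=8192, bytes_per_value=2)

print(f"8-bit KV cache:  {fp8_cache / 1e9:.1f} GB")
print(f"16-bit KV cache: {fp16_cache / 1e9:.1f} GB")
```

Because the size is linear in both batch size and sequence length, halving either one halves the cache, which is the trade-off between throughput and context length.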
Cost per token analysis
Hardware specs only matter if the economics work. Here's a rough cost-per-token breakdown for a 70B FP8 model under continuous batching, using approximate on-demand and spot cloud pricing as of mid-2025.
| GPU config | Approx. hourly cost | Tokens/s (batched) | Cost per million tokens |
|---|---|---|---|
| 1x H100 SXM5 | ~$3.00/hr | ~800 | ~$1.04 |
| 1x H200 SXM5 | ~$3.80/hr | ~1,100 | ~$0.96 |
| 1x B200 SXM6 | ~$6.50/hr | ~2,200 | ~$0.82 |
These batched throughput numbers assume moderate concurrency (16-32 requests) where the system is running at decent utilization but not at the hardware ceiling. They'll vary depending on sequence lengths, batch composition, and serving stack efficiency.
A few observations worth noting.
The H200 and B200 have better cost-per-token than the H100 even at higher hourly rates, because their throughput improvements outpace the price premium. A 27% higher H200 price with 38% higher throughput nets out positive for cost efficiency. The B200 is roughly 2.2x the H100's hourly cost, with roughly 2.75x the throughput at moderate batch sizes.
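The cost column in the table follows from one line of arithmetic: hourly price divided by tokens generated per hour. A short sketch, using the approximate prices and throughputs from the table and assuming the GPU is busy for the whole hour:

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_s: float) -> float:
    """USD per one million generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_s * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


# (hourly price in USD, batched tokens/s) from the table above.
configs = {"H100": (3.00, 800), "H200": (3.80, 1100), "B200": (6.50, 2200)}

for name, (hourly, tps) in configs.items():
    print(f"{name}: ${cost_per_million_tokens(hourly, tps):.2f} per million tokens")
```

Swapping in your own negotiated price and measured throughput is usually more informative than the table, since both inputs vary widely by provider and workload.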
However, these cost comparisons shift when you factor in minimum commitment sizes. Many cloud providers require 8-GPU DGX reservations for H200 and B200 nodes. If you need one GPU worth of capacity, you might end up provisioning (and paying for) eight. At that scale, the B200's efficiency advantage needs to offset a fixed infrastructure floor, which only works if your utilization is high.
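One way to reason about the utilization floor is to ask how busy the more expensive GPU must be to match the cheaper one. The sketch below assumes the baseline GPU runs at full utilization and that idle time on the candidate is still billed. Both function names are illustrative.

```python
def effective_cost_per_million(hourly_usd: float, tokens_per_s: float, utilization: float) -> float:
    """Cost per million tokens when the GPU serves traffic only a fraction of the time."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_usd / (tokens_per_s * utilization * 3600) * 1_000_000


def break_even_utilization(candidate_hourly: float, candidate_tps: float,
                           baseline_hourly: float, baseline_tps: float) -> float:
    """Utilization at which the candidate matches a fully utilized baseline on cost per token."""
    return (candidate_hourly / candidate_tps) / (baseline_hourly / baseline_tps)


# B200 versus H100, using the table's approximate prices and throughputs.
threshold = break_even_utilization(6.50, 2200, 3.00, 800)
print(f"B200 must stay above {threshold:.0%} utilization to beat a fully used H100")
```

With the table's numbers the threshold is a little under 80%. The threshold is a ratio of per-GPU costs, so reserving eight GPUs does not change it, but it does mean you need enough traffic to keep all eight that busy.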
Multi-GPU scaling considerations
Once you exceed a single GPU's memory capacity, you need tensor parallelism to split the model across multiple GPUs. The efficiency of that split depends on NVLink bandwidth relative to the all-reduce communication volume.
For 70B FP8 inference on 2 GPUs, NVLink communication is modest: all-reduce traffic is on the order of a few GB/s. All three GPUs handle this without saturation.
For 405B inference across 8 GPUs, it matters more. With H100/H200 (900 GB/s NVLink), the all-reduce operations start consuming a meaningful fraction of available bandwidth at high batch sizes. With B200 (1,800 GB/s NVLink), there's more headroom before NVLink becomes the bottleneck.
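A rough estimate of the all-reduce traffic per decode step shows why batch size matters here. The sketch assumes Megatron-style tensor parallelism (two all-reduces per transformer layer in the forward pass), a ring all-reduce in which each GPU sends 2(n-1)/n times the payload, 16-bit activations, and the published Llama 3.1 405B shape (126 layers, hidden size 16,384). Real collectives may use other algorithms, and at small batch sizes latency rather than bandwidth tends to dominate.

```python
def all_reduce_bytes_per_step(num_layers: int, hidden_size: int, batch_tokens: int,
                              bytes_per_value: int, num_gpus: int) -> float:
    """Bytes each GPU sends over the interconnect for one forward pass."""
    # One activation tensor of shape [batch_tokens, hidden_size] per all-reduce.
    payload = batch_tokens * hidden_size * bytes_per_value
    # Assumption: two all-reduces per layer (after attention and after the MLP).
    all_reduces = 2 * num_layers
    # Assumption: ring all-reduce, where each GPU sends 2 * (n - 1) / n of the payload.
    ring_factor = 2 * (num_gpus - 1) / num_gpus
    return all_reduces * payload * ring_factor


for batch in (1, 64, 256):
    sent = all_reduce_bytes_per_step(126, 16384, batch, bytes_per_value=2, num_gpus=8)
    # Time to move that volume at the two NVLink rates from the spec table.
    print(f"batch {batch}: {sent / 1e6:.1f} MB per step, "
          f"{sent / 900e9 * 1e3:.3f} ms at 900 GB/s, {sent / 1800e9 * 1e3:.3f} ms at 1,800 GB/s")
```

The volume grows linearly with the number of tokens in the step, so the same model that barely touches NVLink at batch 1 can spend a noticeable share of each step on communication at high concurrency.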
A practical rule: if you're planning to run 70B models, a single H200 or B200 handles them without tensor parallelism. If you're running 405B or larger, you'll use multiple GPUs regardless, and the B200's doubled NVLink bandwidth becomes a real advantage.
For most teams serving 7-70B models, multi-GPU is not a daily concern. The 80 GB H100 fits the most common model sizes. The H200 and B200 mostly help by reducing the need for multi-GPU configurations in the first place.
When each GPU makes sense
H100 SXM5 is a reasonable choice when:
- You're deploying 7-70B models in FP8 or INT8
- Your workloads don't require more than 80 GB of memory
- You're on a budget and can tolerate slightly lower throughput
- You need battle-tested hardware with years of production optimization in serving stacks (vLLM, TensorRT-LLM, SGLang all have extensive H100 tuning)
H200 SXM5 is a better fit when:
- You're running large batches with long sequences where KV cache is the constraint
- You're serving 70B models in BF16 and want to fit on a single GPU
- The cost premium is acceptable and you want a straightforward upgrade path from H100 tooling
- You need more memory headroom without moving to a larger multi-GPU configuration
B200 SXM6 makes sense when:
- You're running 405B+ models and want fewer GPUs per inference node
- Maximum tokens/s per GPU is the primary goal and you'll fully utilize the hardware
- You're doing long-context inference where both the bandwidth and memory capacity improvements compound
- Your workloads have enough concurrency to amortize the higher infrastructure cost
If you're evaluating for future scale, the B200's FP4 support is worth noting. FP4 quantization can in principle halve weight size again (vs FP8), which would allow single-GPU inference for models that currently require 2-4 GPUs. Serving-stack support for FP4 is still incomplete, but it's the direction the hardware is designed for.
A note on real-world vs benchmark numbers
The benchmarks in this post are estimates from production hardware observations and vendor-published numbers, not from a controlled test environment. Your actual numbers will vary based on:
- Serving stack: vLLM, TensorRT-LLM, SGLang, and custom runtimes extract different efficiency from the same hardware. Some stacks are more optimized for the B200 than others as of this writing.
- Batch composition: Variable sequence lengths in a real traffic mix behave differently than fixed-length benchmark prompts.
- Quantization method: AWQ, GPTQ, and FP8 each have different memory and bandwidth profiles even at the same bit width.
- KV cache pressure: Benchmarks with short sequences miss the memory pressure of real long-context workloads.
The ratio of improvement between GPU generations is fairly consistent across conditions, even when absolute numbers shift. If you see the H200 perform 40% better than the H100 in one benchmark, expect a similar ratio in your workload. But the absolute tokens/s will depend on your specific setup.
Try it with GeneralCompute
GeneralCompute's infrastructure is purpose-built for inference speed, optimized at the hardware and software stack level for token generation throughput. If you're benchmarking against GPU-based providers, the comparison goes beyond which GPU is in the box -- serving stack efficiency matters as much as raw hardware specs.
You can run the same throughput benchmarks from the LLM throughput comparison methodology against GeneralCompute's API:
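```python
from openai import AsyncOpenAI

# OpenAI-compatible async client pointed at the GeneralCompute endpoint.
client = AsyncOpenAI(
    base_url="https://api.generalcompute.com/v1",
    api_key="your-api-key",  # replace with your own key
)
```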
Get an API key at generalcompute.com and run your own benchmarks against your actual prompts and concurrency levels.