# FP8 Training and Inference: The Precision Sweet Spot
Why 8-bit floating point hits a different point on the accuracy/throughput curve than INT8, how E4M3 and E5M2 are used in practice, and what FP8 actually buys you in production serving.
- Author: General Compute
- Published: 2026-04-29
- Tags: fp8, quantization, inference, training, hopper, blackwell
For a long time, the default story for low-precision LLM serving was "train in BF16, quantize the weights to INT4 or INT8, hope nothing important breaks." That story is being replaced by something simpler. With Hopper and now Blackwell hardware, FP8 is a first-class numeric format both for training and for inference, and it sits in a useful spot on the precision/throughput curve. You get roughly half the memory of BF16, double the matmul throughput, and accuracy much closer to BF16 than INT8 typically gets you.

This post is about what FP8 actually is, why two flavors exist, how training and inference use them differently, and where the format wins or loses against the alternatives.

## What FP8 actually is

FP8 is an 8-bit floating point number. Like FP16 or BF16, it has a sign bit, an exponent field, and a mantissa field. Unlike INT8, it is not a uniform grid of values. The representable numbers cluster densely near zero and spread out exponentially as magnitude grows, which is exactly what you want for tensors whose values span several orders of magnitude.

There are two common FP8 formats, both standardized in the OCP (Open Compute Project) FP8 spec and supported in NVIDIA Hopper and Blackwell tensor cores:

- **E4M3**: 1 sign bit, 4 exponent bits, 3 mantissa bits. Range is roughly +/- 448, with finer resolution near zero. Used for weights and activations.
- **E5M2**: 1 sign bit, 5 exponent bits, 2 mantissa bits. Range is roughly +/- 57344, with coarser resolution. Used for gradients and any tensor with a wide dynamic range.

E4M3 sacrifices range for precision; E5M2 sacrifices precision for range. The choice between them is not a knob you tune per layer in production. It is dictated by the role the tensor plays. Weights and forward activations are bounded enough that E4M3 fits. Gradients during the backward pass can have outliers many orders of magnitude away from the median, and they need E5M2 just to avoid overflowing.

E5M2 is also IEEE 754 binary16 (FP16) with the mantissa truncated from 10 bits to 2: the exponent field is identical, so it keeps FP16's conventions for infinities and NaNs, and converting between the two formats is just a mantissa truncation or zero-extension. Most software stacks expose it as `float8_e5m2`.

## Why FP8 and not INT8

INT8 quantization has been around for years. It works. It is also picky in ways that FP8 is not.

INT8 is uniform. Every value step is the same size. To represent a tensor whose values range from -50 to +50, you compute a scale, divide everything by the scale, and round to integers in the range -128 to +127. If the tensor has a few outliers at +200, you either clip them (losing information) or you stretch the scale to cover them (losing resolution everywhere else). Activation outliers in transformers are exactly this problem, and a large fraction of the quantization literature is dedicated to managing them: SmoothQuant, AWQ, GPTQ with grouped scales, per-channel quantization, mixed-precision rescue layers.

FP8 sidesteps a lot of that. The exponent field gives you orders-of-magnitude coverage natively. An activation tensor with a few channels in the hundreds and most channels near 0.1 fits inside E4M3 without per-channel surgery, because the format is already logarithmic. You still apply a tensor-wide scale, usually derived from the tensor's running maximum (`amax`) and the format's maximum representable value, but the format does most of the work that quantization-aware training has to do for INT8.

The tradeoff is that FP8 has fewer bits of mantissa than you might want. E4M3's 3 mantissa bits give you only 8 representable steps between adjacent powers of 2. So FP8 is not strictly better than INT8 in every case. It is better at handling dynamic range, often comparable on accuracy, and for the moment it gets kernel and library investment on Hopper that INT8 does not.
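To make the outlier argument concrete, here is a minimal sketch comparing per-tensor INT8 against per-tensor-scaled E4M3 on an outlier-heavy tensor. It assumes a recent PyTorch with the `torch.float8_e4m3fn` dtype; the tensor shape and outlier pattern are made up for illustration.

```python
import torch

# A toy activation tensor: mostly small values plus a few large
# outliers, the pattern that breaks per-tensor INT8 quantization.
torch.manual_seed(0)
x = torch.randn(4096) * 0.1
x[::512] = 50.0  # eight outlier elements

def roundtrip_int8(t: torch.Tensor) -> torch.Tensor:
    # Per-tensor symmetric INT8: one uniform grid stretched over the max.
    scale = t.abs().max() / 127.0
    q = torch.clamp((t / scale).round(), -128, 127)
    return q * scale

def roundtrip_fp8_e4m3(t: torch.Tensor) -> torch.Tensor:
    # Per-tensor amax scaling into E4M3 (max representable value 448),
    # then a round trip through the FP8 dtype.
    scale = 448.0 / t.abs().max()
    q = (t * scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)
    return q.to(torch.float32) / scale

small = x.abs() < 1.0  # the non-outlier values
for name, fn in [("int8", roundtrip_int8), ("fp8_e4m3", roundtrip_fp8_e4m3)]:
    err = (fn(x) - x).abs()
    print(f"{name}: mean |error| on small values = {err[small].mean().item():.6f}")
```

On this kind of tensor, the INT8 grid spends almost all of its resolution covering the outliers, while E4M3 keeps several significant bits for the small values. That is the whole argument in miniature.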
## Hardware support

H100 tensor cores expose FP8 matmul at roughly 2x the throughput of BF16 matmul. The actual numbers depend on the SKU and the matmul shape, but for a large square matmul on H100, BF16 peaks near 990 TFLOPS, FP8 near 1980 TFLOPS, and FP8 with structured sparsity roughly 3960 TFLOPS. Memory bandwidth is unchanged, but every value you move is now one byte instead of two, so for memory-bound kernels (decode, in particular) you also pick up close to 2x.

Blackwell pushes this further. B100 and B200 add support for FP6 and FP4 alongside FP8, with similar 2x scaling between adjacent formats. They also add per-block scaling formats (MXFP8, MXFP6, MXFP4) where the scaling factor is shared across a small block (often 32 elements) rather than across the whole tensor. The block-scaled formats give you most of the dynamic-range robustness of per-tensor scaling while letting individual blocks adapt, which matters for activations with concentrated outliers.

AMD MI300X supports FP8 matmul. Intel Gaudi 3 supports FP8. The hardware story is no longer NVIDIA-only. The kernel ecosystem still leans heavily on NVIDIA tooling (Transformer Engine, cuBLASLt, FlashAttention with FP8), and that gap is more pronounced than the raw silicon gap.

## FP8 in training

Training a large model in FP8 is not the same as inference. The gradients have a much wider dynamic range than the weights or activations, and a single misbehaved gradient can either underflow to zero or overflow to infinity in E4M3. The standard recipe handles this with a few moving parts:

1. Forward activations and weights are stored in E4M3 with per-tensor scales.
2. Backward gradients use E5M2 with per-tensor scales.
3. Master weights stay in BF16 or FP32. Updates accumulate in higher precision.
4. Scaling factors are updated continuously based on the running maximum of recent tensors, often called the `amax` history.

NVIDIA's Transformer Engine library is the reference implementation. It wraps `nn.Linear`, `nn.LayerNorm`, and the attention path so that the user-facing API stays in BF16 while the matmul kernels run in FP8 internally. The library tracks per-tensor `amax` over the last N steps, picks a scale that puts the maximum near the top of the FP8 range, and falls back to higher precision for any layer where the gradient distribution is too pathological.

The actual training results, from the original FP8-LM paper (Microsoft and NVIDIA, 2023) and follow-ups: GPT-class models trained from scratch in FP8 reach the same final loss as BF16 within ~0.5% of validation perplexity, with roughly 35% wall-clock speedup and 30-40% memory savings. The numbers depend on the model size and the implementation, but the headline is that FP8 training is not an accuracy compromise. It is a hardware utilization win.

The catch: scale management is fiddly. If your `amax` history is too short, scales overshoot and you saturate. If it is too long, scales lag actual tensor magnitudes and you waste range. Most implementations use a 16- or 32-step rolling maximum with a small margin factor. Block-scaled formats on Blackwell remove most of this fiddliness because the scale adapts at the block level automatically.
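In code, the recipe above reduces to a few lines with Transformer Engine. This is a minimal sketch assuming the `transformer_engine` PyTorch API (`te.fp8_autocast`, `recipe.DelayedScaling`) as of recent 1.x releases; argument names have shifted between versions, and the layer sizes here are arbitrary.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# HYBRID = E4M3 for forward tensors, E5M2 for gradients,
# matching the recipe described above.
fp8_recipe = recipe.DelayedScaling(
    margin=0,                         # headroom below the FP8 max
    fp8_format=recipe.Format.HYBRID,
    amax_history_len=16,              # rolling max over the last 16 steps
    amax_compute_algo="max",
)

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(8, 4096, device="cuda", requires_grad=True)

# The user-facing tensors stay in higher precision; the matmul
# runs in FP8 inside the autocast region.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
y.sum().backward()  # backward matmuls use E5M2 internally
```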
## FP8 in inference

For inference, the FP8 setup is simpler because you do not have gradients to worry about. The two main forms in production are weight-only FP8 and full FP8.

**Weight-only FP8** stores the weights in E4M3 with a per-tensor or per-channel scale. Activations stay in BF16. At matmul time, weights are dequantized on the fly to BF16 inside the kernel. You get half the weight memory and half the weight bandwidth, but the matmul itself runs at BF16 throughput. This is mostly a memory and bandwidth optimization. It helps decode (which is memory-bound) more than prefill.

**Full FP8** quantizes both weights and activations to E4M3, runs the matmul in FP8, and accumulates in FP32. This is where you get the 2x compute advantage. The accuracy cost is small but nonzero: typical reports for Llama 70B class models show 0.1 to 0.3 points of perplexity degradation on standard benchmarks, with most of that recovered by calibrating activation scales against a small dataset. For chat and coding tasks, the win-rate difference against BF16 is usually inside the noise floor.

KV cache in FP8 is its own thing. The KV cache for long-context serving is often the dominant memory consumer, and storing K and V in E4M3 cuts its size in half compared to BF16. This buys you longer context, larger batch size, or both. The accuracy cost is again small if you per-tensor-scale K and V at write time, and slightly larger if you do not. Most serving frameworks (vLLM, TensorRT-LLM, SGLang) support FP8 KV cache as a flag.

A detail worth knowing: FP8 KV cache and FP8 attention matmul are not the same toggle. You can store the cache in FP8 and run the attention scores in BF16 (with on-the-fly dequant), which captures the memory benefit without changing the attention numerics. Or you can run the entire attention path in FP8, which captures the compute benefit on Hopper but is more sensitive to scaling. The right choice depends on whether your bottleneck is HBM capacity or attention TFLOPS.
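As a concrete starting point, here is a hedged sketch of turning both toggles on in vLLM. The flag names (`quantization="fp8"` for on-the-fly weight quantization, `kv_cache_dtype="fp8"` for the cache) match recent vLLM releases but have moved between versions, so check the docs for yours; the model name and sharding are just examples.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # example BF16 checkpoint
    quantization="fp8",       # quantize weights to E4M3 at load time
    kv_cache_dtype="fp8",     # store K/V in 8 bits, halving cache size
    tensor_parallel_size=4,   # example sharding for a 70B model
)

outputs = llm.generate(
    ["Explain FP8 quantization in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

Whether the attention matmuls themselves run in FP8 is a separate, backend-dependent question; with the settings above you are mainly buying the memory side of the trade.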
## How FP8 compares to the alternatives

Rough intuition for a 70B-class model on H100:

| Format | Weight size | Decode tokens/sec (relative) | Quality (vs BF16) |
|--------|-------------|------------------------------|-------------------|
| BF16 | 140 GB | 1.0x | baseline |
| FP8 | 70 GB | ~1.8x | within 0.3 PPL |
| INT8 (W8A8) | 70 GB | ~1.7x | 0.3 to 0.8 PPL |
| INT4 (AWQ) | 35 GB | ~2.5x | 0.5 to 1.5 PPL |

The FP8 vs INT8 comparison is the interesting one. Their memory footprint is identical and their throughput is similar. FP8 wins on accuracy in practice because the format absorbs activation outliers without bespoke calibration. INT4 wins on memory and throughput at a noticeable accuracy cost, and is the right choice when you are bandwidth-bound on smaller GPUs.

For training, FP8 is the only viable sub-BF16 format right now. INT8 training exists in research but is not production-ready. The training story is BF16 (the conservative default) or FP8 (the throughput win), and the gap between them keeps shrinking as Transformer Engine and the Blackwell formats mature.

## Where FP8 still hurts

Two practical pain points.

The first is calibration. Activation scales for FP8 inference are usually picked from a calibration set, similar to INT8. Pick a bad calibration set (too short, too narrow in domain, missing the long-tail distributions) and your scales are wrong, your activations clip or underflow, and accuracy drops more than it should. The fix is to use a calibration set that covers the actual production traffic distribution, not just a generic English corpus.

The second is kernel coverage. FP8 matmul is well-supported. FP8 attention, especially with sliding-window or paged-attention layouts, is less mature. Most serving stacks fall back to BF16 attention even when the matmuls run in FP8, which limits the speedup to maybe 1.4x to 1.6x rather than the theoretical 2x. Closing that gap is mostly a kernel engineering problem, and it is being closed quickly, but it is not fully there yet for every attention variant in the wild.

## Why this matters for serving

For latency-sensitive workloads (voice agents, real-time coding assistance, anything with a sub-second budget per turn), the FP8 throughput advantage is large enough to change the deployment shape. A model that needs two H100s in BF16 often fits on one H100 in FP8 with the same context length. A model that runs at 50 tokens/sec per request in BF16 hits 90 to 100 in FP8. KV cache memory is halved, which means more concurrent users per GPU.

The net effect is that the cost per token at a given latency target drops by roughly 40 to 50% with full FP8 over BF16, with quality differences that fall inside benchmark noise on most chat and coding tasks. That is a much better deal than INT8 typically delivers, and it does not require the calibration headaches of INT4. If you are serving an open-weights model in production today, FP8 is the precision you should test against your actual workload before you reach for anything more aggressive. It is the boring answer that happens to be correct most of the time.

If you want to see what well-tuned FP8 inference looks like end-to-end, [General Compute](https://generalcompute.com) runs models in FP8 by default on hardware tuned for low-latency serving. The OpenAI-compatible API gives you a fast path to compare against your current setup without rewriting your stack.