
Quantization for Inference: GPTQ, AWQ, SmoothQuant, and FP8

Quantization shrinks model weights from 16-bit to 4-bit or 8-bit, cutting memory usage and speeding up inference. Here's how the major techniques work and when to use each one.

Author: General Compute
Published: 2026-03-26
Tags: inference, papers, deep-dive

A 70 billion parameter model stored in FP16 (16-bit floating point, the standard precision for LLMs) takes about 140GB of memory. That's two A100 80GB GPUs just to load the weights, before you even account for the KV cache and other overhead.

Quantization reduces the precision of those weights, from 16 bits down to 8 or even 4 bits per parameter. A 70B model in 4-bit takes about 35GB, fitting on a single GPU. The model runs faster (less data to move from memory) and uses less memory (room for more concurrent requests), with surprisingly little quality loss.
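The arithmetic behind those figures is simple enough to sketch (weights only; KV cache and framework overhead come on top):

```python
def model_memory_gb(params_billion, bits_per_param):
    """Weight memory for a dense model at a given precision.
    Excludes KV cache, activations, and framework overhead."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9  # decimal GB, matching the rough figures above

print(model_memory_gb(70, 16))  # 140.0 -> two 80GB GPUs just for weights
print(model_memory_gb(70, 4))   # 35.0  -> fits on a single GPU
```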

The catch is that naive quantization (just rounding everything to lower precision) destroys model quality. The four techniques covered here each found clever ways to quantize accurately.

## Weight-Only vs. Weight-and-Activation Quantization

Before diving into specific methods, it helps to understand the two main categories.

**Weight-only quantization** (GPTQ, AWQ) shrinks the stored model weights to 4-bit or 8-bit, but during computation those weights get dequantized (converted back) to FP16 before the actual matrix multiplication happens. The speed benefit comes entirely from reduced memory traffic: 4-bit weights are a quarter the size of 16-bit weights, so they can be read from memory up to 4x faster. The math itself still runs in FP16.
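A minimal sketch of the storage scheme: symmetric round-to-nearest with per-group scales, roughly how weight-only checkpoints are laid out. This is a simplified illustration, not either paper's algorithm; real kernels additionally pack two 4-bit values into each byte.

```python
import numpy as np

def quantize_weights(W, bits=4, group_size=128):
    """Symmetric round-to-nearest weight-only quantization with one
    scale per group of `group_size` consecutive weights."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    Wg = W.reshape(-1, group_size)
    scale = np.abs(Wg).max(axis=1, keepdims=True) / qmax
    Q = np.clip(np.round(Wg / scale), -qmax - 1, qmax).astype(np.int8)
    return Q, scale

def dequantize(Q, scale, shape):
    """At inference time, weights are expanded back to FP16 before the matmul."""
    return (Q.astype(np.float16) * scale.astype(np.float16)).reshape(shape)
```

The per-group scale is what keeps quality acceptable: one badly scaled outlier only ruins its own group of 128 weights, not the whole tensor.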

This is ideal for the decode phase (generating tokens one at a time), which is almost entirely memory-bandwidth-bound. You're spending most of your time reading weights, so making them smaller directly speeds things up.

**Weight-and-activation quantization** (SmoothQuant, FP8) quantizes both the weights and the input activations (the data flowing through the network), so the actual matrix multiplication runs in lower precision (INT8 or FP8) on specialized hardware (tensor cores). This speeds up both the memory transfer and the compute.

This helps most during the prefill phase (processing the input prompt), which is more compute-bound because you're processing many tokens in parallel. Faster math means faster prefill.

## GPTQ: The First Practical Large-Model Quantization

GPTQ (Frantar et al., October 2022) was the first method to make post-training quantization (quantizing after training, without retraining) work well on models with 100B+ parameters.

The core idea builds on Optimal Brain Quantization (OBQ), which in turn descends from the classic Optimal Brain Surgeon pruning framework. GPTQ quantizes weights one column at a time, and after quantizing each column, it adjusts the remaining unquantized columns to compensate for the error introduced. The adjustment uses second-order information (based on the Hessian matrix, which captures how sensitive the layer's output is to changes in each weight) computed from a small calibration dataset.
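The column-by-column loop can be sketched as follows. This is a toy version under simplifying assumptions (a single symmetric scale, plain matrix inversion); the real algorithm uses per-group scales, lazy batched updates, and a Cholesky factorization to make it fast.

```python
import numpy as np

def gptq_quantize(W, X, bits=4):
    """Toy GPTQ-style quantizer. W: (out_features, in_features) weights,
    X: (in_features, n_samples) calibration activations. Quantize columns
    left to right; after each, shift its quantization error onto the
    not-yet-quantized columns, weighted by the inverse Hessian."""
    H = 2.0 * X @ X.T                                    # Hessian of layer loss
    H += 1e-2 * np.mean(np.diag(H)) * np.eye(H.shape[0]) # damping for stability
    Hinv = np.linalg.inv(H)

    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax        # single symmetric scale (toy choice)
    W = W.copy()
    Q = np.zeros(W.shape, dtype=np.int8)
    for j in range(W.shape[1]):
        q = np.clip(np.round(W[:, j] / scale), -qmax - 1, qmax)
        Q[:, j] = q.astype(np.int8)
        err = (W[:, j] - q * scale) / Hinv[j, j]
        # Compensate: nudge remaining columns to cancel this column's error.
        W[:, j + 1:] -= np.outer(err, Hinv[j, j + 1:])
    return Q, scale
```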

The key practical innovation was making this process fast enough to run on large models. GPTQ can quantize a 175B parameter model in a few hours on a single GPU, which was previously impractical.

**Results:** 3-bit and 4-bit quantization with minimal accuracy loss on models up to 175B parameters. A 4-bit quantized 70B model fits on a single 80GB GPU and runs roughly 2-3x faster than the FP16 version due to reduced memory bandwidth.

**Tradeoff:** Weight-only, so the compute itself is still FP16. The speedup comes purely from less memory to read.

## AWQ: Activation-Aware Weight Quantization

AWQ (Lin et al., MIT Han Lab, June 2023) took a different approach. Instead of compensating for error after quantization, it identified which weights are most important to preserve accurately before quantizing.

The key observation: only about 1% of weight channels are "salient" (critically important for output quality), and you can identify them by looking at the activation magnitudes (how large the values flowing through the network are at each position), not the weight magnitudes. Channels that see large activations are the ones where quantization error hurts the most.

AWQ applies a mathematically equivalent scaling transformation that makes the salient channels larger (and therefore less affected by rounding) while making less important channels smaller. After this transformation, standard quantization works much better because the important information is protected.
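The scaling trick is easy to see in code. A simplified sketch (the paper grid-searches the scaling exponent `alpha` and normalizes the scales; this version hard-codes `alpha` and skips both):

```python
import numpy as np

def awq_transform(W, act_scale, alpha=0.5):
    """Toy AWQ-style equivalent scaling. W: (out_features, in_features),
    act_scale: per-input-channel activation magnitude. Grows the channels
    that see large activations so rounding error hurts them less."""
    s = act_scale ** alpha      # bigger activations -> bigger scale
    W_scaled = W * s[None, :]   # fold s into the weights...
    return W_scaled, s          # ...and divide activations by s at runtime
```

Because the activations are divided by the same per-channel factor the weights are multiplied by, `(X / s) @ W_scaled.T` equals `X @ W.T` exactly; only the rounding behavior after quantization changes.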

**Results:** Generally shows less accuracy degradation than GPTQ, especially at very low bit-widths (3-bit). Won the MLSys 2024 Best Paper Award.

**Why it matters for serving:** AWQ is hardware-friendly because it doesn't use mixed-precision (which would require special handling). All weights are the same bit-width, making kernel implementation straightforward. This is why AWQ is widely supported in vLLM, TensorRT-LLM, and other serving frameworks.

## SmoothQuant: Making Activation Quantization Work

Both GPTQ and AWQ only quantize weights. SmoothQuant (Xiao et al., November 2022) tackled the harder problem of quantizing activations too, enabling W8A8 (8-bit weights and 8-bit activations) inference.

The problem with quantizing activations is that they contain outliers. A few channels in the activation tensors have values that are 10-100x larger than the rest. If you quantize to INT8 (which has a range of -128 to 127), these outliers either get clipped (destroying information) or force the entire quantization range to be so wide that the normal values lose all precision.

SmoothQuant's insight: migrate the difficulty from activations to weights. It applies a per-channel scaling factor that divides the activation outliers by a constant and multiplies the corresponding weights by the same constant. This is a mathematically equivalent transformation (the model computes the same result), but after applying it, the activations are much smoother and easier to quantize.
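The migration fits in a few lines. A simplified sketch using the paper's scale formula with a fixed `alpha = 0.5` (the migration strength, chosen per model in practice):

```python
import numpy as np

def smoothquant_migrate(X, W, alpha=0.5):
    """SmoothQuant migration. X: (n_tokens, in_features) activations,
    W: (in_features, out_features) weights. Per-channel scale
    s_j = max|X_j|^alpha / max|W_j|^(1 - alpha): divide activations by s,
    multiply weights by s, and Y = (X / s) @ (s * W) is unchanged."""
    s = (np.abs(X).max(axis=0) ** alpha) / (np.abs(W).max(axis=1) ** (1 - alpha))
    return X / s[None, :], W * s[:, None], s
```

After the transform, the outlier channels in the activations have been shrunk (their range migrated into the weights, which are well-behaved and tolerate it), so both tensors quantize cleanly to INT8.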

**Results:** Up to 1.56x inference speedup and 2x memory reduction on models like OPT-175B and BLOOM-176B with negligible accuracy loss. Because both weights and activations are in INT8, the actual matrix multiplication runs on INT8 tensor cores, which are faster than FP16 tensor cores.

**Why it's different from GPTQ/AWQ:** The speedup comes from faster math, not just less memory to read. This matters most for compute-bound workloads (large batch sizes, prefill).

## FP8: The New Standard

FP8 quantization (8-bit floating point) emerged in 2023-2024, enabled by hardware support on NVIDIA's Hopper (H100) and Ada Lovelace GPUs.

Unlike INT8 (which has a fixed range and uniform spacing between values), FP8 is a floating-point format with an exponent and mantissa, giving it a wider dynamic range. This makes it much easier to apply to both weights and activations without the outlier problems that SmoothQuant had to work around.

There are two FP8 formats: E4M3 (4 exponent bits, 3 mantissa bits, better precision) and E5M2 (5 exponent bits, 2 mantissa bits, wider range). Typically E4M3 is used for weights and forward-pass activations, while E5M2 is used for gradients during training.
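The maximum finite values of the two formats follow directly from the bit layouts (per the OCP FP8 conventions: E4M3 gives up infinities to gain one extra usable exponent value, while E5M2 is IEEE-style):

```python
def fp8_max_finite(exp_bits, man_bits, ieee_style):
    """Largest finite value of an FP8 format, OCP FP8 conventions.
    ieee_style=True  (E5M2): the whole top exponent is reserved for inf/NaN.
    ieee_style=False (E4M3): only mantissa=all-ones at the top exponent is
    NaN, so the top exponent is usable but its largest mantissa code is lost."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_style:
        e_max = (2 ** exp_bits - 2) - bias   # last exponent code reserved
        frac = 2 - 2 ** (-man_bits)          # mantissa all-ones is a number
    else:
        e_max = (2 ** exp_bits - 1) - bias   # top exponent still usable
        frac = 2 - 2 ** (1 - man_bits)       # mantissa all-ones is NaN
    return frac * 2.0 ** e_max

print(fp8_max_finite(4, 3, ieee_style=False))  # 448.0   (E4M3)
print(fp8_max_finite(5, 2, ieee_style=True))   # 57344.0 (E5M2)
```

That 128x difference in range (448 vs. 57344) is exactly the precision-vs-range tradeoff described above: E4M3's extra mantissa bit buys precision for weights and activations, E5M2's extra exponent bit buys range for gradients.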

**Results:** ~33% improvement in tokens/s and 8.5% lower TTFT compared to FP16 on H100s. FlashAttention-3 integrates FP8 support, achieving 1.2 PFLOPs/s for attention computation.

**Why it's winning:** FP8 is simpler to apply than INT8 quantization (fewer calibration issues), has native hardware support on modern GPUs, and the quality loss is minimal. It's rapidly becoming the default precision for inference on H100s.

## When to Use What

| Method | Precision | Type | Best For | Quality Impact |
|---|---|---|---|---|
| GPTQ | 4-bit | Weight-only | Fitting large models on small GPUs | Low |
| AWQ | 4-bit | Weight-only | Production serving, best 4-bit quality | Very low |
| SmoothQuant | W8A8 | Weight + activation | Compute-bound workloads, large batches | Very low |
| FP8 | 8-bit | Weight + activation | H100/H200 inference, general purpose | Minimal |

For most production deployments on modern hardware: use FP8 if you have Hopper GPUs, AWQ if you need 4-bit to fit the model in memory.

## How ASICs Change the Equation

Quantization techniques were developed primarily to work around GPU limitations: limited memory capacity, limited memory bandwidth, and the desire to use specialized low-precision tensor cores. Each technique is a software solution to make models fit and run faster on hardware that wasn't designed specifically for inference.

General Compute runs on inference-optimized ASICs that handle precision and memory differently at the hardware level. Our chips are designed from the ground up for the data types and access patterns that transformer inference uses, with native support for the precision formats that matter most for serving. We don't need to choose between "fits in memory" and "runs fast" because the hardware was designed with both in mind.

The result is that we can serve models at full quality and speed without the tradeoffs that GPU-based providers have to make around quantization. [Sign up at generalcompute.com](https://generalcompute.com) and get $5 in free credit to try it out.

## Papers and References

- [GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers](https://arxiv.org/abs/2210.17323) (Frantar et al., 2022 -- ICLR 2023)
- [AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://arxiv.org/abs/2306.00978) (Lin et al., 2023 -- MLSys 2024 Best Paper)
- [SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models](https://arxiv.org/abs/2211.10438) (Xiao et al., 2022)
- [An Investigation of FP8 Across Accelerators for LLM Inference](https://arxiv.org/abs/2502.01070) (2025)