Agent Readout

Activation-Aware Quantization (AWQ) Deep Dive

A close look at how AWQ picks salient weight channels, applies per-channel scaling, and why it consistently beats round-to-nearest 4-bit quantization for LLM inference.

Author
General Compute
Published
2026-04-28
Tags
quantization, awq, inference, llm, optimization



Most quantization writeups stop at "we round the weights to 4 bits and the model still works." That is fine as a marketing line, but it hides the part that actually matters: which weights you keep at higher precision, how you choose them, and why a small amount of per-channel scaling can recover almost all of the lost accuracy. AWQ, short for Activation-aware Weight Quantization, is built around that question. This post goes through the method in detail, including the math, the calibration step, the kernel implications, and the places where AWQ behaves better or worse than the alternatives.

If you only know AWQ as "the quantized format my vLLM or TGI model directory uses," this should fill in the parts in between.

## The setup: why naive 4-bit quantization fails

A linear layer in a transformer computes `Y = X W`, where `X` has shape `[batch * seq, in_features]` and `W` has shape `[in_features, out_features]`. Quantizing the weights means replacing `W` with a low-bit approximation `W_q` such that `W ≈ s * W_q` for some scaling factor `s`. With 4-bit integers, you have 16 possible values per weight, and the scale lets you cover a useful range.

Round-to-nearest (RTN) is the simplest version. For each output channel (or group of channels), you find the maximum absolute weight, divide it by 7 to get a scale (so the signed 4-bit range of -8 to 7 covers the observed values), divide every weight by that scale, and round. It works well for small models. It falls apart for large ones, especially beyond roughly 7B parameters, because of how the activations look.
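To make that concrete, here is a minimal symmetric RTN quantizer in PyTorch. This is a sketch for illustration, not any library's API; the function names are mine.

```python
import torch

def rtn_quantize(w: torch.Tensor):
    """Symmetric round-to-nearest INT4, one scale per output channel.

    w: [in_features, out_features] weight matrix.
    Returns the INT4 codes (held in int8) and per-channel scales.
    """
    # Scale so the largest |weight| in each output channel maps to 7.
    max_abs = w.abs().amax(dim=0, keepdim=True)      # [1, out_features]
    scale = max_abs / 7.0
    # Round to the nearest step and clamp to the signed 4-bit range.
    w_q = torch.clamp(torch.round(w / scale), -8, 7)
    return w_q.to(torch.int8), scale

def rtn_dequantize(w_q, scale):
    return w_q.to(scale.dtype) * scale

w = torch.randn(4096, 4096)
w_q, scale = rtn_quantize(w)
print((rtn_dequantize(w_q, scale) - w).abs().mean())  # mean rounding error
```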

The activations going into a transformer linear layer are not uniformly distributed. A small fraction of input channels carry outlier values that are 10 to 100 times larger than the rest. These outlier channels dominate the layer output. If you treat all weight channels the same way during quantization, you compress the salient ones and the unimportant ones with equal aggression, and the salient ones lose more in absolute terms because they were doing more work.

The earlier fix for this was GPTQ, which uses a second-order error correction loop based on the Hessian of the layer's reconstruction loss. GPTQ is good. It is also slow to calibrate, hard to debug, and tightly coupled to the order in which you process columns. AWQ takes a different and simpler route.

## The AWQ insight

The AWQ paper from MIT and SJTU starts with a small experiment. Take a quantized LLaMA model. Identify the top 1 percent of weight channels by activation magnitude. Keep those at FP16 and quantize the rest to 4 bits. The perplexity gap to the full FP16 model almost disappears. Keep the top 0.1 percent and you still recover most of the loss.

The implication: not all weight channels matter equally, and the ones that matter are exactly the ones whose corresponding input activations are large. That is the activation awareness in the name. The signal that tells you which weights to protect lives in the activations, not in the weight magnitudes themselves.

You could just keep those channels in FP16. That works, but mixed-precision storage is annoying. The kernels are messier, the memory layout is weird, and you lose some of the throughput advantage of pure INT4. AWQ avoids that by doing something cleaner: instead of keeping salient channels at higher precision, it scales them up before quantization and scales the corresponding input channels down at inference time. Mathematically the layer output is unchanged, but in the quantized representation those salient weights now have more bits of effective precision because they fall on the high end of the quantization grid.

## The math, more carefully

Consider a single input channel `i` going into a weight matrix `W`. Multiply the corresponding row of `W` by a scale `s_i > 1`, and divide that input channel by `s_i`. The product `X W` is unchanged:

```
Y = (X * diag(1/s)) * (diag(s) * W)
```

Now quantize `diag(s) * W` to INT4 instead of `W`. The salient rows of `W`, the ones aligned with the channels carrying large activations, have been multiplied by a value greater than 1. Their absolute magnitudes are larger, so they sit near the top of the quantization grid and lose less in relative terms when rounded, while the per-group scale grows only slightly because salient channels are a small fraction of each group.

At inference time, `X * diag(1/s)` is just a per-channel multiplication on the input side, which is cheap and can be folded into the previous LayerNorm or absorbed into the previous projection. The quantized weights are stored as INT4 plus a per-group scale and zero point, exactly the same format you would use for plain RTN.
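Here is a small numerical check of both the identity and the payoff, using a synthetic salient-channel setup and a hand-picked `s` (a sketch; the real method searches for `s`, as the next section describes):

```python
import torch

torch.manual_seed(0)
X = torch.randn(8, 512)
W = torch.randn(512, 512) * 0.02
X[:, :8] *= 50.0             # make the first 8 input channels "salient" outliers

s = torch.ones(512)
s[:8] = 4.0                  # scale up the weight rows that meet large activations

# The exact identity: Y = (X diag(1/s)) (diag(s) W)
Y_ref = X @ W
Y_scaled = (X / s) @ (s[:, None] * W)
print(torch.allclose(Y_ref, Y_scaled, atol=1e-4))    # True up to float error

def quant_dequant(w):        # symmetric INT4 RTN, per output channel
    scale = w.abs().amax(dim=0, keepdim=True) / 7.0
    return torch.clamp(torch.round(w / scale), -8, 7) * scale

# Quantize W directly vs. quantize the scaled weights.
err_plain = (X @ quant_dequant(W) - Y_ref).pow(2).mean()
err_awq = ((X / s) @ quant_dequant(s[:, None] * W) - Y_ref).pow(2).mean()
print(err_plain.item(), err_awq.item())  # the scaled version is typically lower
```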

There is no mixed precision in storage, no special outlier matrix on the side, no second-order error solver. The kernel can be a vanilla INT4 matmul.

## Choosing the scales

The interesting part is picking `s`. Too aggressive and you blow up the dynamic range of unimportant channels and quantize them poorly. Too conservative and you do not protect the salient ones enough. AWQ frames this as a small grid search over a single scalar.

The procedure:

1. Run a calibration set of around 128 samples through the model and collect the average per-channel magnitude `a_i` of activations going into the layer.
2. Define a per-channel scale `s_i = a_i^alpha`, where `alpha` is a single hyperparameter shared across the layer.
3. For a grid of `alpha` values in `[0, 1]`, perform the equivalent transformation, quantize the weights, and measure reconstruction error against the FP16 layer output.
4. Pick the `alpha` that minimizes the reconstruction loss.

A typical search uses 20 values of alpha. The whole search runs in seconds per layer because each iteration is just a matmul and a quantization pass, no gradients. There is no loop over weight columns, no Hessian, and no per-tensor optimization. That is why AWQ calibrates in minutes for a 70B model where GPTQ takes hours.
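A sketch of that search for a single layer, reusing the symmetric INT4 helper from earlier. The names are mine, and the reference `llm-awq` implementation differs in details such as grouping and which modules share a scale:

```python
import torch

def quant_dequant(w):  # symmetric INT4 RTN per output channel, as above
    scale = w.abs().amax(dim=0, keepdim=True) / 7.0
    return torch.clamp(torch.round(w / scale), -8, 7) * scale

def search_awq_scale(x_cal, w, n_grid=20):
    """Grid-search alpha for one linear layer.

    x_cal: [n_tokens, in_features] calibration activations.
    w:     [in_features, out_features] weights.
    Returns the per-channel scales s minimizing reconstruction error.
    """
    y_ref = x_cal @ w                          # full-precision reference output
    a = x_cal.abs().mean(dim=0)                # per-channel magnitude a_i
    best_s, best_err = torch.ones_like(a), float("inf")
    for i in range(n_grid):
        alpha = i / n_grid                     # grid over [0, 1)
        s = a.clamp(min=1e-4) ** alpha
        s = s / (s.max() * s.min()).sqrt()     # keep the scales centered around 1
        w_q = quant_dequant(s[:, None] * w)    # quantize the scaled weights
        err = ((x_cal / s) @ w_q - y_ref).pow(2).mean().item()
        if err < best_err:
            best_s, best_err = s, err
    return best_s
```

Note that `alpha = 0` gives `s = 1` everywhere, which is exactly plain RTN.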

The choice of `alpha` matters more than people sometimes realize. With alpha = 0, every channel gets a scale of 1, which is plain RTN. With alpha = 1, the scales follow the activation magnitudes directly, which over-protects outlier channels and crushes everyone else. The sweet spot is usually somewhere between 0.5 and 0.8 depending on the layer.

## Group sizes, zero points, and the practical layout

AWQ in practice uses group quantization. A group is a contiguous set of weights along the input dimension, typically 64 or 128 elements wide, that share a single scale and zero point. Group quantization is a compromise between per-channel (best accuracy, more metadata) and per-tensor (least metadata, worst accuracy). At group size 128, a 7B model carries roughly 200 MB of scale and zero-point metadata on top of the 3.5 GB of INT4 weights (an FP16 scale and zero point per 128-weight group is about 4 bytes per 64 bytes of packed weights), which is fine.

The quantization is asymmetric. AWQ stores both a scale and a zero point per group, which lets it represent distributions that are not centered around zero. This matters more than you would expect for FFN layers, where the weights of the up and gate projections often have a noticeably skewed distribution.
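A minimal version of one group's asymmetric quantization, assuming an FP16 scale and an integer zero point (names are mine):

```python
import torch

def quantize_group(w_group: torch.Tensor):
    """Asymmetric INT4 for one group sharing a scale and zero point.

    w_group: [group_size] slice of a weight column along the input dimension.
    """
    lo, hi = w_group.min(), w_group.max()
    scale = (hi - lo).clamp(min=1e-8) / 15.0   # 16 unsigned levels: 0..15
    zero = torch.round(-lo / scale)            # integer zero point
    q = torch.clamp(torch.round(w_group / scale) + zero, 0, 15)
    return q.to(torch.uint8), scale, zero

def dequantize_group(q, scale, zero):
    return (q.to(scale.dtype) - zero) * scale

w = torch.randn(128) + 0.3                     # deliberately skewed distribution
q, scale, zero = quantize_group(w)
print((dequantize_group(q, scale, zero) - w).abs().max())
```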

The bit-packing layout is interleaved to match the access pattern of common INT4 matmul kernels. Two 4-bit values are packed into one byte, but the order is shuffled so that a single 32-bit load can fetch eight values that get processed together. This is why you cannot just dump AWQ weights into an arbitrary INT4 kernel; you need a kernel that knows the packing convention. The original `llm-awq` repo ships kernels in CUDA, and vLLM, TGI, and TensorRT-LLM have all adopted compatible variants.
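The basic two-codes-per-byte packing looks like the sketch below; the kernel-specific interleave on top of it varies by implementation, so this intentionally omits the shuffle:

```python
import numpy as np

# Two 4-bit codes per byte, low nibble first. The real AWQ layout additionally
# interleaves values so one 32-bit load fetches eight codes for the CUDA
# kernel; this sketch shows only the basic packing, not that shuffle.
def pack_int4(q: np.ndarray) -> np.ndarray:
    q = q.astype(np.uint8)
    return (q[0::2] & 0xF) | ((q[1::2] & 0xF) << 4)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed & 0xF
    out[1::2] = packed >> 4
    return out

q = np.random.randint(0, 16, size=1024)
assert np.array_equal(unpack_int4(pack_int4(q)), q.astype(np.uint8))
```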

## How AWQ compares to GPTQ

The two methods solve the same problem and end up at similar accuracy on most benchmarks. The differences are mostly operational.

Calibration speed. AWQ is roughly 5 to 20 times faster to calibrate than GPTQ for the same model. On a single A100, GPTQ can take 4 to 6 hours for a 70B model. AWQ finishes in 20 to 40 minutes.

Memory during calibration. GPTQ needs to materialize the Hessian for each layer, which is a `[in_features, in_features]` matrix in FP32. For a 70B model with `in_features` around 8192, that is 256 MB per layer, plus working memory for the inverse. AWQ only needs activation statistics and FP16 layer outputs, which are much smaller.

Robustness. GPTQ is sensitive to the calibration set distribution. If your calibration data does not match the deployment distribution, GPTQ can over-correct on patterns that do not generalize. AWQ is less sensitive because the search space is one-dimensional per layer.

Accuracy ceiling. On well-tuned 4-bit settings with group size 128, GPTQ and AWQ are within 0.1 perplexity points on most LLaMA-class models. AWQ tends to do better on instruction-tuned models with more skewed activations, GPTQ tends to do better on base models, but both are close.

There is a third option, SmoothQuant, which uses a similar input-output rescaling trick but for INT8 activations and INT8 weights. SmoothQuant is what you want for compute-bound INT8 inference. AWQ is what you want for memory-bound INT4 inference, which is the regime almost all decoder workloads sit in.

## When AWQ disappoints

A few cases to know about.

Models with very long input sequences and unusual activation patterns sometimes break the calibration. If you calibrate on short prompts and serve long prompts, the activation statistics shift, and the chosen scales no longer reflect the deployment regime. Recalibrating on representative long-context samples fixes this.

Mixture-of-Experts models are tricky. Each expert has its own activation distribution, and routing means that any individual sample only fires a few experts. Getting reliable per-expert calibration statistics needs a larger and more diverse calibration set. Most serving stacks default to 512 or 1024 samples for MoE models instead of the usual 128.

Quantizing the attention projections is more error-prone than quantizing the MLP. The attention output projection in particular often shows higher quantization error because its activations are the result of a softmax-weighted sum and have less structure than MLP activations. Some implementations use a smaller group size, like 32, just for attention layers.

Stacking AWQ on top of LoRA-merged weights is fine in principle, but you have to do the merge first and calibrate afterward. Calibrating before merging gives you scales that reflect the base model's activation pattern, not the fine-tuned model's, and you lose accuracy.

## The kernel side

AWQ's value at inference time comes from being a clean INT4 format with no mixed precision. The kernel is the same shape as any other group-wise INT4 matmul: load packed weights, dequantize on the fly into shared memory, do the matmul against FP16 activations.
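A reference version of that loop, written in plain PyTorch for clarity rather than speed, and assuming the unpacked group layout from the earlier section (a sketch, not the shipped CUDA code):

```python
import torch

def awq_linear_reference(x, q, scales, zeros, group_size=128):
    """x: [tokens, in_features]; q: [in_features, out_features] uint8 codes;
    scales, zeros: [in_features // group_size, out_features]."""
    in_features = q.shape[0]
    w = torch.empty(q.shape, dtype=x.dtype)
    for g in range(in_features // group_size):
        rows = slice(g * group_size, (g + 1) * group_size)
        # Dequantize one group: (code - zero) * scale, shared across its rows.
        w[rows] = (q[rows].to(x.dtype) - zeros[g]) * scales[g]
    return x @ w   # the real kernel fuses the dequant into the matmul tiles
```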

For decode, where the batch size is small and the workload is memory-bound, AWQ buys you roughly a 3x speedup over FP16 on the linear layers, tracking the 4x bandwidth ratio between FP16 (16 bits per weight) and INT4 (4 bits per weight) minus the cost of scale metadata and on-the-fly dequantization. For prefill, where the workload is compute-bound, the speedup is closer to 1.5x because the multiplies still happen in FP16 after dequantization, so arithmetic throughput does not improve.

You also get the memory footprint reduction, which is often more important than the speedup. A 70B model in FP16 is 140 GB. In AWQ INT4 with group size 128, it is around 38 GB, which fits on a single 48 GB GPU with room for KV cache and a reasonable batch size. That changes the deployment story more than the kernel speedup does.
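The arithmetic is easy to check (the FP16 zero point per group is an assumption; packed 4-bit zeros would shave the metadata further):

```python
params = 70e9
fp16_gb = params * 2 / 1e9                  # 16 bits per weight -> 140 GB
weight_gb = params * 0.5 / 1e9              # 4 bits per weight  -> 35 GB
# Per 128-weight group: one FP16 scale (2 B) + one FP16 zero point (2 B).
meta_gb = params / 128 * 4 / 1e9            # ~2.2 GB
print(fp16_gb, weight_gb + meta_gb)         # 140.0 vs ~37.2
```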

## Calibration data, briefly

People often ask what to calibrate on. The original AWQ paper uses 128 samples from Pile or C4. In practice, a small mix of representative deployment data works better. If your model serves chat, calibrate on chat. If it serves code, calibrate on code. The activation statistics shift between these regimes, and the scales follow.

The number of samples does not need to be large. The optimization is not learning anything; it is computing per-channel statistics. 128 samples of around 2048 tokens each is enough for stable statistics on a 70B model. Going to 1024 samples helps for MoE.
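Collecting those statistics takes a few lines with forward hooks. In this sketch, `model` and `calib_loader` are placeholders for your model and calibration batches:

```python
import torch

stats = {}

def make_hook(name):
    def hook(module, inputs, output):
        x = inputs[0].detach()                       # [*, in_features]
        mag = x.abs().reshape(-1, x.shape[-1]).mean(dim=0)
        # Running mean of per-channel magnitude across calibration batches.
        prev, n = stats.get(name, (torch.zeros_like(mag), 0))
        stats[name] = (prev + (mag - prev) / (n + 1), n + 1)
    return hook

handles = [m.register_forward_hook(make_hook(n))
           for n, m in model.named_modules()
           if isinstance(m, torch.nn.Linear)]

with torch.no_grad():
    for batch in calib_loader:                       # ~128 samples is enough
        model(batch)

for h in handles:
    h.remove()
```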

## Putting it together

AWQ is, in the end, a fairly simple idea wrapped in careful engineering. Find the input channels that carry large activations, scale the corresponding weight rows up before quantization (and the matching input channels down at inference), and store the per-group scales next to the INT4 weights. The kernel is plain. The calibration is fast. The accuracy is competitive with GPTQ at a fraction of the work. For most production 4-bit deployments today, AWQ is the format that ends up in the model weights directory, and it is worth knowing why.

If you want to try AWQ on your own workload, the `llm-awq` repo is the reference implementation, and most major inference stacks (vLLM, TGI, TensorRT-LLM) load AWQ checkpoints natively. Calibrate on data that resembles your deployment, pick a group size of 128 unless accuracy says otherwise, and verify perplexity on a held-out slice before shipping.

If you are running fast 4-bit inference at scale and want to push throughput further, take a look at General Compute's API for what custom ASIC infrastructure looks like under the same OpenAI-compatible interface you already use.