Quantization Explained: INT4, GGUF, GPTQ and What They Mean for Your Model
A practical guide to LLM quantization: what INT4, GGUF, and GPTQ actually do, how much quality you lose, and how to quantize a model yourself with llama.cpp and AutoGPTQ.
- Author: General Compute
- Published: 2026-06-01
- Tags: inference, quantization, tutorial
If you have tried to run an open model locally, you have probably hit a wall: the model is 140GB in full precision and your GPU has 24GB of memory. Quantization is the technique that closes that gap. It shrinks the model by storing its weights in fewer bits, and done well it costs you almost nothing in output quality.
The problem is that the ecosystem is full of acronyms: INT4, INT8, GGUF, GPTQ, AWQ, Q4_K_M. They are not interchangeable, and picking the wrong one can be the difference between a model that runs well and one that produces garbage. This guide explains what each term means, how much quality you actually lose, and how to quantize a model yourself.
## What Quantization Actually Does
A model's weights are just numbers. By default they are stored as 16-bit floating point values (FP16 or BF16), where each weight takes 2 bytes. A 70B parameter model in FP16 needs roughly 140GB just to hold the weights, before you account for the KV cache or activations.
Quantization stores those same weights in fewer bits. Instead of 16 bits per weight, you use 8 bits (INT8), 4 bits (INT4), or even less. The math is straightforward: going from 16-bit to 4-bit cuts memory by 4x. That 70B model drops from 140GB to about 35GB, which suddenly fits on a single 48GB GPU.
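The arithmetic above is easy to reproduce. Here is a minimal sketch; the helper name `weight_memory_gb` is made up for this post, and it counts only the weights, ignoring the small overhead of scale factors as well as the KV cache and activations.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # 70B parameters at the three bit widths discussed in this guide
    for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        print(f"70B @ {label}: {weight_memory_gb(70e9, bits):.0f} GB")
```

Swap in your own parameter count to see whether a given bit width has a chance of fitting on your hardware.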
The core operation is mapping a range of floating point values onto a smaller set of integers. You take a block of weights, find the min and max, and scale them so they fit into the integer range. You store the integers plus a scale factor (and sometimes a zero point). At inference time you reverse the process to approximately recover the original values.
The word "approximately" is where all the interesting tradeoffs live. You cannot represent 65,536 distinct FP16 values with only 16 distinct INT4 values, so you lose information. The entire field of quantization research is about losing that information in the places where it matters least.
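To make the min/max mapping concrete, here is a small pure-Python sketch of quantizing one block of weights and recovering it. The function names and the sample weights are invented for illustration, and the block minimum is stored as a float offset that plays the role of a zero point. Real libraries pack the codes into bytes and use tuned kernels, so treat this as a picture of the idea and not how any particular tool does it.

```python
def quantize_block(weights, bits=4):
    """Asymmetric min/max quantization of one block of floats.

    Returns integer codes in [0, 2**bits - 1], the scale, and the block
    minimum (stored here as a float offset, acting like a zero point).
    """
    levels = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels
    if scale == 0.0:
        scale = 1.0  # constant block: every code is 0, any scale works
    codes = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize_block(codes, scale, w_min):
    """Approximate reconstruction of the original floats."""
    return [c * scale + w_min for c in codes]


block = [-0.42, -0.10, 0.00, 0.07, 0.31, 0.90]  # made-up weights
codes, scale, w_min = quantize_block(block, bits=4)
restored = dequantize_block(codes, scale, w_min)
worst = max(abs(a - b) for a, b in zip(block, restored))
print(codes)
print(f"scale={scale:.4f}, worst-case error={worst:.4f}")
```

The worst-case error is bounded by half the scale, which is why a single large outlier in a block (it stretches the min/max range) hurts every other weight in that block.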
## Bit Widths: INT8, INT4, and Below
The bit width is the most important number. It directly determines how much memory you save and roughly how much quality you risk.
**INT8 (8-bit).** Half the size of FP16. INT8 is the safe choice. For most models the quality difference versus full precision is negligible, often within measurement noise on standard benchmarks. If you just want to fit a model in less memory and you do not want to think hard about it, INT8 is the default that almost always works.
**INT4 (4-bit).** A quarter the size of FP16. This is where quantization gets interesting, because the savings are large and the quality cost is usually small but not zero. A well-quantized INT4 model loses a few tenths of a point on most benchmarks and is often indistinguishable in casual use. INT4 is the sweet spot for running large models on consumer or single-GPU hardware.
**Below 4-bit.** INT3, INT2, and 1.58-bit schemes exist. They save more memory but the quality degradation becomes noticeable, and the methods needed to keep them usable get complicated. For production work, 4-bit is usually the practical floor. Going lower is research territory or a last resort when memory is extremely tight.
A useful way to think about it: the larger the model, the more aggressively you can quantize it. A 70B model at INT4 often outperforms a 13B model at FP16 while using roughly comparable memory (about 35GB versus 26GB of weights), because the larger model has more redundancy to spare. Small models (7B and under) are more sensitive and lose more from aggressive quantization.
## Perplexity: Measuring the Damage
When people compare quantization methods, they usually report perplexity. Perplexity measures how well a model predicts a held-out text sample. Lower is better. A perfectly preserved model has the same perplexity as the original; a damaged one has higher perplexity.
The thing to watch is the delta, not the absolute number. A good INT4 quantization typically raises perplexity by a small fraction (often less than 1%) over the FP16 baseline. A bad quantization, or one pushed too far in bit width, can raise it by several percent, which shows up as the model making more mistakes, losing coherence over long outputs, or fumbling structured tasks like code and JSON.
Perplexity is a proxy, not the whole story. Two methods can have similar perplexity but behave differently on reasoning or instruction following. Always sanity-check a quantized model on tasks you actually care about, not just the perplexity number.
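If you want to compute the delta yourself, perplexity is just the exponential of the average negative log-likelihood per token. The sketch below assumes you already have per-token natural-log probabilities from your evaluation harness; the function names are hypothetical and the numbers in the usage lines are placeholders, not measurements.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_increase_pct(baseline_ppl, quantized_ppl):
    """Perplexity increase of the quantized model over the baseline, in percent."""
    return (quantized_ppl - baseline_ppl) / baseline_ppl * 100


# Placeholder log-probabilities standing in for real evaluation output
baseline = perplexity([-1.60, -1.58, -1.63, -1.61])
quantized = perplexity([-1.61, -1.59, -1.64, -1.62])
print(f"delta: {relative_increase_pct(baseline, quantized):.2f}%")
```

Run both models over the same held-out text with the same tokenization, otherwise the two numbers are not comparable.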
## The Methods: GPTQ, AWQ, and Friends
How you choose which weights to round which way is what separates the methods. Naive rounding (just round every weight to the nearest representable integer level) works for INT8 but falls apart at INT4. The good methods are smarter about it.
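A quick experiment shows why naive rounding degrades so sharply as the bit width drops. This sketch applies plain round-to-nearest with a single shared scale to synthetic weights (random values plus one deliberate outlier, all invented for the demo) and compares the mean error at 8 and 4 bits. The function names are illustrative only.

```python
import random


def rtn_symmetric(weights, bits):
    """Round-to-nearest with one shared scale set by the largest magnitude."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def mean_abs_error(weights, bits):
    """Average absolute difference between original and rounded weights."""
    restored = rtn_symmetric(weights, bits)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)


rng = random.Random(0)
# Synthetic weights: mostly small values plus a single outlier at 0.5
weights = [rng.gauss(0.0, 0.02) for _ in range(256)] + [0.5]
err8 = mean_abs_error(weights, 8)
err4 = mean_abs_error(weights, 4)
print(f"INT8 mean error: {err8:.5f}")
print(f"INT4 mean error: {err4:.5f}")
```

With only 15 levels available, the outlier forces a step size so coarse that most of the small weights collapse to zero. Per-group scales and the smarter methods below exist to avoid exactly that.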
**GPTQ (Generative Pre-trained Transformer Quantization)** quantizes weights one layer at a time, using a small calibration dataset to figure out how rounding each weight will affect the layer's output. It then adjusts the remaining weights to compensate for the error introduced by the ones it already rounded. This second-order, error-correcting approach makes GPTQ very good at 4-bit. It needs a calibration pass (a few hundred sample sequences), which takes minutes to a couple of hours depending on model size, but it only happens once.
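The error-compensation idea can be shown on a toy layer with just two weights. This is not GPTQ itself, which works with the layer's Hessian, processes weights in blocks, and goes on to quantize the remaining weights too. It is only a sketch of the core move: round one weight, then nudge the other so the layer's output on the calibration inputs changes as little as possible. All names and numbers are made up.

```python
def layer_output(w, row):
    """Dot product of a weight vector with one calibration input."""
    return sum(wi * xi for wi, xi in zip(w, row))


def output_error(w, w_hat, X):
    """Squared output difference summed over the calibration inputs."""
    return sum((layer_output(w, row) - layer_output(w_hat, row)) ** 2 for row in X)


def quantize_first(w, X, step, compensate):
    """Round w[0] to the grid; optionally adjust w[1] to absorb the error."""
    w1, w2 = w
    q1 = round(w1 / step) * step
    if compensate:
        err = w1 - q1
        cross = sum(x1 * x2 for x1, x2 in X)
        power = sum(x2 * x2 for _, x2 in X)
        # Least-squares update: the shift in w2 that best cancels err
        w2 = w2 + err * cross / power
    return [q1, w2]


X = [(1.0, 0.9), (0.5, 0.6), (-1.0, -0.8), (2.0, 1.7)]  # toy calibration inputs
w = [0.33, -0.12]
plain = output_error(w, quantize_first(w, X, 0.1, compensate=False), X)
fixed = output_error(w, quantize_first(w, X, 0.1, compensate=True), X)
print(f"without compensation: {plain:.6f}")
print(f"with compensation:    {fixed:.6f}")
```

The compensation helps most when the inputs to the two weights are correlated, as they are in this toy data. That is also why the calibration set should resemble your real traffic.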
**AWQ (Activation-aware Weight Quantization)** starts from the observation that not all weights matter equally. A small fraction of weights, the ones multiplied by large activations, carry most of the model's behavior. AWQ identifies these salient weights using activation statistics and protects them by scaling, so the important channels keep their precision while the rest get quantized aggressively. AWQ is fast to apply and tends to preserve quality well, especially for instruction-tuned models.
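Here is a rough sketch of why scaling protects a salient weight. It assumes a 4-bit group whose step size is fixed by its largest weight (0.8 here, an invented value), and a scaling factor of 2 chosen by hand. Real AWQ searches for per-channel scales using activation statistics and folds the inverse scale into the preceding operation. The sweep averages over many possible values of the salient weight, because rounding error for any single value is a matter of luck.

```python
def rtn(value, step, qmax=7):
    """Round-to-nearest onto a symmetric integer grid with the given step."""
    code = max(-qmax, min(qmax, round(value / step)))
    return code * step


def mean_error(scale_factor, step, values):
    """Mean absolute error when a weight is scaled up before rounding
    and scaled back down afterwards."""
    errors = [abs(rtn(v * scale_factor, step) / scale_factor - v) for v in values]
    return sum(errors) / len(errors)


step = 0.8 / 7  # shared step size of the group (largest weight assumed to be 0.8)
# Sweep candidate salient weights small enough that doubling them
# does not change the group's maximum
salient_values = [i / 1000 for i in range(-400, 401)]
plain = mean_error(1.0, step, salient_values)
protected = mean_error(2.0, step, salient_values)
print(f"mean error, unscaled:  {plain:.4f}")
print(f"mean error, scaled x2: {protected:.4f}")
```

On average the scaled weight lands on a grid that is effectively twice as fine, so its error is about half. Because that weight is multiplied by large activations, halving its error matters far more than the same saving on an ordinary weight.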
**SmoothQuant** targets a different problem: quantizing activations, not just weights. Activations have outliers that make them hard to quantize. SmoothQuant mathematically shifts the difficulty from activations into weights (which are easier to quantize), enabling INT8 quantization of both. It is most relevant when you want to quantize activations for throughput, not just shrink weights.
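The "shifting" is a per-channel rescaling that leaves the layer's output unchanged. A minimal sketch, following the scale formula from the SmoothQuant paper with a migration strength of 0.5: each activation channel is divided by a factor and the matching weight row is multiplied by it. The matrices are toy values and the helper names are invented.

```python
def smooth_scales(act_absmax, weight_absmax, alpha=0.5):
    """Per-channel factors: s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    return [a ** alpha / w ** (1 - alpha) for a, w in zip(act_absmax, weight_absmax)]


def matmul(X, W):
    """Plain matrix product of X (rows x in) and W (in x out)."""
    return [[sum(x * W[k][j] for k, x in enumerate(row)) for j in range(len(W[0]))]
            for row in X]


X = [[0.5, 40.0, -0.3], [-0.2, -35.0, 0.4]]    # activations: channel 1 has outliers
W = [[0.2, -0.1], [0.05, 0.02], [-0.3, 0.25]]  # weights: in_features x out_features

act_absmax = [max(abs(row[k]) for row in X) for k in range(3)]
weight_absmax = [max(abs(v) for v in W[k]) for k in range(3)]
s = smooth_scales(act_absmax, weight_absmax)

X_smooth = [[x / s[k] for k, x in enumerate(row)] for row in X]
W_smooth = [[v * s[k] for v in W[k]] for k in range(3)]

print("largest activation before:", max(act_absmax))
print("largest activation after: ", max(abs(x) for row in X_smooth for x in row))
```

The product `matmul(X_smooth, W_smooth)` matches `matmul(X, W)` up to floating point error, but the activation outliers are now far smaller, so both tensors fit an INT8 range with less rounding damage.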
**FP8 (8-bit floating point)** is a different approach entirely. Instead of integers, it uses a floating point format with 8 bits. FP8 keeps a wider dynamic range than INT8, which makes it forgiving and well suited to newer hardware that supports it natively. It is increasingly the default for high-throughput serving on modern accelerators.
For most people quantizing a model to run it: use GPTQ or AWQ for 4-bit, and reach for GGUF (below) if you are running on llama.cpp.
## GGUF: The Format, Not the Method
GGUF is the most commonly confused term, because it is not a quantization method at all. It is a file format. GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp, the popular C++ inference engine that runs models on CPUs, Apple Silicon, and consumer GPUs.
A GGUF file packages the model weights, the tokenizer, and metadata into a single file, and it can hold weights at various quantization levels. The quantization schemes inside GGUF have their own naming convention that trips everyone up:
- **Q4_K_M** means 4-bit, "K-quant" method, medium size. The K-quants use a mix of bit widths across different parts of the model to balance quality and size.
- **Q5_K_M** is 5-bit, larger and higher quality than Q4.
- **Q8_0** is 8-bit, very close to the original.
- **Q2_K** is 2-bit, smallest and lowest quality.
The practical recommendation for most users is **Q4_K_M**. It hits a good balance of size and quality and is the most widely used GGUF variant. If you have memory to spare and want a bit more quality, go to Q5_K_M. If you are desperate for memory, Q3_K_M still works reasonably for larger models.
So when someone says "I downloaded the GGUF," they mean they downloaded a file in the llama.cpp format, quantized to some level indicated by the Q-name in the filename. GGUF (the format) and GPTQ (the method) are answers to different questions: GGUF is "what file format," GPTQ is "how were the weights quantized."
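If you manage a folder of downloads, it can help to read the Q-name out of the filename programmatically. The helper below is hypothetical and relies on the common community naming convention (for example `model-Q4_K_M.gguf`), which the GGUF format itself does not enforce. It only handles the basic `Q<n>_K`, `Q<n>_K_S/M/L`, `Q<n>_0`, and `Q<n>_1` patterns, and the number it returns is the nominal bit width, not the exact bits per weight.

```python
import re

# Matches names such as Q4_K_M, Q5_K_S, Q2_K, Q8_0 right before ".gguf"
_QUANT_RE = re.compile(r"(Q(\d)_(?:K(?:_[SML])?|0|1))\.gguf$", re.IGNORECASE)


def parse_gguf_quant(filename):
    """Return (nominal_bits, quant_name) or None if no known tag is found."""
    match = _QUANT_RE.search(filename)
    if match is None:
        return None
    return int(match.group(2)), match.group(1).upper()


print(parse_gguf_quant("model-Q4_K_M.gguf"))
print(parse_gguf_quant("model-f16.gguf"))
```

The authoritative quantization type lives in the file's metadata, so treat the filename as a hint only.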
## Quantizing a Model With llama.cpp
Here is how to take a model from Hugging Face and produce a Q4_K_M GGUF file you can run locally. This uses llama.cpp's tooling.
```bash
# Clone and build llama.cpp (recent versions build with CMake, not plain make)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

# Install the Python conversion dependencies
pip install -r requirements.txt

# Convert the Hugging Face model to an unquantized 16-bit GGUF first
python convert_hf_to_gguf.py /path/to/model \
  --outfile model-f16.gguf \
  --outtype f16

# Quantize the f16 GGUF down to Q4_K_M
./build/bin/llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```
That last command produces the quantized file. You can then run it directly:
```bash
./build/bin/llama-cli -m model-Q4_K_M.gguf -p "Explain quantization in one sentence." -n 128
```
The conversion to f16 GGUF is essentially a format change: weights that were already stored in 16 bits lose little or nothing. The quantize step is where the actual bit reduction happens, and it takes a few minutes for a 7B model.
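If you quantize more than one model, you may want to script the two steps. The sketch below only assembles the same two commands as argument lists and runs them with `subprocess`. The function names are made up, and the default paths for the conversion script and the `llama-quantize` binary are assumptions: point them at wherever they live in your checkout or build directory.

```python
import subprocess


def build_commands(model_dir, out_prefix, quant_type="Q4_K_M",
                   convert_script="convert_hf_to_gguf.py",
                   quantize_bin="llama-quantize"):
    """Return the convert and quantize commands as argument lists."""
    f16_path = f"{out_prefix}-f16.gguf"
    quant_path = f"{out_prefix}-{quant_type}.gguf"
    convert = ["python", convert_script, model_dir,
               "--outfile", f16_path, "--outtype", "f16"]
    quantize = [quantize_bin, f16_path, quant_path, quant_type]
    return convert, quantize


def run_pipeline(model_dir, out_prefix, **kwargs):
    """Run both steps in order, stopping at the first failure."""
    for cmd in build_commands(model_dir, out_prefix, **kwargs):
        subprocess.run(cmd, check=True)


convert_cmd, quantize_cmd = build_commands("/path/to/model", "model")
print(" ".join(convert_cmd))
print(" ".join(quantize_cmd))
```

Keeping the command construction separate from execution makes it easy to print and inspect the commands before committing to a long conversion.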
## Quantizing a Model With AutoGPTQ
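If you want a GPTQ-quantized model for GPU inference (with vLLM, TGI, or Transformers), AutoGPTQ is a widely used tool. Check the project's README before you start, since it now points new users to GPTQModel as its successor. GPTQ needs a calibration dataset, which is just a list of representative text samples.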
```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit config with group size 128 (a common, well-tested setting)
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# Calibration data: a few hundred representative samples.
# load_my_sample_texts() is a placeholder for your own function
# that returns a list of strings.
calibration = [tokenizer(text) for text in load_my_sample_texts()]

# Run the GPTQ quantization pass
model.quantize(calibration)

# Save the 4-bit model
model.save_quantized("llama-3.1-8b-gptq-4bit")
tokenizer.save_pretrained("llama-3.1-8b-gptq-4bit")
```
The `group_size` parameter controls how many weights share a scale factor. Smaller groups (for example 64 or 32 instead of 128) preserve more quality at the cost of slightly more memory for the extra scale factors. The `desc_act` option reorders quantization by activation magnitude, which can improve quality but slows inference, so most people leave it off.
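You can estimate what the group size costs in memory. This sketch assumes each group stores one 16-bit scale and one 4-bit zero point alongside its 4-bit weights. That layout is an assumption for illustration and the exact overhead depends on the tool and its packing, so treat the result as a ballpark.

```python
def effective_bits_per_weight(bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Weight bits plus the per-group metadata amortized over the group."""
    return bits + (scale_bits + zero_bits) / group_size


for g in (128, 64, 32):
    print(f"group_size={g}: {effective_bits_per_weight(group_size=g):.3f} bits per weight")
```

Under these assumptions the overhead stays small even at a group size of 32, which is why shrinking the group is a cheap way to buy back quality on sensitive models.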
Once saved, the model loads in vLLM or Transformers like any other model, and inference runs at roughly the memory footprint you would expect from 4-bit weights.
## Choosing for Your Situation
A short decision guide:
- **Running locally on a Mac or CPU?** Use llama.cpp with a GGUF file, Q4_K_M as the default.
- **Serving on a GPU with vLLM or TGI?** Use GPTQ or AWQ at 4-bit. AWQ often edges out GPTQ on instruction-tuned models and is faster to apply.
- **Want maximum safety with moderate savings?** INT8 (or FP8 on supported hardware) barely touches quality.
- **Squeezing a very large model onto limited memory?** INT4 on the large model usually beats a smaller model at higher precision.
- **Quantizing a small model (7B or under)?** Be more conservative. These models feel the loss more, so prefer Q5_K_M, or a careful 4-bit method such as AWQ, over the most aggressive settings.
Whatever you pick, test the quantized model on your real workload before shipping it. Benchmarks and perplexity are guides, not guarantees.
## Where Quantization Fits in Production Inference
Quantization is a memory and bandwidth optimization. It exists largely because reading model weights from GPU memory is the bottleneck during generation, and smaller weights mean less data to move per token. That is why a 4-bit model often generates faster than the same model in FP16: not because the math is cheaper, but because there is less to read.
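A back-of-the-envelope model makes the bandwidth argument concrete. If every generated token requires reading all the weights once, memory bandwidth sets a ceiling on single-stream speed. The 1000 GB/s figure below is a made-up round number for a hypothetical accelerator, and the estimate ignores KV cache reads, compute time, and batching, so real throughput will be lower.

```python
def max_tokens_per_second(weight_gb, bandwidth_gb_per_s):
    """Upper bound on single-stream decode speed when weight reads dominate."""
    return bandwidth_gb_per_s / weight_gb


BANDWIDTH = 1000.0  # GB/s, hypothetical accelerator
fp16_rate = max_tokens_per_second(140.0, BANDWIDTH)  # 70B model at FP16
int4_rate = max_tokens_per_second(35.0, BANDWIDTH)   # same model at INT4
print(f"FP16 ceiling: {fp16_rate:.1f} tokens/s")
print(f"INT4 ceiling: {int4_rate:.1f} tokens/s")
```

The ceiling scales inversely with the weight size, so a 4x smaller model raises it by 4x regardless of the bandwidth figure you plug in.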
At General Compute we think about that bottleneck from the hardware up. Our inference runs on custom ASICs designed around the memory bandwidth problem that quantization works around on GPUs, so the baseline serving speed is already high before any quantization tricks are applied. We support quantized open models so you get the memory savings and the speed of purpose-built inference hardware at the same time.
If you want to run open models like Llama, Qwen, or DeepSeek at production speed without managing quantization and infrastructure yourself, [sign up at generalcompute.com](https://generalcompute.com) and get $100 in free credit to try it.
## References
- [GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers](https://arxiv.org/abs/2210.17323)
- [AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://arxiv.org/abs/2306.00978)
- [SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models](https://arxiv.org/abs/2211.10438)
- [llama.cpp](https://github.com/ggerganov/llama.cpp) and the GGUF format
- [AutoGPTQ](https://github.com/AutoGPTQ/AutoGPTQ)