Agent Readout

Distillation for Inference: How Smaller Models Learn From Larger Ones

A practical guide to knowledge distillation for production inference: what actually works, what to skip, and how to ship a smaller model without losing the behavior you cared about.

Author: General Compute
Published: 2026-04-30
Tags: distillation, inference, model compression, training, production

If you serve LLMs in production, you have probably stared at the same tradeoff for a while now. The big model is good. The big model is also slow and expensive. The small model is fast and cheap, but it makes the kind of mistakes your users notice immediately. Distillation is the standard answer to this gap, and it actually works, but the literature on it is a mess of techniques that sound similar and behave very differently in practice.

This post is about what distillation looks like once you stop reading papers and start running training jobs. We will cover the different flavors of distillation, when each one is the right tool, and the operational details that decide whether the distilled model ends up in production or in a Slack thread titled "why we paused this project."

## What distillation actually is

Knowledge distillation is the practice of training a smaller "student" model to imitate a larger "teacher" model. The teacher has already been trained, usually at considerable cost. The student gets to skip most of that work and instead learn from the teacher's outputs, which are richer than the raw labels in your dataset.

The original framing from Hinton, Vinyals, and Dean in 2015 is still the cleanest way to think about it. A normal classifier sees a label like "this image is a cat" and learns to push probability toward "cat." A distilled student sees the teacher's full output distribution: 87% cat, 9% lynx, 3% dog, 1% everything else. Those soft targets carry information about how the teacher organizes the world. The student learns not just what the answer is, but how confident the teacher is and what the plausible alternatives are.

For LLMs the story is the same, just over vocabularies of 100k tokens instead of 1k image classes. Every time the teacher predicts a next token, it produces a probability distribution over the entire vocabulary. The student tries to match that distribution. This is far more informative than just training the student on the argmax token, because it teaches the student where uncertainty lives.
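
To make the soft-target idea concrete, here is a tiny sketch with made-up logits over a toy five-token vocabulary (the numbers are illustrative, not from any real model):

```python
import torch
import torch.nn.functional as F

# Hypothetical teacher logits for one next-token prediction over a toy
# five-token vocabulary. Purely illustrative numbers.
teacher_logits = torch.tensor([4.0, 1.5, 0.5, -1.0, -2.0])

hard_target = teacher_logits.argmax()               # all a plain label gives you
soft_targets = F.softmax(teacher_logits, dim=-1)    # the teacher's full distribution
softened = F.softmax(teacher_logits / 4.0, dim=-1)  # temperature 4 spreads the mass

# The softened distribution keeps the teacher's ranking of alternatives but
# exposes much more signal about the low-probability tokens than the argmax does.
print(hard_target, soft_targets, softened)
```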

## The three flavors that matter in practice

Distillation comes in many variants in the literature. In production you will mostly run into three of them.

### Response-based distillation

The student is trained to match the teacher's output token distributions. You run the teacher across a corpus of prompts, save the logits or top-k probabilities, then train the student with a KL divergence loss against those distributions. This approach is sometimes called soft-target distillation or logit distillation.

This is the workhorse approach for LLMs. It is cheap to set up, it composes with normal language modeling losses, and the data you generate (prompt plus teacher distribution) is reusable across multiple student training runs.

### Feature-based distillation

The student is trained to match the teacher's intermediate activations, not just the output. You pick layers in the teacher and corresponding layers in the student, and add an MSE loss between their hidden states. The "FitNets" paper from 2014 introduced this, and variants have appeared regularly since.

This works well when the student architecture is similar to the teacher's, just narrower or shallower. It struggles when the architectures diverge, because there is no natural correspondence between layers. For most LLM distillation projects you can ignore this until response-based distillation has stopped giving you gains.
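
For reference, a minimal sketch of what the feature loss can look like, assuming you pick the layer pairing by hand and add a learned projection when the student is narrower (both of those choices are mine, not prescribed here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillLoss(nn.Module):
    """MSE between chosen teacher/student hidden states.

    layer_map pairs (student_layer_idx, teacher_layer_idx); the linear
    projection handles a narrower student hidden size.
    """

    def __init__(self, student_dim, teacher_dim, layer_map):
        super().__init__()
        self.layer_map = layer_map
        self.proj = nn.Linear(student_dim, teacher_dim, bias=False)

    def forward(self, student_hidden_states, teacher_hidden_states):
        loss = 0.0
        for s_idx, t_idx in self.layer_map:
            loss = loss + F.mse_loss(
                self.proj(student_hidden_states[s_idx]),
                teacher_hidden_states[t_idx].detach(),  # no gradients into the teacher
            )
        return loss / len(self.layer_map)
```

With Hugging Face models the hidden states come from calling the model with `output_hidden_states=True`; the layer pairing itself is the part with no natural answer once the architectures diverge.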

### On-policy distillation

The student generates its own outputs, and the teacher scores them. The student is trained to make outputs that the teacher rates highly. This is essentially RLHF with the teacher acting as the reward model. It is more expensive than response-based distillation because you need to run the student during training and then run the teacher on the student's outputs, but it directly optimizes for behaviors the student can actually produce.

This matters more than it sounds. In response-based distillation, the student is trained on the teacher's continuations of prompts. But the student, once deployed, will be continuing its own previous tokens, not the teacher's. There is a distributional mismatch between training and inference. On-policy distillation closes that gap.
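
Mechanically, one simple variant looks like the sketch below: the student samples its own continuations, and the student is trained with a token-level KL against the teacher on those tokens. Function and parameter names are illustrative, and RL-style variants that turn the teacher's score into a scalar reward are also common.

```python
import torch
import torch.nn.functional as F

def on_policy_step(student, teacher, prompt_ids, max_new_tokens=128):
    # 1. The student generates continuations of the prompts (its own policy).
    with torch.no_grad():
        sequences = student.generate(
            prompt_ids, max_new_tokens=max_new_tokens, do_sample=True
        )

    # 2. The teacher scores those student-generated tokens (no teacher gradients).
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(sequences).logits, dim=-1)

    # 3. The student is pushed toward the teacher's preferences on sequences it
    #    actually produces, closing the train/inference mismatch described above.
    student_log_probs = F.log_softmax(student(sequences).logits, dim=-1)
    return F.kl_div(
        student_log_probs.view(-1, student_log_probs.size(-1)),
        teacher_probs.view(-1, teacher_probs.size(-1)),
        reduction="batchmean",
    )
```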

## What you actually distill

The framing of "match the teacher's outputs" hides a real decision: which outputs?

You can distill on:

- **Your existing training data.** Run the teacher over your prompts, capture distributions, train the student. This is the simplest case.
- **Synthetic data the teacher generates.** Have the teacher complete prompts (real or templated), and use those completions plus distributions as training data. Most modern small models that punch above their weight, including the Phi family and several Qwen sizes, lean heavily on this. The teacher both produces the inputs and provides the supervision.
- **Targeted distributions.** If you care about specific behaviors (math, JSON output, refusals, tool calls), generate prompts that exercise those behaviors and distill on those. This is where distillation stops being a generic compression technique and starts being a behavior transfer technique.

The "targeted" version is the one that gets the most bang for the buck in production. If you have a 70B model that handles your customer support queries well and you want to ship a 7B replacement, you should not distill on a generic web corpus. You should distill on the actual distribution of queries you serve, plus edge cases and adversarial inputs, with the teacher producing high-quality responses you can train against.

## A minimal distillation loop

Here is the shape of a basic response-based distillation training step. Real implementations have more bookkeeping, but this is the core.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, batch, temperature=2.0, alpha=0.5):
    input_ids = batch["input_ids"]
    labels = batch["labels"]

    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits

    student_logits = student(input_ids).logits

    # Distillation loss: student matches the teacher's softened distribution.
    # Flattening to (tokens, vocab) makes "batchmean" a per-token average, so the
    # KL term stays on the same scale as the cross-entropy term below. Real
    # implementations also mask out prompt and padding positions here.
    vocab_size = student_logits.size(-1)
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1).view(-1, vocab_size)
    soft_log_preds = F.log_softmax(student_logits / temperature, dim=-1).view(-1, vocab_size)
    distill_loss = F.kl_div(
        soft_log_preds, soft_targets, reduction="batchmean"
    ) * (temperature ** 2)

    # Standard language modeling loss against the real labels. Assumes labels are
    # already shifted to align with the logits, with -100 on ignored positions.
    lm_loss = F.cross_entropy(
        student_logits.view(-1, vocab_size),
        labels.view(-1),
        ignore_index=-100,
    )

    return alpha * distill_loss + (1 - alpha) * lm_loss
```

A few things worth noting. The temperature parameter softens the teacher's distribution, which exposes more information about the relative ordering of low-probability tokens. The `temperature ** 2` factor keeps the gradient magnitude comparable to the unscaled cross-entropy term. The `alpha` controls the mix between matching the teacher and matching the ground-truth labels, and you usually want both, because pure distillation can let the student inherit the teacher's mistakes.

In practice you will not run the teacher live during training unless your teacher is small enough that it fits on the same accelerators alongside the student. For real LLM distillation you precompute the teacher's outputs (top-k logits or full distributions) and store them. The training loop then reads those from disk. This trades storage for compute, and at scale it is almost always the right tradeoff.
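
A sketch of that precompute step, assuming you keep only the top-k probabilities per position to keep storage manageable (the choice of k and the storage format are assumptions here, not prescriptions):

```python
import torch

@torch.no_grad()
def precompute_teacher_targets(teacher, dataloader, k=64, out_path="teacher_topk.pt"):
    """Run the teacher once over the corpus and store top-k token probabilities."""
    records = []
    for batch in dataloader:
        logits = teacher(batch["input_ids"]).logits   # [batch, seq, vocab]
        probs = torch.softmax(logits.float(), dim=-1)
        top = probs.topk(k, dim=-1)
        records.append({
            "input_ids": batch["input_ids"].cpu(),
            "topk_probs": top.values.half().cpu(),     # fp16 to save disk
            "topk_ids": top.indices.to(torch.int32).cpu(),
        })
    torch.save(records, out_path)
```

At training time the KL term then only covers those k tokens; the leftover probability mass is either renormalized away or lumped into a single bucket, and both are approximations you accept in exchange for not storing full-vocabulary distributions.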

## How to choose the student architecture

The student architecture decides almost everything about the final tradeoff. Some practical guidance.

**Match the teacher's tokenizer.** If the student uses a different tokenizer, you cannot do logit distillation directly because the vocabulary spaces do not align. There are workarounds, but they are painful and lossy. Pick a student that shares the teacher's tokenizer and you avoid an entire class of problems.
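
A quick way to check this up front, using placeholder model names:

```python
from transformers import AutoTokenizer

teacher_tok = AutoTokenizer.from_pretrained("your-org/teacher-70b")  # placeholder
student_tok = AutoTokenizer.from_pretrained("your-org/student-7b")   # placeholder

# If the vocabularies (and therefore token IDs) do not match exactly, logit
# distillation against stored teacher distributions will not line up.
print("vocab match:", teacher_tok.get_vocab() == student_tok.get_vocab())
```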

**Pick a student that is already good.** The student's pre-training matters more than people expect. Distilling onto a randomly initialized small model is much harder than distilling onto a well pre-trained small model. Start from a strong checkpoint of the size class you want, then distill on top.

**Aim for 5x to 20x compression.** Below 5x compression, you might as well just quantize the teacher and call it done. Above 20x compression, the student's capacity is so much smaller than the teacher's that distillation alone usually cannot close the gap. The sweet spot is somewhere in between, and it is the regime where most production wins happen: distill a 70B teacher into a 7B student, or a 7B teacher into a 1B student.

**Width vs depth matters.** A student that is shallower than the teacher loses reasoning depth. A student that is narrower loses representational capacity. For LLMs, narrower-but-similar-depth students tend to preserve behavior better than shallower-but-similar-width students. Reasoning seems to live in the depth.

## Where distillation actually wins

Distillation has clear sweet spots in production.

It works very well for **task specialization.** If you have one model that needs to handle one narrow domain (extracting fields from invoices, classifying support tickets, summarizing meeting transcripts), distillation from a frontier teacher onto a small student can preserve almost all the relevant quality at a fraction of the latency and cost. The student does not need to know how to write poetry or do calculus. It just needs to do the one thing.

It works well for **format and style transfer.** If your teacher has been carefully tuned to produce JSON in a specific schema, or to refuse certain queries in a specific tone, distillation can transplant that behavior to a smaller model more reliably than re-doing the tuning from scratch on the small model.

It works well as a **cost-cutting move on a deployed model.** When you already serve a large model and have collected real production traffic, you can use that traffic as the distillation dataset. The student is trained on exactly the distribution it will see in production. This is one of the highest-ROI uses of distillation that exists.

## Where distillation breaks down

Distillation does not magically make a 1B model as smart as a 70B model on every task. Specifically, it tends to fall short in a few places.

**Long-context reasoning.** Tasks that require holding a lot of state in working memory and chaining many steps of inference seem to need raw capacity. Distillation can transfer surface behavior but not always the underlying reasoning depth.

**Out-of-distribution inputs.** The student inherits the teacher's strengths on the distillation distribution. On inputs that fall outside that distribution, the student often degrades faster than the teacher would. This is why your distillation dataset matters so much.

**Calibration.** Distilled students often end up overconfident, especially when distilled from teachers that were themselves overconfident. If your application depends on calibrated probabilities (routing, abstention, threshold-based decisions), measure calibration on the student before you ship.
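
One straightforward measurement is expected calibration error over binned confidences; a minimal sketch, assuming you already have per-example confidences and correctness flags from your eval set:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-frequency-weighted |accuracy - mean confidence| over confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return ece
```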

## Combining distillation with other techniques

Distillation does not exist in isolation. In production you will usually stack it with other compression techniques.

**Distillation then quantization.** Distill a 70B teacher to a 7B student, then quantize the student to 4-bit or FP8. The quality drop from quantization is usually small if the student was well-trained, and the combined speedup is much larger than either technique alone.
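
With the Hugging Face stack, loading the distilled student in 4-bit is a one-config change; a sketch with a placeholder model name (re-run your evals on the quantized weights, not just the full-precision ones):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

student_4bit = AutoModelForCausalLM.from_pretrained(
    "your-org/student-7b-distilled",   # placeholder model name
    quantization_config=quant_config,
    device_map="auto",
)
```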

**Distillation as a starting point for fine-tuning.** Distill a general student from a general teacher, then fine-tune the student on your domain data. The distilled checkpoint is a much better starting point than a random small model.

**Distillation plus speculative decoding.** A distilled small model is an unusually good draft model for speculative decoding against its teacher. Because they share representations, the acceptance rate of the small model's drafts is higher than if you used an unrelated small model.
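
With Hugging Face transformers, using the distilled student as the draft model is a single argument at generation time; a sketch assuming `teacher` and `student` are already loaded and share a tokenizer:

```python
# teacher and student are already-loaded causal LMs sharing a tokenizer;
# input_ids is a tokenized prompt batch.
outputs = teacher.generate(
    input_ids,
    max_new_tokens=256,
    assistant_model=student,   # the distilled student drafts, the teacher verifies
)
```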

## Operational details that matter

A few things that nobody warns you about until you have shipped a distilled model and watched it misbehave.

**Distill on conversation traces, not just single turns.** If your model is used in multi-turn settings (chat, agents, tool use), distill on full traces rather than single prompt-response pairs. Single-turn distillation tends to produce models that handle the first turn well and degrade across turns.

**Watch for capability collapse.** If you distill heavily on one task, the student can lose abilities the teacher had on other tasks. Mix in some general data even if you only care about one workload, or be prepared for surprise regressions when users do something off-script.

**Evaluate on your real metrics, not benchmarks.** A distilled model can score the same as the teacher on MMLU and behave noticeably worse on the specific thing your users do. Build evals from your own production traffic and grade against those.

**Iterate on the dataset, not the loss.** Most of the meaningful gains in distillation come from improving the data, not from tweaking the loss function. New temperature schedule? Probably won't move the needle. More targeted prompts that exercise the failure modes you care about? Will move it a lot.

## Putting it together

If you are starting a distillation project today, the path that works most reliably is something like this. Pick a strong pre-trained student in the size class you want. Use the same tokenizer as the teacher. Build a distillation dataset from your real traffic plus targeted synthetic data for behaviors you want to preserve. Train with a mix of distillation loss and standard language modeling loss against any ground-truth labels you have. Evaluate on your own production metrics, not benchmarks. Quantize the result.

The output is a model that is 5x to 10x faster and cheaper than the teacher, behaves the way the teacher behaves on the workloads you care about, and runs cleanly on infrastructure built for inference rather than training. That is a genuinely useful position to be in, and it is reachable with techniques that have been stable for a while. The hard part is not the algorithm, it is the dataset and the evaluation discipline around it.

If you want to put a distilled model into production with serving infrastructure designed for low-latency inference, [General Compute's API](https://generalcompute.com) is built for exactly this kind of workload: small fast models running on hardware optimized for throughput and tail latency, with an OpenAI-compatible interface so you can drop in your distilled model behind your existing client code. The model is the artifact, the serving stack is what makes it useful.