What Is Direct Preference Optimization (DPO)? Explained Simply
DPO aligns language models to human preferences without a separate reward model or reinforcement learning. Here is how it works, how it compares to RLHF, and when to reach for IPO, KTO, or ORPO instead.
- Author: General Compute
- Published: 2026-06-02
- Tags: fine-tuning, alignment, dpo, rlhf
If you have trained or fine-tuned a language model in the last couple of years, you have probably run into the alignment problem: a base model predicts plausible next tokens, but plausible is not the same as helpful, honest, or safe. Getting a model to behave the way people actually want it to behave is a separate step, and for a long time the standard answer was reinforcement learning from human feedback (RLHF). DPO, or Direct Preference Optimization, is a simpler way to reach roughly the same place.
This post explains what DPO is, the intuition behind the math, how it stacks up against RLHF, and the family of variants (IPO, KTO, ORPO) that have grown up around it. The goal is to give you a working mental model, not a full derivation, so you can decide whether DPO belongs in your training pipeline.
## The Problem DPO Solves
Suppose you have a base model and a dataset of human preferences. Each example is a prompt with two responses, where a human labeled one as better than the other. You want to nudge the model toward producing the kind of response humans prefer.
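To make the data shape concrete, here is a minimal sketch of what such a dataset might look like in memory. The field names `prompt`, `chosen`, and `rejected` are an assumption for illustration (a common convention, not something this post prescribes), and `is_valid_pair` is a hypothetical helper.

```python
# Hypothetical preference records. The field names are an assumption,
# chosen only for illustration.
preference_pairs = [
    {
        "prompt": "Explain what a mutex is in one sentence.",
        "chosen": "A mutex is a lock that lets only one thread at a time "
                  "enter a critical section.",
        "rejected": "Mutexes are a thing in programming.",
    },
    {
        "prompt": "Give me a polite way to decline a meeting.",
        "chosen": "Thanks for the invite. I can't make it this time, "
                  "but I'm happy to follow up by email.",
        "rejected": "No.",
    },
]


def is_valid_pair(record):
    """True if all three fields are non-empty strings and the responses differ."""
    fields = ("prompt", "chosen", "rejected")
    for name in fields:
        value = record.get(name)
        if not isinstance(value, str) or not value.strip():
            return False
    # A pair whose two responses are identical carries no preference signal.
    return record["chosen"] != record["rejected"]


clean_pairs = [record for record in preference_pairs if is_valid_pair(record)]
```

Filtering out malformed or identical pairs before training is a cheap sanity check, since every method discussed here learns only from the contrast between the two responses.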
The classic RLHF recipe does this in three stages:
1. **Supervised fine-tuning (SFT).** Fine-tune the base model on good demonstrations so it produces reasonable responses to begin with.
2. **Reward modeling.** Train a separate model that takes a prompt and a response and outputs a scalar score predicting how much a human would like it. This reward model learns from the preference pairs.
3. **Reinforcement learning.** Use an RL algorithm, usually Proximal Policy Optimization (PPO), to update the language model so it produces responses that the reward model scores highly, while a penalty keeps it from drifting too far from the SFT model.
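The training signals in stages 2 and 3 can be sketched with plain numbers. This is a simplified illustration, not a full RLHF implementation: it assumes the reward model is trained with the usual pairwise logistic loss, and that the drift penalty is applied as a per-sample log-ratio term. The function names and the `kl_coef` parameter are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def reward_model_pair_loss(score_chosen, score_rejected):
    """Stage 2 (assumed form): -log sigmoid(score_chosen - score_rejected).

    The loss is small when the reward model scores the human-preferred
    response above the rejected one.
    """
    return softplus(-(score_chosen - score_rejected))


def kl_penalized_reward(reward, policy_logp, ref_logp, kl_coef=0.1):
    """Stage 3 (assumed form): reward minus a penalty for leaving the SFT model.

    policy_logp and ref_logp are the log-probabilities that the current
    policy and the SFT reference assign to the sampled response.
    """
    return reward - kl_coef * (policy_logp - ref_logp)
```

The RL algorithm then tries to maximize the penalized reward, which is why both the reward model and the reference model have to stay loaded during training.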
This works, and it is how many well-known models were aligned. But it is a lot of moving parts. You are training two models, running an RL loop that is notoriously fiddly to stabilize, sampling fresh responses during training, and tuning a KL penalty so the policy does not collapse into reward-hacking gibberish. Every one of those pieces is a place where things can go wrong.
DPO asks a pointed question: if the whole point is to satisfy human preferences, do we actually need the reward model and the RL loop in between? The answer turns out to be no.
## The Core Idea Behind DPO
The key insight from the DPO paper (Rafailov et al., 2023) is that the RLHF objective has a closed-form solution connecting the optimal policy to the reward function. Because of that relationship, you can rearrange the math so the reward model disappears entirely. Instead of training a reward model and then optimizing against it, you can optimize the language model directly on the preference pairs with a single supervised-style loss.
Here is the intuition without the full derivation. In RLHF, the optimal policy that maximizes reward while staying close to a reference model can be written in terms of that reward. You can invert this: the reward associated with any response is, up to a term that depends only on the prompt, proportional to the log-ratio between your policy's probability of that response and the reference model's probability of it. That prompt-only term is the same for both responses in a pair, so it drops out whenever two responses are compared. So the reward is not something you need to learn separately. It is implicitly defined by how much more (or less) likely your model makes a given response compared to where it started.
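For readers who want the symbols, the relationship can be written out as follows. This is a summary of the standard result rather than a derivation. The notation is an assumption of this sketch: $\pi$ is the policy, $\pi_{\mathrm{ref}}$ the reference model, $r$ the reward, $\beta$ the strength of the KL penalty, and $Z(x)$ a normalizing constant that depends only on the prompt $x$.

```latex
% KL-regularized objective optimized in RLHF
\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}
  \big[ r(x, y) \big]
  \;-\; \beta \, \mathrm{KL}\big( \pi(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% Its closed-form optimum
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y \mid x) \,
  \exp\!\Big( \tfrac{1}{\beta} \, r(x, y) \Big),
\qquad
Z(x) \;=\; \sum_{y} \pi_{\mathrm{ref}}(y \mid x) \, \exp\!\Big( \tfrac{1}{\beta} \, r(x, y) \Big)

% Solving for the reward
r(x, y) \;=\; \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \;+\; \beta \log Z(x)
```

The last line is the inversion described above: the reward is a scaled log-ratio plus a term that does not depend on the response.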
Once you have that, the preference data gives you everything. For a chosen response and a rejected response to the same prompt, you want the implicit reward of the chosen one to be higher than the implicit reward of the rejected one. Plug the log-ratio expression into the standard preference model (the Bradley-Terry model, which says the probability a human prefers response A over B is a logistic function of the reward difference) and you get the DPO loss.
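Written out, that substitution looks roughly like this. The symbols are assumptions of this sketch: $y_w$ and $y_l$ are the chosen and rejected responses to prompt $x$, $\sigma$ is the logistic function, $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ is the frozen reference, and $\beta$ scales the log-ratio. Any part of the reward that depends only on the prompt cancels in the difference, which is why it does not appear in the final loss.

```latex
% Bradley-Terry preference model
p(y_w \succ y_l \mid x) \;=\; \sigma\big( r(x, y_w) - r(x, y_l) \big)

% Substituting the implicit reward gives the DPO loss
\mathcal{L}_{\mathrm{DPO}}(\theta) \;=\;
  - \, \mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\;
      \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```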
In plain terms, the loss does two things at once for every preference pair:
- It increases the probability the model assigns to the chosen response, relative to the reference model.
- It decreases the probability the model assigns to the rejected response, relative to the reference model.
The "relative to the reference model" part is what keeps the model from wandering off. A `beta` hyperparameter controls how tightly the model is held to the reference, playing the same role as the KL penalty coefficient in RLHF: a larger `beta` keeps the policy closer to the reference, and a smaller one lets it move further away.
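A small worked example may help show what `beta` does. The sketch below evaluates the loss for a single pair in plain Python, using made-up log-probabilities. The function name `pair_loss` and all the numbers are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pair_loss(policy_chosen_logp, policy_rejected_logp,
              ref_chosen_logp, ref_rejected_logp, beta):
    """Logistic loss on the scaled gap between the two log-ratios."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logit = beta * (chosen_logratio - rejected_logratio)
    return softplus(-logit)  # equals -log(sigmoid(logit))


# At the start of training the policy equals the reference, so both
# log-ratios are zero and the loss is log(2) whatever beta is.
loss_at_start = pair_loss(-13.0, -14.0, -13.0, -14.0, beta=0.1)

# Made-up numbers: the policy has moved the chosen response up by 1 nat
# and the rejected response down by 1 nat, a log-ratio gap of 2.
loss_small_beta = pair_loss(-12.0, -15.0, -13.0, -14.0, beta=0.1)
loss_large_beta = pair_loss(-12.0, -15.0, -13.0, -14.0, beta=0.5)

print(loss_at_start, loss_small_beta, loss_large_beta)
```

For the same amount of movement away from the reference, the larger `beta` yields the lower loss. The loss therefore saturates after less movement, leaving less pressure to drift further, which is one way to see why a larger `beta` keeps the policy closer to the reference.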
## What the Loss Looks Like
The DPO loss, averaged over a batch of preference pairs, can be written like this. In the code, "chosen" is the winning response (often written `y_w`) and "rejected" is the losing one (`y_l`):
```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Mean DPO loss over a batch of summed per-response log-probabilities."""
    # log-ratios between the policy and the frozen reference model
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # we want chosen to be favored over rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # logistic loss on the scaled gap, averaged over the batch
    loss = -F.logsigmoid(logits).mean()
    return loss
```
Each `*_logps` term is the sum of log-probabilities the model assigns to the tokens of that response, conditioned on the prompt. You compute them twice: once with the model you are training (the policy) and once with a frozen copy of the starting model (the reference). The reference model never updates, so its forward passes need no gradients. It just provides a baseline so the loss measures relative movement rather than absolute probability.
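As a rough illustration of where one of those numbers comes from, the sketch below sums per-token log-probabilities over a response. It uses plain Python lists to stay dependency-free, whereas a real training loop would do this as a batched tensor operation. The helper names and the toy logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of vocabulary logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]


def response_logp(step_logits, response_token_ids):
    """Sum of log p(token_t | prompt, earlier response tokens).

    step_logits[t] is the list of vocabulary logits the model produced for
    response position t. Prompt positions are assumed to be excluded
    already, so only response tokens contribute to the sum.
    """
    if len(step_logits) != len(response_token_ids):
        raise ValueError("need exactly one logit vector per response token")
    total = 0.0
    for logits, token_id in zip(step_logits, response_token_ids):
        total += log_softmax(logits)[token_id]
    return total


# Toy example: a 3-token response over a 4-token vocabulary.
toy_logits = [
    [2.0, 0.5, 0.1, -1.0],
    [0.0, 3.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
]
toy_response = [0, 1, 3]
print(response_logp(toy_logits, toy_response))
```

Running the same computation with the policy's logits and with the reference's logits, for both the chosen and the rejected response, yields the four inputs the loss needs.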
That is the whole thing. No reward model, no sampling during training, no RL machinery. You run forward passes on pairs of responses you already have, compute log-probabilities, and backpropagate a logistic loss. It looks and feels like ordinary supervised fine-tuning, which is exactly why it has become popular.
## DPO vs RLHF: The Practical Differences
The two approaches aim at the same target, but they differ in ways that matter day to day.
**Complexity.** RLHF runs three training stages and keeps multiple models live during the RL phase (policy, reference, reward model, and often a value model). DPO needs only two models, the policy initialized from the SFT model and a frozen reference copy of it, and a single training loop. Fewer components means fewer failure modes.
**Stability.** PPO is sensitive to hyperparameters, reward scaling, and the KL coefficient. It can diverge or collapse if you get them wrong. DPO's loss is an ordinary logistic loss on log-probability differences, with no sampling and no value estimates, so training behaves much more like standard supervised fine-tuning and tends to be more predictable.
**Compute and memory.** RLHF samples new responses from the policy during training, which is expensive, and holds several models in memory at once. DPO uses a fixed dataset of pairs and only two models. In practice DPO is cheaper to run and easier to fit on modest hardware.
**Online vs offline.** This is the real tradeoff. PPO is on-policy: it generates fresh responses and learns from them, so it can explore behaviors that are not in any static dataset. DPO is offline: it learns from a fixed set of pairs and never sees its own new generations. When your preference data does not cover the regions of behavior the model drifts into, DPO has less to work with. For many alignment tasks the offline data is good enough, but if you need the model to discover and refine genuinely new behaviors, on-policy methods still have an edge.
**Reward model reuse.** One thing RLHF gives you that DPO does not is a standalone reward model. A reward model is independently useful for things like best-of-N sampling at inference time, filtering generated data, or evaluation. If you want one of those, you may end up training a reward model regardless.
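Best-of-N is simple enough to sketch in a few lines. The example below assumes you already have N candidate responses and some scoring function. `reward_fn` stands in for a trained reward model and is purely hypothetical, as is the toy scorer used to exercise it.

```python
def best_of_n(prompt, candidates, reward_fn):
    """Return the candidate that reward_fn scores highest for this prompt."""
    if not candidates:
        raise ValueError("need at least one candidate response")
    return max(candidates, key=lambda response: reward_fn(prompt, response))


def toy_reward(prompt, response):
    """Stand-in scorer (not a real reward model): counts prompt words reused."""
    prompt_words = set(prompt.lower().split())
    return sum(1 for word in response.lower().split() if word in prompt_words)


candidates = [
    "I am not sure.",
    "A mutex is a lock that protects shared state.",
    "Ask someone else.",
]
print(best_of_n("what is a mutex", candidates, toy_reward))
```

With a DPO-only pipeline there is no standalone scorer to pass in as `reward_fn`, which is the gap this section describes.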
A reasonable summary: DPO gets you most of the alignment quality of RLHF for a fraction of the engineering effort, and for a lot of teams that trade is clearly worth it. RLHF remains relevant when you need on-policy exploration or a reusable reward model.
## The Variants: IPO, KTO, and ORPO
DPO kicked off a small family of methods that tweak its assumptions. Each addresses a specific weakness.
### IPO (Identity Preference Optimization)
DPO can overfit to the preference data. Because the loss keeps pushing the chosen response up and the rejected response down without a natural stopping point, the model can drive the probability gap to extremes, especially when the data is noisy or the two responses are nearly equally good. IPO (Azar et al., 2023) replaces the logistic loss with a squared-error term that targets a specific margin between chosen and rejected, rather than an unbounded "make the gap as large as possible" objective. The result is a regularized version of DPO that is more robust when preferences are weak or inconsistent. If you find DPO overfitting and collapsing onto your training pairs, IPO is the first thing to try.
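A minimal sketch of the squared-error idea, for a single pair and in plain Python. It assumes the commonly cited form of the IPO loss, in which the gap between the two log-ratios is regressed toward a target of `1 / (2 * tau)`. The function name and the default `tau` are assumptions of this example.

```python
def ipo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, tau=0.1):
    """Squared distance between the log-ratio gap and a fixed target margin."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    gap = chosen_logratio - rejected_logratio
    target = 1.0 / (2.0 * tau)
    # Unlike the logistic loss, this penalizes overshooting the target
    # as well as undershooting it.
    return (gap - target) ** 2


# Made-up numbers: with tau=0.1 the target gap is 5.
on_target = ipo_pair_loss(-10.0, -14.0, -13.0, -12.0)   # gap = 3 - (-2) = 5
overshoot = ipo_pair_loss(-5.0, -14.0, -13.0, -12.0)    # gap = 8 - (-2) = 10
print(on_target, overshoot)
```

Because overshooting is penalized, the model has a defined stopping point for each pair instead of an incentive to widen the gap indefinitely.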
### KTO (Kahneman-Tversky Optimization)
DPO requires paired data: two responses to the same prompt, with a clear winner. That data is expensive to collect and not always available. KTO (Ethayarajh et al., 2024) removes the pairing requirement. It works with individual responses each labeled simply as desirable or undesirable. The name comes from prospect theory in behavioral economics (Kahneman and Tversky), and the loss is shaped to weight gains and losses asymmetrically, mirroring how humans actually judge outcomes. The practical payoff is data flexibility: if all you have is a pile of thumbs-up and thumbs-down signals rather than carefully constructed pairs, KTO can use them directly. This matches a lot of real production feedback, where users react to single responses rather than comparing two.
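The per-example shape of the objective can be sketched as below. This is a simplified illustration: in the KTO paper the reference point is an estimate of the KL divergence between policy and reference computed over a batch, whereas here it is passed in as a plain number. The function and parameter names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def kto_example_loss(policy_logp, ref_logp, desirable, reference_point=0.0,
                     beta=0.1, weight_desirable=1.0, weight_undesirable=1.0):
    """Loss for one unpaired response labeled desirable or undesirable."""
    implied_reward = policy_logp - ref_logp
    if desirable:
        # Rewarded for pushing the log-ratio above the reference point.
        value = weight_desirable * sigmoid(beta * (implied_reward - reference_point))
        return weight_desirable - value
    # Rewarded for pushing the log-ratio below the reference point.
    value = weight_undesirable * sigmoid(beta * (reference_point - implied_reward))
    return weight_undesirable - value


# Each training example is a single response with a binary label.
print(kto_example_loss(-10.0, -12.0, desirable=True))
print(kto_example_loss(-10.0, -12.0, desirable=False))
```

The two weights are where the asymmetry comes in: setting them differently lets losses count for more (or less) than gains, and can also compensate for an imbalance between thumbs-up and thumbs-down examples.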
### ORPO (Odds Ratio Preference Optimization)
Both DPO and RLHF assume you have already done supervised fine-tuning, so they start from an SFT model and a reference. ORPO (Hong et al., 2024) folds preference optimization into the SFT step itself. It adds an odds-ratio penalty to the standard supervised loss that discourages the model from generating rejected-style responses while it learns from the chosen ones. There is no separate reference model and no separate alignment stage. You go from a base model to an aligned model in one training run. That makes ORPO attractive when you want to simplify the pipeline as much as possible, though combining the two objectives means you give up some of the independent control you get from doing SFT and preference tuning as distinct steps.
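Here is a rough single-pair sketch of how the two terms combine. It assumes the formulation in which each response's likelihood is the exponential of its average per-token log-probability, and the odds-ratio term is a logistic loss on the log odds ratio. The function names and the weight `lam` are assumptions of this example.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def log_odds(avg_logp):
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_pair_loss(chosen_avg_logp, rejected_avg_logp, lam=0.1):
    """Supervised loss on the chosen response plus a weighted odds-ratio penalty.

    Both inputs are average per-token log-probabilities under the model
    being trained. No reference model appears anywhere.
    """
    sft_term = -chosen_avg_logp
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    odds_ratio_term = softplus(-log_odds_ratio)  # -log(sigmoid(log_odds_ratio))
    return sft_term + lam * odds_ratio_term


print(orpo_pair_loss(-0.5, -2.0))  # chosen already more likely than rejected
print(orpo_pair_loss(-2.0, -0.5))  # rejected more likely: both terms are larger
```

The weight `lam` is the single knob that trades off imitation of the chosen responses against separation from the rejected ones, which is the loss of independent control mentioned above.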
### How to Choose
A rough decision guide:
- Start with **DPO** if you have clean paired preference data and a working SFT model. It is the well-understood default.
- Reach for **IPO** if DPO is overfitting or your preference labels are noisy.
- Use **KTO** when your feedback is per-response (desirable/undesirable) rather than paired comparisons.
- Consider **ORPO** when you want a single-stage pipeline and are willing to combine SFT and alignment.
None of these is strictly better than the others. They make different assumptions about your data and your tolerance for pipeline complexity.
## Where Inference Fits In
Alignment is a training-time concern, but it shows up at inference time in a couple of ways worth flagging. First, preference tuning changes the shape of a model's output distribution, which can affect how it behaves under sampling, with temperature, and with techniques like best-of-N. If you skipped a reward model by using DPO, you do not have one available for best-of-N reranking at serve time, so plan accordingly. Second, the value of a well-aligned model only materializes when you can actually serve it fast enough to use. A model that produces great responses but takes ten seconds to start streaming will not feel good to anyone, no matter how well it was tuned.
That second point is where serving infrastructure matters. Whatever method you use to align a model, you still need to run it in production with low latency. [General Compute](https://generalcompute.com) provides an OpenAI-compatible API on custom inference hardware, so once you have aligned an open model with DPO or one of its variants, you can deploy it and get fast responses without managing the serving stack yourself. You can check the [docs](https://generalcompute.com) to see how to point an existing client at it.
## Wrapping Up
DPO took the RLHF objective and showed that the reward model and the RL loop were not strictly necessary. By rewriting the reward in terms of the log-probability ratio between the policy and a frozen reference, it turned alignment into something that looks like supervised learning on preference pairs. That simplification is the reason DPO has become a default for a lot of open-model fine-tuning.
The variants fill in the gaps. IPO handles noisy preferences, KTO handles unpaired feedback, and ORPO collapses the whole thing into a single training stage. If you are aligning a model today, DPO is a sensible starting point, and the alternatives are there for when your data or your constraints push you somewhere else. The math underneath is less intimidating than it first appears: you are just teaching the model to prefer the responses people preferred, measured against where it started.