When to use preference tuning
Consider using preference tuning when:- You have ranked pairs of responses for the same prompt. Standard supervised fine-tuning (SFT) only learns from a single target completion per example. DPO learns from the gap between good and bad (preferred and dispreferred) responses.
- You’re polishing a model that already works. DPO is most effective on top of an SFT-tuned checkpoint. Do SFT first, then continue with DPO from that checkpoint.
- You want to reduce specific failure modes. Pair the failure as
non_preferred_outputagainst the desired behavior.
Prepare your data
Each line in the JSONL file carries:input.messages: the context, in conversational format.preferred_output: a single assistant message representing the ideal response.non_preferred_output: a single assistant message representing the suboptimal response.
Launch a DPO job
Settraining_method to "dpo". The full list of DPO parameters lives in the API reference.
DPO parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
dpo_beta | float | 0.1 | How far the model is allowed to drift from the reference model. Lower values (around 0.1) update more aggressively toward the preferred output; higher values (around 0.7) stay closer to the reference. Useful range is 0.05 to 0.9. |
dpo_normalize_logratios_by_length | bool | false | Normalize log ratios by sample length during loss calculation. |
rpo_alpha | float | 0.0 | Incorporate the NLL loss on selected samples with this weight. |
simpo_gamma | float | 0.0 | Add a margin to the loss, force-enable length normalization, and exclude reference logits. Matches the SimPO loss. |
For LoRA long-context fine-tuning, half the context length is used for the preferred response and half for the non-preferred response. On a 32k model, effective context per side is 16k. Preference tuning ignores the
train_on_inputs flag because the loss is computed from the preferred and non-preferred outputs.DPO metrics
Beyond standard training metrics, DPO jobs report:- Accuracy: The share of examples where the reward for the preferred response exceeds the reward for the non-preferred response.
- KL divergence: Similarity of output distributions between the trained model and the reference model.
- Per-side log probabilities: For both preferred and non-preferred outputs, useful for debugging stalled runs.
Combine SFT and DPO
The recommended workflow when your training data differs substantially from the base model’s pretraining distribution:- Run a supervised fine-tune on the concatenation of context and preferred output, using one of the supported SFT data formats.
- Continue training the resulting checkpoint with DPO. Pass the previous job’s checkpoint to
from_checkpoint:
Python