Preference fine-tuning trains a model on paired examples that show which responses you want it to generate and which it should avoid. Together AI implements this with Direct Preference Optimization (DPO). DPO tends to be more effective than supervised fine-tuning when you have ranked outputs for the same prompt.

When to use preference tuning

Consider using preference tuning when:
  • You have ranked pairs of responses for the same prompt. Standard supervised fine-tuning (SFT) learns from only a single target completion per example. DPO learns from the contrast between a preferred and a non-preferred response.
  • You’re polishing a model that already works. DPO is most effective on top of an SFT-tuned checkpoint. Do SFT first, then continue with DPO from that checkpoint.
  • You want to reduce specific failure modes. Put the failing response in non_preferred_output and the desired behavior in preferred_output.
Skip DPO if your dataset is single-target. Use supervised fine-tuning instead.

Prepare your data

Each line in the JSONL file carries:
  • input.messages: the context, in conversational format.
  • preferred_output: a single assistant message representing the ideal response.
  • non_preferred_output: a single assistant message representing the suboptimal response.
{
  "input": {
    "messages": [
      {"role": "assistant", "content": "Hello, how can I assist you today?"},
      {"role": "user", "content": "Can you tell me about the rise of the Roman Empire?"}
    ]
  },
  "preferred_output": [
    {"role": "assistant", "content": "The Roman Empire rose from a small city-state founded in 753 BCE. Through military conquests and strategic alliances, Rome expanded across the Italian peninsula. After the Punic Wars, it grew even stronger, and in 27 BCE, Augustus became the first emperor, marking the start of the Roman Empire."}
  ],
  "non_preferred_output": [
    {"role": "assistant", "content": "The Roman Empire rose due to military strength and strategic alliances."}
  ]
}
Each output must contain exactly one assistant message. Preference tuning does not support pre-tokenized Parquet datasets. Contact us if you need this feature.
For tool-calling preference data, see data preparation. For reasoning preference data, see reasoning preference format.
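The format rules above can be checked locally before uploading a file. The following is a minimal sketch, not an official validator: the function names validate_preference_record and validate_jsonl are hypothetical, and the checks cover only the rules stated on this page (the three required keys, a non-empty input.messages list, and exactly one assistant message per output).

```python
import json

REQUIRED_KEYS = ("input", "preferred_output", "non_preferred_output")


def validate_preference_record(record):
    """Return a list of problems found in one preference example (empty if none)."""
    if not isinstance(record, dict):
        return ["record must be a JSON object"]

    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in record]
    if problems:
        return problems

    # The context must be a non-empty list of messages under input.messages.
    context = record["input"]
    messages = context.get("messages") if isinstance(context, dict) else None
    if not isinstance(messages, list) or not messages:
        problems.append("input.messages must be a non-empty list")

    # Each output must be a list holding exactly one assistant message.
    for key in ("preferred_output", "non_preferred_output"):
        output = record[key]
        if not isinstance(output, list) or len(output) != 1:
            problems.append(f"{key} must contain exactly one message")
        elif not isinstance(output[0], dict) or output[0].get("role") != "assistant":
            problems.append(f"{key} must be an assistant message")
    return problems


def validate_jsonl(lines):
    """Map 1-based line numbers to the problems found on each invalid line."""
    report = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            report[number] = [f"invalid JSON: {exc.msg}"]
            continue
        issues = validate_preference_record(record)
        if issues:
            report[number] = issues
    return report
```

Passing an open file handle to validate_jsonl works because it only iterates over lines; an empty report means every line passed these checks, not that the platform will necessarily accept the file.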

Launch a DPO job

Set training_method to "dpo". The full list of DPO parameters lives in the API reference.
from together import Together

client = Together()

job = client.fine_tuning.create(
    training_file="<FILE_ID>",
    model="meta-llama/Llama-3.2-3B-Instruct",
    lora=True,
    training_method="dpo",
    dpo_beta=0.2,
)
print(job.id)

DPO parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| dpo_beta | float | 0.1 | How far the model is allowed to drift from the reference model. Lower values (around 0.1) update more aggressively toward the preferred output; higher values (around 0.7) stay closer to the reference. Useful range is 0.05 to 0.9. |
| dpo_normalize_logratios_by_length | bool | false | Normalize log ratios by sample length during loss calculation. |
| rpo_alpha | float | 0.0 | Incorporate the NLL loss on selected samples with this weight. |
| simpo_gamma | float | 0.0 | Add a margin to the loss, force-enable length normalization, and exclude reference logits. Matches the SimPO loss. |
For LoRA long-context fine-tuning, half the context length is used for the preferred response and half for the non-preferred response. On a 32k model, the effective context per side is 16k.
Preference tuning ignores the train_on_inputs flag because the loss is computed only from the preferred and non-preferred outputs.
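To show where dpo_beta enters the objective, here is a minimal sketch of the standard published DPO loss for a single pair, computed from summed sequence log-probabilities. It is an illustration under assumptions, not Together AI's implementation: the function name dpo_loss and its arguments are hypothetical, and the length normalization, rpo_alpha, and simpo_gamma variants are left out.

```python
import math


def dpo_loss(policy_preferred_logp, policy_non_preferred_logp,
             ref_preferred_logp, ref_non_preferred_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch)."""
    # Log-ratio of the trained policy to the frozen reference, per side.
    preferred_logratio = policy_preferred_logp - ref_preferred_logp
    non_preferred_logratio = policy_non_preferred_logp - ref_non_preferred_logp

    # beta scales the margin between the two log-ratios.
    x = beta * (preferred_logratio - non_preferred_logratio)

    # -log(sigmoid(x)) written in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
```

When the policy equals the reference, the margin is zero and the loss is log 2 for any beta. For a fixed positive margin, a larger beta gives a lower loss, so less movement away from the reference is needed to reduce the objective, which is one way to read the "stays closer to the reference" behavior described for higher values.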

DPO metrics

Beyond standard training metrics, DPO jobs report:
  • Accuracy: The share of examples where the reward for the preferred response exceeds the reward for the non-preferred response.
  • KL divergence: How far the trained model’s output distribution has moved from the reference model’s. Lower values mean the two distributions are more similar.
  • Per-side log probabilities: For both preferred and non-preferred outputs, useful for debugging stalled runs.
For how to retrieve these values during or after a run, see monitoring a fine-tuning job.
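The accuracy metric can be illustrated with a short sketch. It assumes the implicit reward from the standard DPO formulation, beta times the log-probability ratio between the trained model and the reference; the exact reward definition used by the platform is not stated on this page, and the names implicit_reward and preference_accuracy are hypothetical.

```python
def implicit_reward(policy_logp, ref_logp, beta):
    """Implicit DPO reward (assumed definition): beta * log(pi / pi_ref)."""
    return beta * (policy_logp - ref_logp)


def preference_accuracy(pairs, beta=0.1):
    """Share of pairs where the preferred reward exceeds the non-preferred one.

    Each pair is a tuple of summed sequence log-probabilities:
    (policy_preferred, ref_preferred, policy_non_preferred, ref_non_preferred).
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one pair is required")

    wins = 0
    for policy_pref, ref_pref, policy_non_pref, ref_non_pref in pairs:
        reward_pref = implicit_reward(policy_pref, ref_pref, beta)
        reward_non_pref = implicit_reward(policy_non_pref, ref_non_pref, beta)
        if reward_pref > reward_non_pref:
            wins += 1
    return wins / len(pairs)
```

Under this definition a model identical to its reference scores 0.0, because every reward is zero and the comparison is strict; accuracy rises only as the model moves the preferred side above the non-preferred side relative to the reference.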

Combine SFT and DPO

When your training data differs substantially from the base model’s pretraining distribution, the recommended workflow is:
  1. Run a supervised fine-tune on the concatenation of context and preferred output, using one of the supported SFT data formats.
  2. Continue training the resulting checkpoint with DPO. Pass the previous job’s checkpoint to from_checkpoint:
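from together import Together

client = Together()

# Continue training with DPO from the checkpoint produced by the SFT job.
job = client.fine_tuning.create(
    training_file="<DPO_FILE_ID>",
    from_checkpoint="<SFT_JOB_ID>",
    training_method="dpo",
    dpo_beta=0.2,
)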
Running SFT first and then DPO usually produces a noticeably better model than DPO alone for out-of-domain tasks.
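The SFT dataset for step 1 can be derived from the preference file itself by appending each preferred output to its context. The sketch below assumes that a conversational layout of the form {"messages": [...]} is among the supported SFT data formats; the helper names preference_to_sft and convert_lines are hypothetical.

```python
import json


def preference_to_sft(record):
    """Build one conversational SFT example: context plus the preferred reply."""
    context = record["input"]["messages"]
    preferred = record["preferred_output"]
    # Copy both lists so the original preference record is left unchanged.
    return {"messages": list(context) + list(preferred)}


def convert_lines(lines):
    """Yield one SFT JSONL line for each non-blank preference JSONL line."""
    for line in lines:
        if line.strip():
            sft_example = preference_to_sft(json.loads(line))
            yield json.dumps(sft_example, ensure_ascii=False)
```

Writing the yielded lines to a new file gives an SFT dataset that covers the same prompts as the DPO dataset; the non-preferred outputs are simply not used in this step.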