Preference Fine-Tuning

Preference fine-tuning allows you to train models using pairs of preferred and non-preferred examples. This approach is more effective than standard fine-tuning when you have paired examples that show which responses your model should generate and which it should avoid. We use Direct Preference Optimization (DPO) for this type of fine-tuning. Before proceeding: Review our How-to: Fine-tuning guide for an overview of the fine-tuning process.

Data Preparation

Your dataset should contain examples with:

An input field with messages in the conversational format.
A preferred_output field with the ideal assistant response.
A non_preferred_output field with a suboptimal assistant response.

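Both outputs must contain exactly one message from the assistant role. Format your data in JSONL, with one example per line. Each example is structured as follows (pretty-printed here for readability; in the file it occupies a single line):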

{
  "input": {
    "messages": [
      {
        "role": "assistant",
        "content": "Hello, how can I assist you today?"
      },
      {
        "role": "user",
        "content": "Can you tell me about the rise of the Roman Empire?"
      }
    ]
  },
  "preferred_output": [
    {
      "role": "assistant",
      "content": "The Roman Empire rose from a small city-state founded in 753 BCE. Through military conquests and strategic alliances, Rome expanded across the Italian peninsula. After the Punic Wars, it grew even stronger, and in 27 BCE, Augustus became the first emperor, marking the start of the Roman Empire. This led to a period of peace and prosperity known as the Pax Romana."
    }
  ],
  "non_preferred_output": [
    {
      "role": "assistant",
      "content": "The Roman Empire rose due to military strength and strategic alliances."
    }
  ]
}
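As a minimal sketch of the format rules above, the following Python snippet checks one example and serializes it to a single JSONL line. The helper names validate_preference_example and to_jsonl_line are hypothetical and not part of any SDK; the checks cover only the constraints stated in this section.

```python
import json


def validate_preference_example(example: dict) -> None:
    """Raise ValueError if the example breaks the rules described in this section."""
    input_field = example.get("input")
    if not isinstance(input_field, dict):
        raise ValueError("input must be an object with a messages list")
    messages = input_field.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("input.messages must be a non-empty list")
    for key in ("preferred_output", "non_preferred_output"):
        output = example.get(key)
        # Each output must hold exactly one message from the assistant role.
        if not isinstance(output, list) or len(output) != 1:
            raise ValueError(f"{key} must contain exactly one message")
        if output[0].get("role") != "assistant":
            raise ValueError(f"{key} must come from the assistant role")


def to_jsonl_line(example: dict) -> str:
    """Validate the example and serialize it onto a single line."""
    validate_preference_example(example)
    return json.dumps(example, ensure_ascii=False)


example = {
    "input": {
        "messages": [
            {"role": "user", "content": "Can you tell me about the rise of the Roman Empire?"}
        ]
    },
    "preferred_output": [
        {"role": "assistant", "content": "Rome grew from a small city-state into an empire."}
    ],
    "non_preferred_output": [
        {"role": "assistant", "content": "It rose due to military strength."}
    ],
}

line = to_jsonl_line(example)
print(line)
```

Writing each returned line followed by a newline to a file produces a JSONL dataset in the shape described above.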

Preference fine-tuning does not support pretokenized datasets. Contact us if you need to use them for preference training.

Launching preference fine-tuning

Hyperparameters

Set --training-method="dpo"
The --dpo-beta parameter controls how much the model is allowed to deviate from its reference (or pre-tuned) model during fine-tuning. The default value is 0.1, but you can experiment with values between 0.05 and 0.9.
- A lower value of beta (e.g., 0.1) allows the model to update more aggressively toward preferred responses.
- A higher value of beta (e.g., 0.7) keeps the updated model closer to the reference behavior.
The --dpo-normalize-logratios-by-length parameter (optional, default is False) enables normalization of log ratios by sample length during the DPO loss calculation.
The --rpo-alpha coefficient (optional, default is 0.0) incorporates the NLL loss on selected samples with the corresponding weight.
The --simpo-gamma coefficient (optional, default is 0.0) adds a margin to the loss calculation, force-enables log ratio normalization (--dpo-normalize-logratios-by-length), and excludes reference logits from the loss computation. The resulting loss function is equivalent to the one used in the SimPO paper.

together fine-tuning create \
  --training-file $FILE_ID \
  --model "meta-llama/Llama-3.2-3B-Instruct" \
  --wandb-api-key $WANDB_API_KEY \
  --lora \
  --training-method "dpo" \
  --dpo-beta 0.2
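To build intuition for --dpo-beta, here is a rough sketch of the standard DPO loss for a single preference pair, written in plain Python. It follows the formulation from the DPO paper and is not taken from the platform's implementation; the function names and the example log-probabilities are illustrative assumptions, and the optional length normalization, RPO, and SimPO terms are left out.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one pair of summed sequence log-probabilities."""
    # How far the trained model has moved from the reference on each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


# Illustrative values: the policy has shifted 2 nats toward the preferred
# response relative to the reference model.
for beta in (0.1, 0.7):
    loss = dpo_loss(-10.0, -14.0, -11.0, -13.0, beta=beta)
    print(f"beta={beta}: loss={loss:.4f}")
```

Under this formulation, a higher beta makes the same shift away from the reference reduce the loss more, so the loss flattens out after a smaller deviation. This is one way to read the statement that higher beta values keep the model closer to the reference behavior.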

Note

For LoRA long-context fine-tuning, we currently use half of the context length for the preferred response and half for the non-preferred response. For example, if you are using a 32K model, the effective context length will be 16K.
Preference fine-tuning calculates loss based on the preferred and non-preferred outputs. Therefore, the --train-on-inputs flag is ignored with preference fine-tuning.

Metrics

In addition to standard metrics like losses, for DPO we report:

Accuracies: the percentage of examples for which the reward for the preferred response is greater than the reward for the non-preferred response.
KL Divergence: a measure of how far the output distribution of the trained model has moved from that of the reference model.
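The sketch below illustrates only the Accuracies metric, not the KL divergence. It assumes the implicit reward used in the DPO paper (beta times the log-probability ratio between the trained model and the reference model); the platform's exact computation is not specified here, and the PairLogProbs type and the sample values are hypothetical.

```python
from typing import NamedTuple


class PairLogProbs(NamedTuple):
    """Summed log-probabilities for one preference pair (hypothetical container)."""
    policy_preferred: float
    ref_preferred: float
    policy_non_preferred: float
    ref_non_preferred: float


def implicit_reward(policy_logp: float, ref_logp: float, beta: float = 0.1) -> float:
    """DPO-style implicit reward: beta * log(pi_policy / pi_ref)."""
    return beta * (policy_logp - ref_logp)


def reward_accuracy(pairs: list[PairLogProbs], beta: float = 0.1) -> float:
    """Fraction of pairs whose preferred response gets the higher reward."""
    wins = 0
    for p in pairs:
        preferred = implicit_reward(p.policy_preferred, p.ref_preferred, beta)
        non_preferred = implicit_reward(p.policy_non_preferred, p.ref_non_preferred, beta)
        if preferred > non_preferred:
            wins += 1
    return wins / len(pairs)


batch = [
    PairLogProbs(-10.0, -12.0, -15.0, -14.0),  # preferred reward is higher
    PairLogProbs(-8.0, -8.5, -9.0, -9.0),      # preferred reward is higher
    PairLogProbs(-20.0, -18.0, -11.0, -12.0),  # non-preferred reward is higher
    PairLogProbs(-5.0, -6.0, -7.0, -6.5),      # preferred reward is higher
]

accuracy = reward_accuracy(batch)
print(accuracy)  # 0.75
```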

Combining methods: supervised fine-tuning & preference fine-tuning

Supervised fine-tuning (SFT) is the default method on our platform. The recommended approach is to perform SFT first and then follow it with preference fine-tuning:

First perform supervised fine-tuning (SFT) on your data.
Then refine with preference fine-tuning using continued fine-tuning on your SFT checkpoint.

Performing SFT on your dataset prior to DPO can significantly increase the resulting model quality, especially if your training data differs significantly from the data the base model observed during pretraining. To perform SFT, you can concatenate the context with the preferred output and use one of our SFT data formats.
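The concatenation step can be sketched as follows. This example assumes the SFT target is a conversational record with a top-level "messages" list; that target shape and the helper name preference_to_sft are assumptions for illustration, so check the SFT data format documentation before relying on them.

```python
import json


def preference_to_sft(example: dict) -> dict:
    """Append the preferred response to the input context.

    Assumes a conversational SFT record of the form {"messages": [...]}.
    """
    context = list(example["input"]["messages"])
    preferred = list(example["preferred_output"])
    return {"messages": context + preferred}


preference_example = {
    "input": {
        "messages": [
            {"role": "assistant", "content": "Hello, how can I assist you today?"},
            {"role": "user", "content": "Can you tell me about the rise of the Roman Empire?"},
        ]
    },
    "preferred_output": [
        {"role": "assistant", "content": "The Roman Empire rose from a small city-state."}
    ],
    "non_preferred_output": [
        {"role": "assistant", "content": "It rose due to military strength."}
    ],
}

sft_example = preference_to_sft(preference_example)
print(json.dumps(sft_example, ensure_ascii=False))
```

The non-preferred output is dropped at this stage; it is used again only in the later preference fine-tuning run on the SFT checkpoint.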
