Learn how to use preference fine-tuning on the Together Fine-Tuning Platform
Preference fine-tuning expects each training example to contain:

- an `input` field with messages in the conversational format
- a `preferred_output` field with the ideal assistant response
- a `non_preferred_output` field with a suboptimal assistant response

The dataset must be in JSONL format, with each line structured as shown in the example below.
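For illustration, a single example might look like the sketch below (shown pretty-printed; in the actual file each example occupies one line). The exact nesting of the `input` messages and the list form of the output fields are assumptions here, not a guaranteed schema.

```json
{
  "input": {
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  },
  "preferred_output": [
    {"role": "assistant", "content": "The capital of France is Paris."}
  ],
  "non_preferred_output": [
    {"role": "assistant", "content": "I'm not sure, maybe Lyon?"}
  ]
}
```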
To run a preference fine-tuning job, pass `--training-method="dpo"` when creating the job.
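As a sketch, a launch command could look like the following; it assumes the Together CLI's `together fine-tuning create` command with an already uploaded dataset, and the file ID and base model name are placeholders:

```shell
# Placeholders: substitute your uploaded preference dataset's file ID and a supported base model.
together fine-tuning create \
  --training-file "<preference-dataset-file-id>" \
  --model "<base-model-name>" \
  --training-method="dpo" \
  --dpo-beta 0.1
```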
The `--dpo-beta` parameter controls how much the model is allowed to deviate from its reference (pre-tuned) model during fine-tuning. The default value is 0.1, but you can experiment with values between 0.05 and 0.9.
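For intuition, β is the scaling factor in the standard per-example DPO objective, where larger values keep the tuned policy closer to the reference:

$$
\mathcal{L}_{\text{DPO}} = -\log \sigma\!\left(\beta\left[\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right]\right)
$$

Here $x$ is the prompt from `input`, $y_w$ the preferred response, $y_l$ the non-preferred response, $\pi_\theta$ the model being fine-tuned, and $\pi_{\text{ref}}$ the reference model.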
The `--dpo-normalize-logratios-by-length` parameter (optional, default is `False`) enables normalization of log ratios by sample length during the DPO loss calculation.
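As a sketch of what this means, each log ratio is divided by the token length of its response, so the bracketed term in the DPO loss above becomes:

$$
\beta\left[\frac{1}{|y_w|}\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \frac{1}{|y_l|}\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right]
$$

This keeps long and short responses on a comparable scale; the exact placement of the normalization is an assumption here.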
The `--rpo-alpha` coefficient (optional, default is 0.0) adds the NLL loss on the preferred samples to the objective, weighted by this coefficient.
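Conceptually, this turns the objective into a weighted sum of the preference loss and a supervised term on the preferred response (a sketch; how the NLL term is normalized is an implementation detail):

$$
\mathcal{L} = \mathcal{L}_{\text{DPO}} + \alpha \cdot \mathcal{L}_{\text{NLL}}(y_w \mid x)
$$

where α is the value of `--rpo-alpha`.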
The `--simpo-gamma` coefficient (optional, default is 0.0) adds a margin to the loss calculation, force-enables log-ratio normalization by length (`--dpo-normalize-logratios-by-length`), and excludes reference logits from the loss computation. The resulting loss function is equivalent to the one used in the SimPO paper.
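Combined with `--dpo-beta`, these changes give the length-normalized, reference-free per-example loss from the SimPO paper, with γ acting as a target reward margin between the preferred and non-preferred responses:

$$
\mathcal{L}_{\text{SimPO}} = -\log \sigma\!\left(\frac{\beta}{|y_w|}\log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|}\log \pi_\theta(y_l \mid x) - \gamma\right)
$$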
The `--train-on-inputs` flag is ignored with preference fine-tuning.