The fine-tuning API accepts two file formats: JSONL for text data and Parquet for pre-tokenized data. Pick JSONL unless you need to set custom attention masks or labels, or you want to skip tokenization to speed up repeated experiments.
The file size limit for either format is 100 GB. The file ID returned by client.files.upload() is what you pass as training_file to a fine-tuning job.
Each line of a JSONL file is one training example, formatted to match your task.
| Format | When to use | Key fields |
|---|---|---|
| Conversational | Multi-turn chat or single-turn chat. | messages |
| Instruction | Prompt and completion pairs. | prompt, completion |
| Preference | Paired preferred and dispreferred outputs for DPO. | input, preferred_output, non_preferred_output |
| Generic text | Free-form text completion. | text |
If a file could match more than one format (for example, its lines carry both text and messages), the server rejects it. Remove unused fields before uploading, which also speeds up the transfer.
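To catch ambiguous lines before uploading, a short local check can compare each line's keys against the key fields in the table above. This is a minimal sketch, not the server's validation logic: the names FORMAT_KEYS, detect_formats, and find_ambiguous_lines are hypothetical, and the rule "a line matches a format when all of that format's key fields are present" is an assumption.

```python
import json

# Key fields per format, copied from the table above.
FORMAT_KEYS = {
    "conversational": {"messages"},
    "instruction": {"prompt", "completion"},
    "preference": {"input", "preferred_output", "non_preferred_output"},
    "generic_text": {"text"},
}


def detect_formats(example: dict) -> list[str]:
    """Return every format whose key fields are all present in the example."""
    keys = set(example)
    return [name for name, required in FORMAT_KEYS.items() if required <= keys]


def find_ambiguous_lines(lines):
    """Yield (line_number, formats) for lines matching zero or several formats."""
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        formats = detect_formats(json.loads(line))
        if len(formats) != 1:
            yield number, formats


sample = [
    '{"text": "plain completion"}',
    '{"text": "plain", "messages": [{"role": "user", "content": "hi"}]}',
]
print(list(find_ambiguous_lines(sample)))
```

The second sample line is reported because it carries both text and messages; dropping one of the two fields resolves it.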
Conversational data
Conversations are represented using a messages array. Each message has a role (system, user, or assistant) and content. The conversation must start with a system or user message, and then alternate between user and assistant.
{
  "messages": [
    {"role": "system", "content": "This is a system prompt."},
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing well, thank you! How can I help you?"},
    {"role": "user", "content": "Can you explain machine learning?"},
    {"role": "assistant", "content": "Machine learning is..."}
  ]
}
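The ordering rule for roles can be checked locally before upload. The sketch below is an illustration under stated assumptions, not the server's validator: check_roles is a hypothetical helper, it treats a single leading system message as optional, and it does not handle tool-role messages.

```python
def check_roles(messages: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the ordering looks valid."""
    if not messages:
        return ["messages is empty"]

    problems = []
    roles = [message.get("role") for message in messages]
    if roles[0] not in ("system", "user"):
        problems.append(f"first role is {roles[0]!r}, expected system or user")

    # Skip a leading system prompt, then expect user, assistant, user, ...
    turns = roles[1:] if roles[0] == "system" else roles
    for index, role in enumerate(turns):
        expected = "user" if index % 2 == 0 else "assistant"
        if role != expected:
            problems.append(f"turn {index}: got {role!r}, expected {expected!r}")
    return problems


conversation = [
    {"role": "system", "content": "This is a system prompt."},
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing well, thank you!"},
]
print(check_roles(conversation))
```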
By default, training computes loss only on assistant messages. Pass train_on_inputs=True to include the rest.
You can also set per-message weights to control which messages contribute to the loss. Only 0 and 1 are supported, and messages with weight=0 are masked.
{
  "messages": [
    {"role": "user", "content": "Question A", "weight": 0},
    {"role": "assistant", "content": "Answer A", "weight": 1}
  ]
}
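One common use of per-message weights is to train only on the final assistant reply of a multi-turn conversation. The helper below is a hypothetical sketch of that preprocessing step. It assumes a weight key is accepted on every message, as in the example above, and it leaves the original example unmodified.

```python
import copy


def train_on_last_assistant_only(example: dict) -> dict:
    """Return a copy where the final assistant message has weight 1, all others 0."""
    result = copy.deepcopy(example)
    messages = result["messages"]
    assistant_positions = [
        index for index, message in enumerate(messages)
        if message["role"] == "assistant"
    ]
    if not assistant_positions:
        raise ValueError("example has no assistant message to train on")

    last = assistant_positions[-1]
    for index, message in enumerate(messages):
        message["weight"] = 1 if index == last else 0
    return result


example = {
    "messages": [
        {"role": "user", "content": "Question A"},
        {"role": "assistant", "content": "Answer A"},
        {"role": "user", "content": "Question B"},
        {"role": "assistant", "content": "Answer B"},
    ]
}
weighted = train_on_last_assistant_only(example)
print([message["weight"] for message in weighted["messages"]])
```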
The dataset is automatically formatted into the model’s chat template if one is defined. Instruction-tuned models always have a chat template; base models usually don’t.
Instruction data
Each line carries a prompt and a completion field.
{"prompt": "...", "completion": "..."}
{"prompt": "...", "completion": "..."}
By default, training computes loss only on completion. Pass train_on_inputs=True to include prompt.
Generic text data
Each line carries a single text field. Use this for plain text completions.
{"text": "..."}
{"text": "..."}
Preference data
Used for preference fine-tuning with DPO. Each line carries:
input.messages: a context in conversational format.
preferred_output: a single assistant message representing the ideal response.
non_preferred_output: a single assistant message representing the suboptimal response.
{
  "input": {
    "messages": [
      {"role": "assistant", "content": "Hi! I'm powered by Together AI's open-source models. Ask me anything."},
      {"role": "user", "content": "What's open-source AI?"}
    ]
  },
  "preferred_output": [
    {"role": "assistant", "content": "Open-source AI means models are free to use, modify, and share. Together AI makes it easy to fine-tune and deploy them."}
  ],
  "non_preferred_output": [
    {"role": "assistant", "content": "It means the code is public."}
  ]
}
Each output must contain exactly one assistant message.
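The structural rules for a preference line can be checked locally. This is a minimal sketch based only on the fields described above; check_preference_example is a hypothetical helper and the server may enforce additional constraints.

```python
def check_preference_example(example: dict) -> list[str]:
    """Return a list of structural problems; an empty list means none were found."""
    problems = []

    context = example.get("input")
    messages = context.get("messages") if isinstance(context, dict) else None
    if not isinstance(messages, list) or not messages:
        problems.append("input.messages must be a non-empty list")

    for key in ("preferred_output", "non_preferred_output"):
        output = example.get(key)
        if not isinstance(output, list) or len(output) != 1:
            problems.append(f"{key} must contain exactly one message")
        elif output[0].get("role") != "assistant":
            problems.append(f"{key} must be an assistant message")
    return problems


good = {
    "input": {"messages": [{"role": "user", "content": "What's open-source AI?"}]},
    "preferred_output": [{"role": "assistant", "content": "Models you can use, modify, and share."}],
    "non_preferred_output": [{"role": "assistant", "content": "It means the code is public."}],
}
print(check_preference_example(good))
```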
To train a model to invoke tools, each conversational line also carries a top-level tools array listing the available tools. Assistant messages can include tool_calls instead of content, and tool-role messages carry call results. See function-calling fine-tuning for the end-to-end workflow.
{
  "messages": [
    {"role": "user", "content": "What is the current temperature in San Francisco?"},
    {"role": "assistant", "tool_calls": [
      {"id": "call_abc123", "type": "function", "function": {
        "name": "getCurrentWeather", "arguments": "{\"location\": \"San Francisco\"}"
      }}
    ]},
    {"role": "tool", "content": "{\"temperature\":\"65\",\"unit\":\"fahrenheit\"}"}
  ],
  "tools": [
    {"type": "function", "function": {
      "name": "getCurrentWeather",
      "description": "Get the current weather in a given location",
      "parameters": {
        "type": "object",
        "properties": {
          "location": {"type": "string", "description": "The city and state, e.g. San Francisco, CA."}
        },
        "required": ["location"]
      }
    }}
  ]
}
For preference fine-tuning, the tools field nests inside input:
{
  "input": {
    "messages": [{"role": "user", "content": "What is the current temperature in San Francisco?"}],
    "tools": [
      {"type": "function", "function": {
        "name": "getCurrentWeather",
        "description": "Get the current weather in a given location",
        "parameters": {"type": "object", "properties": {"location": {"type": "string"}}, "required": ["location"]}
      }}
    ]
  },
  "preferred_output": [
    {"role": "assistant", "tool_calls": [
      {"id": "call_abc123", "type": "function", "function": {
        "name": "getCurrentWeather", "arguments": "{\"location\": \"San Francisco\"}"
      }}
    ]}
  ],
  "non_preferred_output": [
    {"role": "assistant", "content": "Sorry, I can't help you with that."}
  ]
}
Reasoning data
For fine-tuning reasoning models, assistant messages support a reasoning or reasoning_content field that carries the chain of thought. See reasoning fine-tuning for the full workflow.
{
  "messages": [
    {"role": "user", "content": "What is the capital of France?"},
    {
      "role": "assistant",
      "reasoning": "France is in Western Europe. Its capital is Paris.",
      "content": "The capital of France is Paris."
    }
  ]
}
When fine-tuning reasoning models on conversational data, only the last assistant message is trained on by default. For multi-turn reasoning, split the conversation so each assistant message is the final message in its own example.
Reasoning models should always be fine-tuned with reasoning data. Training without it can degrade the model’s reasoning ability. If your dataset doesn’t include reasoning, use an instruct model instead.
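The splitting of multi-turn reasoning conversations described above can be done with a small preprocessing step. The sketch below is one possible approach: split_by_assistant_turn is a hypothetical helper, and it keeps earlier assistant messages (including their reasoning fields) unchanged in each prefix, which is an assumption rather than a documented requirement.

```python
def split_by_assistant_turn(example: dict) -> list[dict]:
    """Return one example per assistant message, each ending on that message."""
    messages = example["messages"]
    # Carry over any other top-level keys unchanged.
    extra = {key: value for key, value in example.items() if key != "messages"}
    return [
        {**extra, "messages": messages[: index + 1]}
        for index, message in enumerate(messages)
        if message["role"] == "assistant"
    ]


conversation = {
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "reasoning": "France's capital is Paris.", "content": "Paris."},
        {"role": "user", "content": "And of Germany?"},
        {"role": "assistant", "reasoning": "Germany's capital is Berlin.", "content": "Berlin."},
    ]
}
pieces = split_by_assistant_turn(conversation)
print([len(piece["messages"]) for piece in pieces])
```

Each resulting example would then be written as its own JSONL line.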
For preference fine-tuning, both outputs carry reasoning:
{
  "input": {
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  },
  "preferred_output": [
    {"role": "assistant", "reasoning": "France is in Western Europe. Its capital is Paris.", "content": "The capital of France is Paris."}
  ],
  "non_preferred_output": [
    {"role": "assistant", "reasoning": "Let me think about European capitals.", "content": "The capital of France is Berlin."}
  ]
}
Sample weights
All JSONL formats and training methods support a sample weight: a non-negative floating-point number set with a root-level weight key on each line, applied as a loss multiplier to that sample’s tokens. You can combine sample weights with per-message weights in the same file.
{
  "messages": [
    {"role": "user", "content": "Can you explain machine learning?", "weight": 0},
    {"role": "assistant", "content": "Machine learning is...", "weight": 1}
  ],
  "weight": 0.9
}
{
  "messages": [
    {"role": "user", "content": "Can you explain why?", "weight": 0},
    {"role": "assistant", "content": "I can't", "weight": 1}
  ],
  "weight": 0.1
}
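Sample weights are often assigned programmatically, for example to down-weight one data source relative to another. The following is a minimal sketch of that idea. The SOURCE_WEIGHTS mapping, its values, and the helper names are hypothetical; only the root-level weight key and the non-negative constraint come from the description above.

```python
import json
import math

# Hypothetical per-source weights; choose values that suit your data.
SOURCE_WEIGHTS = {"human": 1.0, "synthetic": 0.5}


def with_sample_weight(example: dict, weight: float) -> dict:
    """Return a copy of the example with a validated root-level weight."""
    if isinstance(weight, bool) or not isinstance(weight, (int, float)):
        raise TypeError("weight must be a number")
    if not math.isfinite(weight) or weight < 0:
        raise ValueError("weight must be a finite, non-negative number")
    return {**example, "weight": float(weight)}


def to_jsonl(examples_by_source: dict) -> str:
    """Serialize examples to JSONL, one weighted example per line."""
    lines = [
        json.dumps(with_sample_weight(example, SOURCE_WEIGHTS[source]))
        for source, examples in examples_by_source.items()
        for example in examples
    ]
    return "\n".join(lines) + "\n"


data = {
    "human": [{"prompt": "2+2?", "completion": "4"}],
    "synthetic": [{"prompt": "3+3?", "completion": "6"}],
}
print(to_jsonl(data), end="")
```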
Tokenized (Parquet) data
Use Parquet when you want to skip tokenization on every job, customize attention masks or labels, or run with a tokenizer that differs from the base model’s. The file must be .parquet and under 100 GB.
Allowed fields:
| Field | Required | Description |
|---|---|---|
| input_ids | Yes | Token IDs fed to the model. |
| attention_mask | Yes | 1 for tokens the model should attend to, 0 for padding. |
| labels | No | Target token IDs. Use -100 to mask a position from the loss. Defaults to input_ids. |
| position_ids | No | Position IDs. Reset to 0 at each example boundary inside a packed sequence and increment by 1. Padding tokens also receive 0. |
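To make the four fields concrete, the sketch below builds one packed row from two already-tokenized examples in plain Python. It is an illustration, not the tokenize_data.py implementation: pack_row is a hypothetical helper, the token IDs are made up, pad_token_id=0 is a placeholder for your tokenizer's padding ID, and masking padding positions with -100 in labels is an assumption about what you would usually want.

```python
IGNORE_INDEX = -100  # label value that masks a position from the loss


def pack_row(examples: list[list[int]], max_seq_length: int, pad_token_id: int = 0) -> dict:
    """Concatenate token lists into one row, then pad to max_seq_length."""
    input_ids, position_ids = [], []
    for tokens in examples:
        input_ids.extend(tokens)
        position_ids.extend(range(len(tokens)))  # restart at 0 for each example

    if len(input_ids) > max_seq_length:
        raise ValueError("examples do not fit in max_seq_length")

    pad = max_seq_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_token_id] * pad,
        "attention_mask": [1] * len(input_ids) + [0] * pad,
        "labels": input_ids + [IGNORE_INDEX] * pad,
        "position_ids": position_ids + [0] * pad,  # padding tokens receive 0
    }


row = pack_row([[11, 12, 13], [21, 22]], max_seq_length=8)
for field, values in row.items():
    print(field, values)
```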
Here’s a worked example with the together-py tokenize_data.py script:
# With packing (recommended)
python tokenize_data.py \
  --tokenizer="NousResearch/Nous-Hermes-2-Mixtral-8x7B-SFT" \
  --max-seq-length=32768 \
  --add-labels \
  --packing \
  --out-filename="processed_packed.parquet"

# Without packing
python tokenize_data.py \
  --tokenizer="NousResearch/Nous-Hermes-2-Mixtral-8x7B-SFT" \
  --max-seq-length=32768 \
  --add-labels \
  --out-filename="processed_padded.parquet"
Loading the resulting Parquet:
from datasets import load_dataset

packed = load_dataset(
    "parquet", data_files={"train": "processed_packed.parquet"}
)
padded = load_dataset(
    "parquet", data_files={"train": "processed_padded.parquet"}
)

print(packed["train"])
print(padded["train"])
Packing concatenates multiple short sequences up to max_seq_length to reduce wasted compute. Padding leaves each example on its own. Packing is enabled by default during training. Pass a padded Parquet file if you need to disable it. The packing flag applies only to JSONL input; it has no effect on Parquet data, which is always used as-is.
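To get a feel for how much compute packing saves, the sketch below compares the share of padding tokens with and without packing for a handful of sequence lengths. It uses a simple in-order greedy grouping, which is an assumption for illustration and not necessarily the strategy used by tokenize_data.py or the training service; the function names and the sample lengths are hypothetical.

```python
def pack_lengths(lengths: list[int], max_seq_length: int) -> list[list[int]]:
    """Group sequence lengths in order, starting a new row when the next one won't fit."""
    rows, current = [], []
    for length in lengths:
        if length > max_seq_length:
            raise ValueError("sequence longer than max_seq_length")
        if current and sum(current) + length > max_seq_length:
            rows.append(current)
            current = []
        current.append(length)
    if current:
        rows.append(current)
    return rows


def padding_fraction(rows: list[list[int]], max_seq_length: int) -> float:
    """Fraction of token slots occupied by padding when every row is padded to max_seq_length."""
    used = sum(sum(row) for row in rows)
    return 1 - used / (len(rows) * max_seq_length)


lengths = [300, 200, 500, 100, 400]
packed_rows = pack_lengths(lengths, max_seq_length=1024)
padded_rows = [[length] for length in lengths]  # one example per row

print("packed rows:", packed_rows)
print("padding with packing:", round(padding_fraction(packed_rows, 1024), 3))
print("padding without packing:", round(padding_fraction(padded_rows, 1024), 3))
```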
Validate and upload
Run a local data validation check before uploading to avoid unnecessary charges. The client-side check verifies the file is UTF-8, each non-empty line parses as JSON, the line count exceeds the minimum, and the file is under the maximum size. Schema-level validation runs on the server when you start the job.
import json

from together import Together
from together.lib.utils import check_file

client = Together()

report = check_file("train.jsonl")
print(json.dumps(report, indent=2))
assert report["is_check_passed"]

train_file = client.files.upload(
    file="train.jsonl",
    purpose="fine-tune",
    check=True,
)
print(train_file.id)
A successful upload returns a file object with an id field. Save the ID; you'll pass it as training_file to client.fine_tuning.create(). See the quickstart for the full fine-tuning lifecycle.
If you upload a file whose contents already exist on Together AI, client.files.upload() raises a ValueError. The error message includes the existing file’s ID so you can reuse it directly. To re-upload, delete the existing file first with client.files.delete(<file-id>).
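For intuition about what the client-side check in this section covers, the sketch below re-implements the four listed checks (UTF-8, JSON per non-empty line, minimum line count, maximum size) using only the standard library. It is not the SDK's check_file: basic_jsonl_check and its report keys are hypothetical, min_lines=1 is a placeholder because the actual minimum is not stated here, and treating 100 GB as 100 * 10**9 bytes is an assumption.

```python
import json
import os
import tempfile


def basic_jsonl_check(path: str, min_lines: int = 1, max_bytes: int = 100 * 10**9) -> dict:
    """Run simple structural checks on a JSONL file and return a report dict."""
    report = {"utf8": True, "json_errors": [], "num_lines": 0}
    report["size_ok"] = os.path.getsize(path) <= max_bytes
    try:
        with open(path, encoding="utf-8") as handle:
            for number, line in enumerate(handle, start=1):
                if not line.strip():
                    continue  # blank lines are ignored
                report["num_lines"] += 1
                try:
                    json.loads(line)
                except json.JSONDecodeError as error:
                    report["json_errors"].append((number, error.msg))
    except UnicodeDecodeError:
        report["utf8"] = False

    report["passed"] = (
        report["utf8"]
        and report["size_ok"]
        and not report["json_errors"]
        and report["num_lines"] >= min_lines
    )
    return report


# Demonstration on two small temporary files.
with tempfile.TemporaryDirectory() as folder:
    good_path = os.path.join(folder, "good.jsonl")
    bad_path = os.path.join(folder, "bad.jsonl")
    with open(good_path, "w", encoding="utf-8") as handle:
        handle.write('{"text": "one"}\n\n{"text": "two"}\n')
    with open(bad_path, "w", encoding="utf-8") as handle:
        handle.write('{"text": "one"}\n{not json}\n')
    good_report = basic_jsonl_check(good_path)
    bad_report = basic_jsonl_check(bad_path)

print(good_report["passed"], bad_report["passed"])
```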
Split into train and validation
To carve a validation set out of a single JSONL file:
# Bash arithmetic is integer-only, so express the split as a percentage.
split_percent=90
total=$(wc -l < train_full.jsonl)
split_lines=$((total * split_percent / 100))
head -n "$split_lines" train_full.jsonl > train.jsonl
tail -n +"$((split_lines + 1))" train_full.jsonl > validation.jsonl
Then pass both files to the job and set n_evals above 0:
job = client.fine_tuning.create(
    training_file="<TRAIN_FILE_ID>",
    validation_file="<VALIDATION_FILE_ID>",
    n_evals=10,
    model="Qwen/Qwen3-8B",
)
The model evaluates against the validation set at the specified intervals. Eval loss appears in the training metrics and (if wandb_api_key is set) on your W&B dashboard.
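A head/tail split keeps the file's original order, so if the data is grouped (for example by source or topic) the validation set may not be representative. The sketch below is a Python alternative that shuffles with a fixed seed before splitting. The shuffle is an addition not described above, and split_jsonl_lines, the seed, and the file contents in the demonstration are hypothetical.

```python
import random


def split_jsonl_lines(lines: list[str], train_fraction: float = 0.9, seed: int = 0):
    """Shuffle non-empty lines reproducibly and split them into train and validation."""
    kept = [line for line in lines if line.strip()]
    random.Random(seed).shuffle(kept)  # fixed seed keeps the split reproducible
    cut = int(len(kept) * train_fraction)
    return kept[:cut], kept[cut:]


# Demonstration with ten in-memory lines instead of a real file.
all_lines = [f'{{"text": "example {index}"}}' for index in range(10)]
train_lines, validation_lines = split_jsonl_lines(all_lines)
print(len(train_lines), len(validation_lines))
```

To apply it to a file, read the lines from train_full.jsonl, call the function, and write each returned list to its own file with one line per example.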