This quickstart walks through a full fine-tuning run on Together AI: preparing a conversational dataset (CoQA), launching a LoRA job on Llama 3.1 8B, monitoring progress, running inference on the result, and comparing it to the base model. For background on what fine-tuning is and when to use it, see the overview.
A runnable notebook version lives on GitHub: fine-tuning guide notebook.
Prerequisites
- Create an account. Sign up at Together AI and generate an API key.
- Set your API key.
export TOGETHER_API_KEY=<your-key>
- Install the libraries.
pip install -U together datasets transformers tqdm
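Before moving on, you can confirm the SDK can reach the API with a quick smoke test. This is a sketch: any serverless chat model works here, and the model name below is only an example.

from together import Together

client = Together()  # Reads TOGETHER_API_KEY from the environment

## Quick smoke test against a serverless chat model
resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=10,
)
print(resp.choices[0].message.content)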
Choose a base model
The first step is picking a model to fine-tune:
- Base models are trained on a wide variety of text and make broad predictions.
- Instruct models are trained on instruction-response pairs and tend to do better on specific tasks.
For a first run, start with an instruction-tuned model:
- Qwen/Qwen3-8B for simpler tasks.
- Qwen/Qwen3-32B for more complex datasets and domains.
See fine-tuning models for the full catalog.
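You can also list model IDs programmatically with the SDK, as in the sketch below; the fine-tuning models page remains the authoritative catalog of which models can be fine-tuned.

from together import Together

client = Together()

## Print the ID of every model visible to your account
for model in client.models.list():
    print(model.id)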
Step 1: Prepare your dataset
Fine-tuning requires data formatted in a specific way. This walkthrough uses a conversational dataset so the fine-tuned model can improve on multi-turn conversations.
Data formats
Together AI supports several data formats:
- Conversational data. A JSON object per line, where each object contains a list of conversation turns under the messages key. Each message must have a role (system, user, or assistant) and content. See conversational data.
{
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "Hello!" },
{ "role": "assistant", "content": "Hi! How can I help you?" }
]
}
- Instruction data. For instruction-based tasks with prompt-completion pairs; a sample line is shown after this list. See instruction data.
- Preference data. For preference-based fine-tuning. See preference data.
- Generic text data. For plain text completion tasks. See generic text data.
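For reference, an instruction-data file is also one JSON object per line. A minimal sample line, assuming the prompt/completion keys described in the instruction data guide:

{ "prompt": "What is the capital of France?", "completion": "The capital of France is Paris." }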
File formats
Together AI supports two file formats:
- JSONL. Simpler and works for most cases.
- Parquet. Stores pre-tokenized data and gives you the flexibility to specify custom attention masks and labels (loss masking).
Default to JSONL unless you need custom tokenization or specific loss masking, in which case use Parquet.
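If you do need Parquet, the sketch below shows one way to pre-tokenize text and write it out. The column names (input_ids, attention_mask, labels) are assumptions here; check the data preparation docs for the exact schema your job expects.

from transformers import AutoTokenizer
import pyarrow as pa
import pyarrow.parquet as pq

## Use the tokenizer that matches your base model (may require Hugging Face access)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")

texts = ["Example training document one.", "Example training document two."]
encoded = tokenizer(texts)

## Assumed schema: labels mirror input_ids; set individual positions to -100 to mask them from the loss
table = pa.table({
    "input_ids": encoded["input_ids"],
    "attention_mask": encoded["attention_mask"],
    "labels": encoded["input_ids"],
})
pq.write_table(table, "pretokenized_train.parquet")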
Example: preparing the CoQA dataset
Here’s an example of transforming the CoQA dataset into the required chat format:
from datasets import load_dataset
## Load the dataset
coqa_dataset = load_dataset("stanfordnlp/coqa")
## The system prompt, if present, must always be at the beginning
system_prompt = (
"Read the story and extract answers for the questions.\nStory: {}"
)
def map_fields(row):
# Create system prompt
messages = [
{"role": "system", "content": system_prompt.format(row["story"])}
]
# Add user and assistant messages
for q, a in zip(row["questions"], row["answers"]["input_text"]):
messages.append({"role": "user", "content": q})
messages.append({"role": "assistant", "content": a})
return {"messages": messages}
## Transform the data using the mapping function
train_messages = coqa_dataset["train"].map(
map_fields,
remove_columns=coqa_dataset["train"].column_names,
)
## Save data to JSON file
train_messages.to_json("coqa_prepared_train.jsonl")
Loss masking
In some cases, you may want the model to predict only a specific part of the prompt:
- With conversational or instruction data, set train_on_inputs (bool or auto) to mask user messages in conversational data or prompts in instruction data.
- With conversational data, assign weights to specific messages to mask them; see the example after this list.
- With pre-tokenized data (Parquet), set the label for specific tokens to -100 to mask them.
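For instance, to keep an assistant message in the context but exclude it from the loss, you can weight it to zero. This is a sketch; the exact weight semantics are described in the conversational data guide.

{
  "messages": [
    { "role": "user", "content": "Hello!" },
    { "role": "assistant", "content": "Hi! How can I help you?", "weight": 0 },
    { "role": "user", "content": "What is the capital of France?" },
    { "role": "assistant", "content": "Paris.", "weight": 1 }
  ]
}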
Check and upload your data
Once your data is prepared, verify it’s correctly formatted and upload it to Together AI:
from together import Together
import os
import json
TOGETHER_API_KEY = os.getenv("TOGETHER_API_KEY")
WANDB_API_KEY = os.getenv(
"WANDB_API_KEY"
) # Optional, for logging fine-tuning to wandb
## Check the file format
from together.utils import check_file
client = Together(api_key=TOGETHER_API_KEY)
sft_report = check_file("coqa_prepared_train.jsonl")
print(json.dumps(sft_report, indent=2))
assert sft_report["is_check_passed"] == True
## Upload the data to Together
train_file_resp = client.files.upload(
"coqa_prepared_train.jsonl", purpose="fine-tune", check=True
)
print(train_file_resp.id) # Save this ID for starting your fine-tuning job
The output from checking the file should look similar to:
{
"is_check_passed": true,
"message": "Checks passed",
"found": true,
"file_size": 23777505,
"utf8": true,
"line_type": true,
"text_field": true,
"key_value": true,
"has_min_samples": true,
"num_samples": 7199,
"load_json": true,
"filetype": "jsonl"
}
Step 2: Start the fine-tuning job
With the data uploaded, launch the fine-tuning job with client.fine_tuning.create().
Key parameters
- model: The base model you want to fine-tune (e.g., 'meta-llama/Meta-Llama-3.1-8B-Instruct-Reference')
- training_file: The ID of your uploaded training JSONL file
- validation_file: Optional ID of a validation file (highly recommended for monitoring)
- suffix: A custom string added to create your unique model name (e.g., 'test1_8b')
- n_epochs: Number of times the model sees the entire dataset
- n_checkpoints: Number of checkpoints to save during training (for resuming or selecting the best model)
- learning_rate: Controls how much model weights are updated
- batch_size: Number of examples processed per iteration (default: "max")
- lora: Set to True for LoRA fine-tuning
- train_on_inputs: Whether to mask user messages or prompts (can be bool or 'auto')
- warmup_ratio: Ratio of steps used for warmup
For the full list of fine-tuning parameters, see the fine-tuning API reference.
LoRA fine-tuning (recommended)
## Using Python - This fine-tuning job should take ~10-15 minutes to complete
ft_resp = client.fine_tuning.create(
training_file=train_file_resp.id,
model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
train_on_inputs="auto",
n_epochs=3,
n_checkpoints=1,
wandb_api_key=WANDB_API_KEY, # Optional, for visualization
lora=True, # Default True
warmup_ratio=0,
learning_rate=1e-5,
suffix="test1_8b",
)
print(ft_resp.id) # Save this job ID for monitoring
Full fine-tuning
For full fine-tuning, set lora to False:
## Using Python
ft_resp = client.fine_tuning.create(
training_file=train_file_resp.id,
model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
train_on_inputs="auto",
n_epochs=3,
n_checkpoints=1,
warmup_ratio=0,
lora=False, # Must be specified as False, defaults to True
learning_rate=1e-5,
suffix="test1_8b_full_finetune",
)
The response includes your job ID, which you’ll use to monitor progress:
ft-d1522ffb-8f3e #fine-tuning job id
Step 3: Monitor the job
Fine-tuning takes time, depending on the model size, dataset size, and hyperparameters. Your job progresses through several states: Pending, Queued, Running, Uploading, and Completed.
You can monitor and manage progress with the following methods:
- List all jobs: client.fine_tuning.list()
- Get the status of a job: client.fine_tuning.retrieve(id=ft_resp.id)
- List all events for a job: client.fine_tuning.list_events(id=ft_resp.id) (retrieves logs and events generated during the job)
- Cancel a job: client.fine_tuning.cancel(id=ft_resp.id)
- Download the fine-tuned model: client.fine_tuning.download(id=ft_resp.id) (v1) or client.fine_tuning.with_streaming_response.content(ft_id=ft_resp.id) (v2)
Once the job is complete (status == 'completed'), the response from retrieve contains the name of your newly created fine-tuned model. It follows the pattern <your-account>/<base-model-name>:<suffix>:<job-id>.
Check status via the API
## Check status of the job
resp = client.fine_tuning.retrieve(ft_resp.id)
print(resp.status)
## This loop will print the logs of the job thus far
for event in client.fine_tuning.list_events(id=ft_resp.id).data:
print(event.message)
Example output:
Fine tune request created
Job started at Thu Apr 3 03:19:46 UTC 2025
Model data downloaded for togethercomputer/Meta-Llama-3.1-8B-Instruct-Reference__TOG__FT at Thu Apr 3 03:19:48 UTC 2025
Data downloaded for togethercomputer/Meta-Llama-3.1-8B-Instruct-Reference__TOG__FT at 2025-04-03T03:19:55.595750
WandB run initialized.
Training started for model togethercomputer/Meta-Llama-3.1-8B-Instruct-Reference__TOG__FT
Epoch completed, at step 24
Epoch completed, at step 48
Epoch completed, at step 72
Training completed for togethercomputer/Meta-Llama-3.1-8B-Instruct-Reference__TOG__FT at Thu Apr 3 03:27:55 UTC 2025
Uploading output model
Compressing output model
Model compression complete
Model upload complete
Job finished at Thu Apr 3 03:31:33 UTC 2025
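If you prefer to block until the job finishes rather than checking manually, a simple polling loop works. This is a sketch: compare the status values against the job states listed above and adjust the interval to taste.

import time

## Poll the job until it reaches a terminal state
while True:
    job = client.fine_tuning.retrieve(ft_resp.id)
    print(f"Status: {job.status}")
    if job.status in ("completed", "error", "cancelled"):
        break
    time.sleep(60)

print(job.output_name)  # The fine-tuned model name, available once completed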
Dashboard monitoring
You can also monitor your job on the Together AI jobs dashboard.
If you provided a Weights & Biases API key, you can view detailed training metrics on the W&B platform, including loss curves and more.
Delete a fine-tuning job
You can also delete a fine-tuning job. This action cannot be undone and destroys all files produced by the job, including intermediate and final checkpoints.
## Run delete
resp = client.fine_tuning.delete(ft_resp.id)
print(resp)
Step 4: Use the fine-tuned model
Once the job completes, your model is available for inference.
Deploy a dedicated endpoint
To run your fine-tuned model, deploy it on a dedicated endpoint:
- Visit your models dashboard.
- Select + Create dedicated endpoint for your fine-tuned model.
- Select a hardware configuration and scaling options, including min and max replicas (max replicas caps the throughput the endpoint can serve), then select Deploy.
You can also deploy programmatically:
response = client.endpoints.create(
display_name="Fine-tuned Meta Llama 3.1 8B Instruct 04-09-25",
model="zainhas/Meta-Llama-3.1-8B-Instruct-Reference-test1_8b-e5a0fb5d",
hardware="4x_nvidia_h100_80gb_sxm",
autoscaling={"min_replicas": 1, "max_replicas": 1},
)
print(response)
Running this code deploys a dedicated endpoint, which bills per minute. For the full API, see the endpoints reference.
Once deployed, query the endpoint:
response = client.chat.completions.create(
model="zainhas/Meta-Llama-3.1-8B-Instruct-Reference-test1_8b-e5a0fb5d-ded38e09",
messages=[{"role": "user", "content": "What is the capital of France?"}],
max_tokens=128,
)
print(response.choices[0].message.content)
Step 5: Evaluate the fine-tuned model
To measure the impact of fine-tuning, compare responses from the fine-tuned model and the base model on the same prompts in a held-out test set.
Use a validation set during training
Pass a validation set when starting your fine-tuning job:
response = client.fine_tuning.create(
training_file="your-training-file-id",
validation_file="your-validation-file-id",
n_evals=10, # Number of times to evaluate on validation set
model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
)
Post-training evaluation example
Here’s a comprehensive example of evaluating models after fine-tuning, using the CoQA dataset:
- First, load a portion of the validation dataset:
coqa_dataset_validation = load_dataset(
"stanfordnlp/coqa",
split="validation[:50]",
)
- Define a function to generate answers from both models:
from tqdm.auto import tqdm
from multiprocessing.pool import ThreadPool
base_model = "Qwen/Qwen3.5-9B" # Original model
finetuned_model = ft_resp.output_name # Fine-tuned model
def get_model_answers(model_name):
"""
Generate model answers for a given model name using a dataset of questions and answers.
Args:
model_name (str): The name of the model to use for generating answers.
Returns:
list: A list of lists, where each inner list contains the answers generated by the model.
"""
model_answers = []
system_prompt = (
"Read the story and extract answers for the questions.\nStory: {}"
)
def get_answers(data):
answers = []
messages = [
{
"role": "system",
"content": system_prompt.format(data["story"]),
}
]
for q, true_answer in zip(
data["questions"],
data["answers"]["input_text"],
):
try:
messages.append({"role": "user", "content": q})
response = client.chat.completions.create(
messages=messages,
model=model_name,
max_tokens=64,
)
                answer = response.choices[0].message.content
                answers.append(answer)
            except Exception:
                answers.append("Invalid Response")
            # Keep the dataset's ground-truth answer in the history so later
            # questions are asked with the same multi-turn context
            messages.append({"role": "assistant", "content": true_answer})
        return answers
# We'll use 8 threads to generate answers faster in parallel
with ThreadPool(8) as pool:
for answers in tqdm(
pool.imap(get_answers, coqa_dataset_validation),
total=len(coqa_dataset_validation),
):
model_answers.append(answers)
return model_answers
- Generate answers from both models:
base_answers = get_model_answers(base_model)
finetuned_answers = get_model_answers(finetuned_model)
- Define a function to calculate evaluation metrics:
import transformers.data.metrics.squad_metrics as squad_metrics
def get_metrics(pred_answers):
"""
Calculate the Exact Match (EM) and F1 metrics for predicted answers.
Args:
pred_answers (list): A list of predicted answers.
Returns:
tuple: A tuple containing EM score and F1 score.
"""
em_metrics = []
f1_metrics = []
for pred, data in tqdm(
zip(pred_answers, coqa_dataset_validation),
total=len(pred_answers),
):
for pred_answer, true_answer in zip(
pred, data["answers"]["input_text"]
):
em_metrics.append(
squad_metrics.compute_exact(true_answer, pred_answer)
)
f1_metrics.append(
squad_metrics.compute_f1(true_answer, pred_answer)
)
return sum(em_metrics) / len(em_metrics), sum(f1_metrics) / len(f1_metrics)
- Calculate and compare metrics:
## Calculate metrics for both models
em_base, f1_base = get_metrics(base_answers)
em_ft, f1_ft = get_metrics(finetuned_answers)
print(f"Base Model - EM: {em_base:.2f}, F1: {f1_base:.2f}")
print(f"Fine-tuned Model - EM: {em_ft:.2f}, F1: {f1_ft:.2f}")
You should get figures similar to the table below:
| Llama 3.1 8B | EM | F1 |
|---|---|---|
| Original | 0.01 | 0.18 |
| Fine-tuned | 0.32 | 0.41 |
The fine-tuned model performs significantly better on the test set, with a large improvement in both Exact Match and F1 scores.
Advanced topics
Continue a fine-tuning job
You can continue training from a previous fine-tuning job:
response = client.fine_tuning.create(
training_file="your-new-training-file-id",
from_checkpoint="previous-finetune-job-id",
wandb_api_key="your-wandb-api-key",
)
You can specify a checkpoint by using:
- The output model name from the previous job
- The fine-tuning job ID
- A specific checkpoint step, in the format ft-...:{STEP_NUM}
To check all available checkpoints for a job:
together fine-tuning list-checkpoints <FT_JOB_ID>
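For example, to resume from a specific checkpoint step of an earlier job (a sketch; the job ID and step number below are only illustrative):

response = client.fine_tuning.create(
    training_file="your-new-training-file-id",
    from_checkpoint="ft-d1522ffb-8f3e:24",  # Hypothetical job ID and checkpoint step
)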
Training and validation split
To split your dataset into training and validation sets:
split_ratio=0.9 # Specify the split ratio for your training set
total_lines=$(wc -l < "your-datafile.jsonl")
## Bash arithmetic is integer-only, so compute the split point with awk
split_lines=$(awk "BEGIN {print int($total_lines * $split_ratio)}")
head -n $split_lines "your-datafile.jsonl" > "your-datafile-train.jsonl"
tail -n +$((split_lines + 1)) "your-datafile.jsonl" > "your-datafile-validation.jsonl"
Use a validation set during training
A validation set is a held-out dataset used to evaluate model performance during training on unseen data. Using a validation set helps monitor for overfitting and assists with hyperparameter tuning.
To use a validation set, provide validation_file and set n_evals to a number above 0:
response = client.fine_tuning.create(
training_file="your-training-file-id",
validation_file="your-validation-file-id",
n_evals=10, # Number of evaluations over the entire job
model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
)
At set intervals during training, the model is evaluated on your validation set and the evaluation loss is recorded in your job event log. If you provide a W&B API key, these losses also appear in the W&B dashboard.
Next steps