Overview

LoRA (Low-Rank Adaptation) enables efficient fine-tuning of large language models by training only a small set of additional parameters while keeping the original model weights frozen. This approach delivers two key advantages:
  • Reduced training costs: Trains fewer parameters than full fine-tuning, using less GPU memory
  • Faster deployment: Produces compact adapter files that can be quickly shared and deployed
Together AI handles the entire LoRA workflow: fine-tune your model and start running inference immediately.
Important: Adapters trained before December 17, 2024, require migration to work with the current serverless infrastructure. As a temporary workaround, you can download and re-upload these adapters following the instructions in our adapter upload guide.

Quick start

This guide demonstrates how to fine-tune a model using LoRA and deploy it for serverless inference. For comprehensive fine-tuning options and best practices, refer to the Fine-Tuning Guide.

Prerequisites
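The steps below assume the Together CLI is installed and your API key is exported in your shell (the same key is used by the curl examples later in this guide). A minimal setup sketch, assuming pip and a bash-like shell:
pip install --upgrade together          # installs the together CLI (and Python SDK)
export TOGETHER_API_KEY="your-api-key"  # placeholder; use your own key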

Step 1: Upload Training Data

First, upload your training dataset to Together AI:
together files upload "your-datafile.jsonl"
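The upload response includes a file ID (file-...) that you pass to the fine-tuning job in the next step. Each line of the JSONL file is one training example; a minimal sketch of a single line in the conversational chat format (the exact schema expected for your chosen model is described in the Fine-Tuning Guide):
{"messages": [{"role": "user", "content": "What is LoRA?"}, {"role": "assistant", "content": "LoRA is a parameter-efficient fine-tuning method."}]}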

Step 2: Create Fine-tuning Job

Launch a LoRA fine-tuning job using the uploaded file ID:
together fine-tuning create \
  --training-file "file-629e58b4-ff73-438c-b2cc-f69542b27980" \
  --model "meta-llama/Meta-Llama-3.1-8B-Instruct-Reference" \
  --lora
Note: If you plan to use a validation set, set the --validation-file and --n-evals (the number of evaluations over the entire job) parameters. --n-evals must be greater than 0 for the validation set to be used; an example follows below.
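For instance, a run with a validation set might look like the following sketch (the validation file ID placeholder and the --n-evals value of 10 are illustrative):
together fine-tuning create \
  --training-file "file-629e58b4-ff73-438c-b2cc-f69542b27980" \
  --validation-file "file-<your-validation-file-id>" \
  --n-evals 10 \
  --model "meta-llama/Meta-Llama-3.1-8B-Instruct-Reference" \
  --lora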

Step 3: Get the Output Model

Once you submit the fine-tuning job, you should see the model output_name and job_id in the response:
{
  "id": "ft-44129430-ac08-4136-9774-aed81e0164a4",
  "training_file": "file-629e58b4-ff73-438c-b2cc-f69542b27980",
  "validation_file": "",
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
  "output_name": "zainhas/Meta-Llama-3.1-8B-Instruct-Reference-my-demo-finetune-4224205a",
  ...
}
You can also see the status of the job and get the model name by navigating to your fine-tuned model under the ‘Models’ or ‘Jobs’ tab in the Together dashboard.
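Alternatively, you can poll the job from the command line; a minimal sketch, assuming the together CLI's fine-tuning retrieve subcommand and the job ID from the response above:
together fine-tuning retrieve "ft-44129430-ac08-4136-9774-aed81e0164a4"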

Step 4: Run LoRA Inference

Once the fine-tuning job is completed, your model is immediately available for inference.
MODEL_NAME_FOR_INFERENCE="zainhas/Meta-Llama-3.1-8B-Instruct-Reference-my-demo-finetune-4224205a"

curl -X POST https://api.together.xyz/v1/chat/completions \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  --data @- <<EOF
{
  "model": "$MODEL_NAME_FOR_INFERENCE",
  "messages": [
    { "role": "user", "content": "debate the pros and cons of AI" }
  ],
  "max_tokens": 128
}
EOF
Expected Response:
{
  "id": "8f2cb236c80ea20e-YYZ",
  "object": "text.completion",
  "created": 1734331375,
  "model": "zainhas/Meta-Llama-3.1-8B-Instruct-Reference-my-demo-finetune-4224205a",
  "prompt": [],
  "choices": [
    {
      "text": "Here's a debate on the pros and cons of AI:\n\n**Moderator:** Welcome to today's debate on the pros and cons of AI. We have two debaters, Alex and Ben, who will present their arguments on the topic. Alex will argue in favor of AI, while Ben will argue against it. Let's begin with opening statements.\n\n**Alex (In Favor of AI):** Thank you, Moderator. AI has revolutionized the way we live and work. It has improved efficiency, productivity, and accuracy in various industries, such as healthcare, finance, and transportation. AI-powered systems can analyze vast amounts of data, identify",
      "finish_reason": "length",
      "seed": 5626645655383684000,
      "logprobs": null,
      "index": 0
    }
  ],
  "usage": {
    "prompt_tokens": 18,
    "completion_tokens": 128,
    "total_tokens": 146,
    "cache_hit_rate": 0
  }
}

Performance Characteristics

Latency Expectations

  • Cold start: Initial requests may experience 5-10 seconds of latency
  • Warm requests: Subsequent queries typically respond under 1 second
  • Optimization tip: Send a warmup query after deployment to minimize cold starts for production traffic (see the example after this list)
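A warmup query only needs to load the adapter, so a tiny throwaway request is enough. A minimal sketch reusing the MODEL_NAME_FOR_INFERENCE variable from Step 4 (the "ping" prompt and max_tokens value are arbitrary):
curl -s https://api.together.xyz/v1/chat/completions \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  --data "{\"model\": \"$MODEL_NAME_FOR_INFERENCE\", \"messages\": [{\"role\": \"user\", \"content\": \"ping\"}], \"max_tokens\": 1}" > /dev/null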

Best Practices

  1. Data Preparation: Ensure your training data follows the correct JSONL format for your chosen model
  2. Validation Sets: Always include validation data to monitor training quality
  3. Model Naming: Use descriptive names for easy identification in production
  4. Warmup Queries: Run test queries immediately after deployment to optimize response times
  5. Monitoring: Track inference metrics through the Together dashboard

Frequently Asked Questions

Which base models support LoRA fine-tuning?

Together AI supports LoRA fine-tuning on a curated selection of high-performance base models. See the complete list for current options.

What are typical inference latencies?

After an initial cold start period (5-10 seconds for the first request), subsequent requests typically achieve sub-second response times. Latency remains consistently low for warm models.

Can I use streaming responses?

Yes, streaming is fully supported. Add "stream": true to your request parameters to receive incremental responses.
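For example, the Step 4 request becomes a streaming request by adding "stream": true; with curl, the -N flag disables output buffering so tokens print as they arrive (the response is delivered incrementally rather than as a single JSON body):
curl -N -X POST https://api.together.xyz/v1/chat/completions \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  --data @- <<EOF
{
  "model": "$MODEL_NAME_FOR_INFERENCE",
  "messages": [
    { "role": "user", "content": "debate the pros and cons of AI" }
  ],
  "max_tokens": 128,
  "stream": true
}
EOF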

How do I migrate pre-December 2024 adapters?

Download your existing adapter files and re-upload them using our adapter upload workflow. We’re working on automated migration for legacy adapters.

What’s the difference between LoRA and full fine-tuning?

LoRA trains only a small set of additional parameters (typically 0.1-1% of the model's size), which means faster training, lower costs, and smaller output files. Full fine-tuning, by contrast, updates all model parameters for maximum customization at a higher computational cost. For the 8B-parameter model used in this guide, 0.1-1% corresponds to roughly 8-80 million trainable parameters.

Next Steps