Overview

LoRA (Low-Rank Adaptation) enables efficient fine-tuning of large language models by training only a small set of additional parameters while keeping the original model weights frozen. This approach delivers two key advantages:
  • Reduced training costs: Trains fewer parameters than full fine-tuning, using less GPU memory
  • Faster deployment: Produces compact adapter files that can be quickly shared and deployed
Together AI handles the entire LoRA workflow: fine-tune your model and start running inference immediately.
Important: Adapters trained before December 17, 2024, require migration to work with the current serverless infrastructure. As a temporary workaround, you can download and re-upload these adapters following the instructions in our adapter upload guide.

Quick start

This guide demonstrates how to fine-tune a model using LoRA and deploy it for serverless inference. For comprehensive fine-tuning options and best practices, refer to the Fine-Tuning Guide.

Prerequisites
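The steps below assume the Together CLI is installed and your API key is exported in your shell (the same key is used by the curl examples later in this guide). A minimal setup sketch, assuming pip and a bash-like shell:
pip install --upgrade together          # installs the together CLI (and Python SDK)
export TOGETHER_API_KEY="your-api-key"  # placeholder; use your own key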

Step 1: Upload Training Data

First, upload your training dataset to Together AI:
together files upload "your-datafile.jsonl"
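The upload response includes a file ID (file-...) that you pass to the fine-tuning job in the next step. Each line of the JSONL file is one training example; a minimal sketch of a single line in the conversational chat format (the exact schema expected for your chosen model is described in the Fine-Tuning Guide):
{"messages": [{"role": "user", "content": "What is LoRA?"}, {"role": "assistant", "content": "LoRA is a parameter-efficient fine-tuning method."}]}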

Step 2: Create Fine-tuning Job

Launch a LoRA fine-tuning job using the uploaded file ID:
together fine-tuning create \
  --training-file "file-629e58b4-ff73-438c-b2cc-f69542b27980" \
  --model "meta-llama/Meta-Llama-3.1-8B-Instruct-Reference" \
  --lora
Note: If you plan to use a validation set, set the --validation-file and --n-evals (the number of evaluations over the entire job) parameters. --n-evals must be greater than 0 for the validation set to be used; an example follows below.
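For instance, a run with a validation set might look like the following sketch (the validation file ID placeholder and the --n-evals value of 10 are illustrative):
together fine-tuning create \
  --training-file "file-629e58b4-ff73-438c-b2cc-f69542b27980" \
  --validation-file "file-<your-validation-file-id>" \
  --n-evals 10 \
  --model "meta-llama/Meta-Llama-3.1-8B-Instruct-Reference" \
  --lora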

Step 3: Get the Output Model

Once you submit the fine-tuning job, you should see the model output_name and job_id in the response:
{
  "id": "ft-44129430-ac08-4136-9774-aed81e0164a4",
  "training_file": "file-629e58b4-ff73-438c-b2cc-f69542b27980",
  "validation_file": "",
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",
  "output_name": "zainhas/Meta-Llama-3.1-8B-Instruct-Reference-my-demo-finetune-4224205a",
  ...
}
You can also see the status of the job and get the model name by navigating to your fine-tuned model under the ‘Models’ or ‘Jobs’ tab in the Together dashboard.
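Alternatively, you can poll the job from the command line; a minimal sketch, assuming the together CLI's fine-tuning retrieve subcommand and the job ID from the response above:
together fine-tuning retrieve "ft-44129430-ac08-4136-9774-aed81e0164a4"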

Step 4: Run LoRA Inference

Once the fine-tuning job is completed, your model is immediately available for inference.
MODEL_NAME_FOR_INFERENCE="zainhas/Meta-Llama-3.1-8B-Instruct-Reference-my-demo-finetune-4224205a"

curl -X POST https://api.together.xyz/v1/chat/completions \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  --data @- <<EOF
{
  "model": "$MODEL_NAME_FOR_INFERENCE",
  "messages": [
    { "role": "user", "content": "debate the pros and cons of AI" }
  ],
  "max_tokens": 128
}
EOF
Expected Response:
{
  "id": "8f2cb236c80ea20e-YYZ",
  "object": "text.completion",
  "created": 1734331375,
  "model": "zainhas/Meta-Llama-3.1-8B-Instruct-Reference-my-demo-finetune-4224205a",
  "prompt": [],
  "choices": [
    {
      "text": "Here's a debate on the pros and cons of AI:\n\n**Moderator:** Welcome to today's debate on the pros and cons of AI. We have two debaters, Alex and Ben, who will present their arguments on the topic. Alex will argue in favor of AI, while Ben will argue against it. Let's begin with opening statements.\n\n**Alex (In Favor of AI):** Thank you, Moderator. AI has revolutionized the way we live and work. It has improved efficiency, productivity, and accuracy in various industries, such as healthcare, finance, and transportation. AI-powered systems can analyze vast amounts of data, identify",
      "finish_reason": "length",
      "seed": 5626645655383684000,
      "logprobs": null,
      "index": 0
    }
  ],
  "usage": {
    "prompt_tokens": 18,
    "completion_tokens": 128,
    "total_tokens": 146,
    "cache_hit_rate": 0
  }
}

Performance Characteristics

Latency Expectations

  • Cold start: Initial requests may experience 5-10 seconds of latency
  • Warm requests: Subsequent queries typically respond under 1 second
  • Optimization tip: Send a warmup query after deployment to minimize cold starts for production traffic (see the example after this list)
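A warmup query only needs to load the adapter, so a tiny throwaway request is enough. A minimal sketch reusing the MODEL_NAME_FOR_INFERENCE variable from Step 4 (the "ping" prompt and max_tokens value are arbitrary):
curl -s https://api.together.xyz/v1/chat/completions \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  --data "{\"model\": \"$MODEL_NAME_FOR_INFERENCE\", \"messages\": [{\"role\": \"user\", \"content\": \"ping\"}], \"max_tokens\": 1}" > /dev/null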

Best Practices

  1. Data Preparation: Ensure your training data follows the correct JSONL format for your chosen model
  2. Validation Sets: Always include validation data to monitor training quality
  3. Model Naming: Use descriptive names for easy identification in production
  4. Warmup Queries: Run test queries immediately after deployment to optimize response times
  5. Monitoring: Track inference metrics through the Together dashboard

Frequently Asked Questions

Which base models support LoRA fine-tuning?

Together AI supports LoRA fine-tuning on a curated selection of high-performance base models. See the complete list for current options.

What are typical inference latencies?

After an initial cold start period (5-10 seconds for the first request), subsequent requests typically achieve sub-second response times. Latency remains consistently low for warm models.

Can I use streaming responses?

Yes, streaming is fully supported. Add "stream": true to your request parameters to receive incremental responses.
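For example, the Step 4 request becomes a streaming request by adding "stream": true; with curl, the -N flag disables output buffering so tokens print as they arrive (the response is delivered incrementally rather than as a single JSON body):
curl -N -X POST https://api.together.xyz/v1/chat/completions \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  --data @- <<EOF
{
  "model": "$MODEL_NAME_FOR_INFERENCE",
  "messages": [
    { "role": "user", "content": "debate the pros and cons of AI" }
  ],
  "max_tokens": 128,
  "stream": true
}
EOF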

How do I migrate pre-December 2024 adapters?

Download your existing adapter files and re-upload them using our adapter upload workflow. We’re working on automated migration for legacy adapters.

What’s the difference between LoRA and full fine-tuning?

LoRA trains only a small set of additional parameters (typically 0.1-1% of the model's size), which means faster training, lower costs, and smaller output files. Full fine-tuning, by contrast, updates all model parameters for maximum customization at a higher computational cost. For the 8B-parameter model used in this guide, 0.1-1% corresponds to roughly 8-80 million trainable parameters.

Next Steps