Overview
LoRA (Low-Rank Adaptation) enables efficient fine-tuning of large language models by training only a small set of additional parameters while keeping the original model weights frozen. This approach delivers several key advantages:
- Reduced training costs: Trains fewer parameters than full fine-tuning, using less GPU memory
- Faster deployment: Produces compact adapter files that can be quickly shared and deployed
Important: Adapters trained before December 17, 2024, require migration to work with the current serverless infrastructure. As a temporary workaround, you can download and re-upload these adapters following the instructions in our adapter upload guide.
Quick start
This guide demonstrates how to fine-tune a model using LoRA and deploy it for serverless inference. For comprehensive fine-tuning options and best practices, refer to the Fine-Tuning Guide.
Prerequisites
- Together AI API key
- Training data in the JSONL format
- Compatible base model selection
Step 1: Upload Training Data
First, upload your training dataset to Together AI:
Step 2: Create Fine-tuning Job
Launch a LoRA fine-tuning job using the uploaded file ID:
Note: If you plan to use a validation set, make sure to set the --validation-file and --n-evals (the number of evaluations over the entire job) parameters. --n-evals needs to be set to a number above 0 in order for your validation set to be used.
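Steps 1 and 2 above can be sketched in Python against Together's HTTP API. Treat this as an illustrative sketch, not a definitive implementation: the endpoint paths, the field names (training_file, validation_file, n_evals), and the lora flag are assumptions here, so verify them against the API reference.

```python
import os
import requests

BASE_URL = "https://api.together.xyz/v1"  # assumed API base URL


def build_finetune_payload(file_id, base_model, validation_file=None, n_evals=0):
    """Build the fine-tune request body (field names are assumptions)."""
    payload = {"training_file": file_id, "model": base_model, "lora": True}
    if validation_file is not None:
        # The --validation-file / --n-evals CLI flags map to these fields;
        # n_evals must be above 0 for the validation set to be used.
        payload["validation_file"] = validation_file
        payload["n_evals"] = n_evals
    return payload


def upload_and_finetune(data_path, base_model):
    headers = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}
    # Step 1: upload the JSONL training file
    with open(data_path, "rb") as f:
        upload = requests.post(f"{BASE_URL}/files/upload", headers=headers,
                               files={"file": f}, data={"purpose": "fine-tune"})
    upload.raise_for_status()
    file_id = upload.json()["id"]
    # Step 2: launch the LoRA fine-tuning job with the uploaded file ID
    job = requests.post(f"{BASE_URL}/fine-tunes", headers=headers,
                        json=build_finetune_payload(file_id, base_model))
    job.raise_for_status()
    return job.json()  # response includes job_id and output_name
```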
Step 3: Getting the output model
Once you submit the fine-tuning job, you should be able to see the model output_name and job_id in the response:
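Polling the job until it finishes and reading its output_name might look like the sketch below; the retrieval endpoint path and the status values are assumptions, so confirm them in the API reference.

```python
import os
import time
import requests

BASE_URL = "https://api.together.xyz/v1"  # assumed API base URL


def job_status_url(job_id):
    """URL for retrieving a fine-tuning job (path is an assumption)."""
    return f"{BASE_URL}/fine-tunes/{job_id}"


def wait_for_job(job_id, poll_seconds=30):
    """Poll until the job reaches a terminal state, then return its record."""
    headers = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}
    while True:
        job = requests.get(job_status_url(job_id), headers=headers).json()
        if job["status"] in ("completed", "error", "cancelled"):
            # job["output_name"] is the model name to use for inference
            return job
        time.sleep(poll_seconds)
```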

Step 4: Running LoRA inference
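A minimal chat-completion request against the fine-tuned model might look like this sketch. The /chat/completions path, the OpenAI-style request fields, and the example model name are all assumptions for illustration.

```python
import os
import requests

BASE_URL = "https://api.together.xyz/v1"  # assumed API base URL


def build_chat_request(model, prompt, max_tokens=256):
    """Build an OpenAI-style chat-completion body (field names assumed)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def run_inference(model, prompt):
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
        json=build_chat_request(model, prompt),
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


# Example usage; pass the output_name returned by your fine-tuning job
# (the name below is a placeholder):
# print(run_inference("my-org/my-base-model-lora", "Hello!"))
```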
Once the fine-tuning job is completed, your model is immediately available for inference.
Performance Characteristics
Latency Expectations
- Cold start: Initial requests may experience 5-10 seconds of latency
- Warm requests: Subsequent queries typically respond under 1 second
- Optimization tip: Send a warmup query after deployment to minimize cold starts for production traffic
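The warmup tip above can be sketched as a small helper that sends a one-token request and reports the latency; the endpoint path and field names are assumptions. The first call pays the cold start, and later calls should come back sub-second.

```python
import os
import time
import requests

BASE_URL = "https://api.together.xyz/v1"  # assumed API base URL


def warmup_payload(model):
    """A minimal one-token request used purely to warm the adapter."""
    return {"model": model,
            "messages": [{"role": "user", "content": "ping"}],
            "max_tokens": 1}


def warmup(model):
    """Send one tiny request and return the round-trip latency in seconds."""
    headers = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}
    start = time.monotonic()
    requests.post(f"{BASE_URL}/chat/completions", headers=headers,
                  json=warmup_payload(model)).raise_for_status()
    return time.monotonic() - start
```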
Best Practices
- Data Preparation: Ensure your training data follows the correct JSONL format for your chosen model
- Validation Sets: Always include validation data to monitor training quality
- Model Naming: Use descriptive names for easy identification in production
- Warmup Queries: Run test queries immediately after deployment to optimize response times
- Monitoring: Track inference metrics through the Together dashboard
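The JSONL format mentioned under Data Preparation can be illustrated as follows: one JSON object per line, here in a chat-style schema. The exact message schema expected by your chosen base model may differ, so treat this as a sketch and verify the format in the data preparation docs.

```python
import json

# One training example per line (JSONL), in an assumed chat-style schema.
examples = [
    {"messages": [
        {"role": "user", "content": "What is LoRA?"},
        {"role": "assistant",
         "content": "A parameter-efficient fine-tuning method."},
    ]},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```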
Frequently Asked Questions
Which base models support LoRA fine-tuning?
Together AI supports LoRA fine-tuning on a curated selection of high-performance base models. See the complete list for current options.
What are typical inference latencies?
After an initial cold start period (5-10 seconds for the first request), subsequent requests typically achieve sub-second response times. Latency remains consistently low for warm models.
Can I use streaming responses?
Yes, streaming is fully supported. Add "stream": true to your request parameters to receive incremental responses.
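A streaming request adds "stream": true and reads the response incrementally. This sketch assumes OpenAI-compatible server-sent-event framing ("data: ..." lines terminated by "data: [DONE]") and an assumed endpoint path; verify both against the API reference.

```python
import json
import os
import requests

BASE_URL = "https://api.together.xyz/v1"  # assumed API base URL


def build_stream_request(model, prompt):
    """Chat-completion body with streaming enabled (field names assumed)."""
    return {"model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": True}


def stream_completion(model, prompt):
    headers = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}
    with requests.post(f"{BASE_URL}/chat/completions", headers=headers,
                       json=build_stream_request(model, prompt),
                       stream=True) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # Each SSE event arrives as a "data: <json>" line.
            if not line or not line.startswith(b"data: "):
                continue
            chunk = line[len(b"data: "):]
            if chunk == b"[DONE]":
                break
            delta = json.loads(chunk)["choices"][0].get("delta", {})
            if "content" in delta:
                print(delta["content"], end="", flush=True)
```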
How do I migrate pre-December 2024 adapters?
Download your existing adapter files and re-upload them using our adapter upload workflow. We’re working on automated migration for legacy adapters.
What’s the difference between LoRA and full fine-tuning?
LoRA trains only a small set of additional parameters (typically 0.1-1% of model size), resulting in faster training, lower costs, and smaller output files, while full fine-tuning updates all model parameters for maximum customization at higher computational cost.
Next Steps
- Explore advanced fine-tuning parameters for optimizing model performance
- Learn about uploading custom adapters trained outside Together AI