Introduction
Vision-Language Models (VLMs) combine the power of language understanding with visual comprehension. Fine-tuning a VLM allows you to adapt it to your specific image+text tasks, such as visual question answering, image captioning, or document understanding. This guide covers the specific steps for VLM fine-tuning. For general fine-tuning concepts, environment setup, and hyperparameter details, refer to the Fine-tuning Guide.

Quick Links
- Dataset Requirements
- Supported Models
- Check and Upload Dataset
- Start a Fine-tuning Job
- Monitor Progress
- Deploy Your Model
VLM Fine-tuning Dataset
Dataset Requirements:

- Format: OpenAI-style `.jsonl` file
- Supported types: Conversational, Instruction, Preferential (see the Fine-tuning Guide for more details on their purpose)
- Images: must be base64 encoded with proper MIME type prefixes; maximum 10 images per example; each image is a maximum of 10MB in size
  - If you have image URLs, please download and encode them in base64 first
  - Supported image formats: PNG, JPEG, WEBP
- Each example is a `messages` array where each message has:
  - `role`: one of `system`, `user`, or `assistant`
  - `content`: an array containing text and image objects, or just text. Only `user` messages can contain images.
Conversational Format
This is what one row/example from the VLM dataset looks like in conversational format (a minimal sketch follows):
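The sketch below builds one conversational row with a base64-encoded image and appends it to a `.jsonl` file. The file paths, text values, and image are placeholders, and the `type`/`image_url` field names follow the OpenAI-style chat format; confirm the exact schema against the dataset documentation.

```python
import base64
import json

# Encode a local image as a base64 data URI with a MIME type prefix (PNG here).
with open("chart.png", "rb") as f:  # placeholder image path
    image_b64 = base64.b64encode(f.read()).decode("utf-8")
image_uri = f"data:image/png;base64,{image_b64}"

# One conversational example: a messages array in which only the user
# message carries image content.
example = {
    "messages": [
        {"role": "system", "content": "You are a helpful visual assistant."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this chart show?"},
                {"type": "image_url", "image_url": {"url": image_uri}},
            ],
        },
        {"role": "assistant", "content": "The chart shows monthly revenue growing steadily through Q3."},
    ]
}

# Each line of the .jsonl dataset file is one such example.
with open("vlm_dataset.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```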
Instruction Format

Instruction-style examples pair a single prompt with a completion instead of a multi-turn conversation.

Preferential Format

Preferential examples are used for preference tuning and contain both a preferred and a non-preferred response for the same input. See the Fine-tuning Guide for the exact field layout of both formats.
Supported Models
The following models support VLM fine-tuning:

| Model | Full Fine-tuning | LoRA Fine-tuning |
|---|---|---|
| Qwen/Qwen3-VL-8B-Instruct | ✅ | ✅ |
| Qwen/Qwen3-VL-32B-Instruct | ✅ | ✅ |
| Qwen/Qwen3-VL-30B-A3B-Instruct | ✅ | ✅ |
| Qwen/Qwen3-VL-235B-A22B-Instruct | ❌ | ✅ |
| meta-llama/Llama-4-Maverick-17B-128E-Instruct-VLM | ❌ | ✅ |
| meta-llama/Llama-4-Scout-17B-16E-Instruct-VLM | ❌ | ✅ |
| google/gemma-3-4b-it-VLM | ✅ | ✅ |
| google/gemma-3-12b-it-VLM | ✅ | ✅ |
| google/gemma-3-27b-it-VLM | ✅ | ✅ |
Check and Upload Dataset
To upload your data, use the CLI or our Python library. You will need the returned file ID (it starts with `file-`) to start your fine-tuning job, so store it somewhere before moving on.
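A minimal sketch using the `together` Python SDK, assuming the package is installed and `TOGETHER_API_KEY` is set; method and field names reflect the SDK at the time of writing, so check the SDK reference if they differ:

```python
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

# Upload the JSONL dataset created above; the response carries the file ID.
resp = client.files.upload(file="vlm_dataset.jsonl")
print(resp.id)  # starts with "file-"; keep it for the fine-tuning job
```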
You’re now ready to kick off your first fine-tuning job!
Starting a Fine-tuning Job
We support both LoRA and full fine-tuning for VLMs. See how to start a fine-tuning job with either method below.

VLM-Specific Parameters
| Parameter | Description | Default |
|---|---|---|
| `--train-vision` / `train_vision` | Enable updates to the VLM’s vision encoder. When false, only language model parameters are updated. | false |
LoRA Fine-tuning (Recommended)
Add the `--train-vision true` param to enable updates to the VLM’s vision encoder as well. By default, only language model params are updated.
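A sketch of the call with the Python SDK. The file ID and model name are placeholders, and `train_vision` mirrors the parameter documented above; confirm the exact argument names against the SDK/CLI reference.

```python
# Start a LoRA fine-tuning job on a LoRA-capable model from the table above.
job = client.fine_tuning.create(
    training_file="file-xxxxxxxx",      # file ID returned by the upload step
    model="Qwen/Qwen3-VL-8B-Instruct",
    lora=True,                          # LoRA fine-tuning
    train_vision=True,                  # also update the vision encoder (default: false)
)
print(job.id)  # keep the job ID for monitoring
```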
Full Fine-tuning
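Full fine-tuning is only available for models marked ✅ in the Full Fine-tuning column above. A sketch of the call under the same assumptions as the LoRA example, with a placeholder file ID and model name:

```python
# Start a full fine-tuning job; identical to the LoRA call but without lora=True.
job = client.fine_tuning.create(
    training_file="file-xxxxxxxx",
    model="google/gemma-3-12b-it-VLM",  # must support full fine-tuning (see table)
    train_vision=True,                  # optional: also update the vision encoder
)
print(job.id)
```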
Monitoring Your Fine-tuning Job
Fine-tuning can take time depending on the model size, dataset size, and hyperparameters. Your job will progress through several states: Pending, Queued, Running, Uploading, and Completed.

Dashboard Monitoring

You can monitor your job on the Together AI jobs dashboard.

Check Status via API
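A sketch reusing `client` and `job` from the earlier steps; method and attribute names may differ slightly between SDK versions.

```python
# Poll the job status; it moves through Pending, Queued, Running,
# Uploading, and Completed.
status = client.fine_tuning.retrieve(job.id)
print(status.status)
```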
Using Your Fine-tuned Model

Once your fine-tuning job completes, your model will be available for use. You can view your fine-tuned models in your models dashboard.
Option 1: Serverless LoRA Inference

If you used LoRA fine-tuning, your model will be instantly available for use without deployment:
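A sketch of a serverless chat completion against the fine-tuned model. The model name is a placeholder; use the output model name shown in your models dashboard, and build `image_uri` the same way as in the dataset section above.

```python
# Query the fine-tuned VLM directly; no endpoint deployment is required for LoRA.
response = client.chat.completions.create(
    model="your-account/Qwen3-VL-8B-Instruct-ft-xxxxxxxx",  # placeholder fine-tuned model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": image_uri}},  # base64 data URI
            ],
        }
    ],
)
print(response.choices[0].message.content)
```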
Option 2: Dedicated Endpoint Deployment

You can also deploy your fine-tuned VLM on a dedicated endpoint for production use:

- Visit your models dashboard
- Find your fine-tuned model and click “+ CREATE DEDICATED ENDPOINT”
- Select your hardware configuration and scaling options
- Click “DEPLOY”