Supported models
| Model | Full fine-tuning | LoRA fine-tuning |
|---|---|---|
Qwen/Qwen3-VL-8B-Instruct | ✅ | ✅ |
Qwen/Qwen3-VL-30B-A3B-Instruct | ✅ | ✅ |
Qwen/Qwen3-VL-235B-A22B-Instruct | ❌ | ✅ |
meta-llama/Llama-4-Maverick-17B-128E-Instruct-VLM | ❌ | ✅ |
meta-llama/Llama-4-Scout-17B-16E-Instruct-VLM | ❌ | ✅ |
google/gemma-3-4b-it-VLM | ✅ | ✅ |
google/gemma-3-12b-it-VLM | ✅ | ✅ |
google/gemma-3-27b-it-VLM | ✅ | ✅ |
Prepare your data
Prepare data in a JSONL file, where each line represents one example, with messages that contain text and images. Constraints include:- Image encoding: Each image is base64-encoded with a MIME type prefix (
data:image/jpeg;base64,...). If your data references URLs, download and encode the images first. - Per-example image limit: 10 images.
- Per-image size: 10 MB.
- Supported formats: PNG, JPEG, WEBP.
user messages can contain images.
Conversational format
Instruction format
Preference format
Convert image URLs to base64
Python
Validate and upload
Upload your data using the Together Python/TypeScript SDK or the Together CLI:Launch the job
By default, fine-tuning only updates language-model parameters. Passtrain_vision=True to also update the vision encoder. The trade-off here is that training the encoder costs more compute and is rarely necessary unless your domain images are extremely dissimilar from the pretraining data.
lora=False. For details on all available parameters, see the API reference.
Watch and deploy
VLM jobs use the same lifecycle as text jobs:- Poll the job with the SDK or CLI. Expect 15 to 60 minutes for a LoRA job on an 8B model with a few thousand examples, and several hours for a full job on a 30B model.
- Deploy the result on a dedicated endpoint.