Vision-language models accept images alongside text and reply in natural language, structured JSON, or tool calls. For the current list of vision-capable models, see the serverless catalog or the dedicated endpoint model catalog.
## Basic example
Pass a `messages` array where the user `content` is a list mixing `text` and `image_url` blocks. The model treats them as a single multimodal prompt and replies with text in `choices[0].message.content`. The example below points the model at an image of a Trello board and asks it to describe the UI in detail; the response streams back token-by-token.
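A minimal sketch of such a request, assuming the `together` Python SDK and a `TOGETHER_API_KEY` in the environment; the model name and image URL below are placeholders, not values from this page:

```python
# Build a messages array mixing text and image_url content blocks.
def build_vision_messages(prompt: str, image_url: str) -> list:
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]

messages = build_vision_messages(
    "Describe this UI in as much detail as possible.",
    "https://example.com/trello-board.png",  # placeholder image URL
)

# With the SDK installed, stream the reply token-by-token:
# from together import Together
# client = Together()
# stream = client.chat.completions.create(
#     model="meta-llama/Llama-Vision-Free",  # any vision-capable model
#     messages=messages,
#     stream=True,
# )
# for chunk in stream:
#     print(chunk.choices[0].delta.content or "", end="")
```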
Sample model output
## Pricing
Vision models bill images as input tokens. Each image is split into a grid of 560-pixel tiles (capped at 2 × 2), and each tile costs 1,601 input tokens. That leaves only four possible image bills:

| Image size (W × H) | Tile grid | Image tokens |
|---|---|---|
| Up to 559 × 559 | 1 × 1 | 1,601 |
| Up to 559 tall, 560 or wider | 1 × 2 | 3,202 |
| 560 or taller, up to 559 wide | 2 × 1 | 3,202 |
| 560 or wider and 560 or taller | 2 × 2 | 6,404 |
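The tiling rule above can be expressed as a small helper; the thresholds and per-tile cost come from the table, while the function name is illustrative:

```python
TOKENS_PER_TILE = 1_601  # per the pricing table
TILE_EDGE = 560          # tile size in pixels; grid capped at 2 × 2

def image_input_tokens(width: int, height: int) -> int:
    """Return the billed input tokens for one image of the given pixel size."""
    cols = 2 if width >= TILE_EDGE else 1   # up to 559 px wide -> one column
    rows = 2 if height >= TILE_EDGE else 1  # up to 559 px tall -> one row
    return cols * rows * TOKENS_PER_TILE

# image_input_tokens(500, 500)   -> 1601  (1 × 1 grid)
# image_input_tokens(1200, 400)  -> 3202  (1 × 2 grid)
# image_input_tokens(900, 900)   -> 6404  (2 × 2 grid)
```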