Basic example
Pass amessages array where the user content is a list mixing text and image_url blocks. The model treats them as a single multimodal prompt and replies with text in choices[0].message.content. The example below points the model at an image of a Trello board and asks it to describe the UI in detail; the response streams back token-by-token.
Sample model output
Sample model output
Pricing
Vision models bill images as input tokens. Each image breaks into a tile grid (capped at 2×2 of 560-pixel tiles) and you pay 1,601 tokens per tile. There are only four possible image bills:| Image size (W × H) | Tile grid | Image tokens |
|---|---|---|
| Up to 559 × 559 | 1 × 1 | 1,601 |
| Up to 559 tall, wider than 560 | 1 × 2 | 3,202 |
| Taller than 560, up to 559 wide | 2 × 1 | 3,202 |
| Wider than 560 and taller than 560 | 2 × 2 | 6,404 |