

You can combine vision input with structured outputs to extract typed data from an image. Pass an image_url content block and a response_format with a JSON schema; the model returns JSON that conforms to the schema. For example, you could extract a project name and a column count from a screenshot of a Trello board:
import json
from together import Together
from pydantic import BaseModel, Field

client = Together()


class ImageDescription(BaseModel):
    project_name: str = Field(
        description="The name of the project shown in the image"
    )
    col_num: int = Field(description="The number of columns in the board")


image_url = "https://napkinsdev.s3.us-east-1.amazonaws.com/next-s3-uploads/d96a3145-472d-423a-8b79-bca3ad7978dd/trello-board.png"

extract = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Extract a JSON object from the image.",
                },
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ],
    model="moonshotai/Kimi-K2.5",
    reasoning={"enabled": False},
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "image_description",
            "schema": ImageDescription.model_json_schema(),
        },
    },
)

print(json.dumps(json.loads(extract.choices[0].message.content), indent=2))
Example output:

{
  "project_name": "Project A",
  "col_num": 4
}
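Because the response is guaranteed to conform to the schema, you can parse it straight back into the Pydantic model for typed attribute access instead of working with a raw dict. A minimal sketch (the `content` string below is a stand-in for `extract.choices[0].message.content` from the call above):

```python
from pydantic import BaseModel, Field


class ImageDescription(BaseModel):
    project_name: str = Field(
        description="The name of the project shown in the image"
    )
    col_num: int = Field(description="The number of columns in the board")


# Stand-in for extract.choices[0].message.content from the call above
content = '{"project_name": "Project A", "col_num": 4}'

# model_validate_json parses and validates in one step (Pydantic v2)
desc = ImageDescription.model_validate_json(content)
print(desc.project_name)  # typed str access
print(desc.col_num)       # typed int access
```

Validation errors (missing fields, wrong types) raise a `pydantic.ValidationError`, which is useful as a safety net even though the API enforces the schema.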
For the full structured-outputs reference, see Structured outputs.
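To see exactly what is sent in `response_format`, you can print the schema that `model_json_schema()` generates; a quick sketch (the exact output varies slightly across Pydantic versions):

```python
import json

from pydantic import BaseModel, Field


class ImageDescription(BaseModel):
    project_name: str = Field(
        description="The name of the project shown in the image"
    )
    col_num: int = Field(description="The number of columns in the board")


# The dict passed as the "schema" value in response_format
schema = ImageDescription.model_json_schema()
print(json.dumps(schema, indent=2))
# Both fields appear under "properties" (with their descriptions)
# and, having no defaults, both are listed in "required"
```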