Beyond a single hosted image URL, vision models accept local files (base64-encoded), video URLs, and multiple images in one prompt. For the basic URL example and supported models, see the Vision overview.

Local images

To query a vision model with a local image, base64-encode the file and pass it as a data URL in an image_url block:
from together import Together
import base64

client = Together()

description_prompt = "What is in the image?"

# Path to the local image; replace with your own file.
image_path = "/home/Desktop/dog.jpeg"


def encode_image(path):
    """Read a file and return its contents as a base64-encoded string."""
    with open(path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


base64_image = encode_image(image_path)

stream = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": description_prompt},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    },
                },
            ],
        }
    ],
    stream=True,
)

for chunk in stream:
    # Some chunks may arrive without choices; skip them.
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)

Output

The image contains two dogs sitting close to each other.
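
The example above hardcodes image/jpeg in the data URL, which only matches JPEG files. The following is a minimal sketch of one way to build the data URL for other formats, assuming the MIME type can be inferred from the file extension with Python's standard mimetypes module. The helper name image_to_data_url is hypothetical and not part of the Together SDK.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(image_path: str) -> str:
    """Return a base64 data URL for a local image file (hypothetical helper)."""
    # Guess the MIME type from the file extension, e.g. ".png" -> "image/png".
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {image_path!r}")
    # Read the raw bytes and base64-encode them as ASCII text.
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be used as the "url" value of the image_url block in place of the f-string shown earlier.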

Video input

Video understanding (passing a video_url content block to a chat completion) is supported on select VLMs that run only as a dedicated endpoint, for example Qwen/Qwen3-VL-8B-Instruct. Spin up a dedicated endpoint, then pass the endpoint name as model and a video_url block alongside text:
from together import Together

client = Together()

response = client.chat.completions.create(
    model="<ACCOUNT>/Qwen/Qwen3-VL-8B-Instruct-<ENDPOINT_HASH>",  # your dedicated endpoint name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's happening in this video?"},
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerFun.mp4"
                    },
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)

For text-to-video and image-to-video generation (separate from video understanding), see Video generation.

Multiple images

To send several images in one prompt, add one image_url block per image to the same content list:
from together import Together

client = Together()

# Multi-modal message with multiple images
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these two images."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png"
                    },
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/slack.png"
                    },
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)

Output

The first image is a collage of multiple identical landscape photos showing a natural scene with rocks, trees, and a stream under a blue sky. The second image is a screenshot of a mobile app interface, specifically the navigation menu of the Canva app, which includes icons for Home, DMs (Direct Messages), Activity, Later, Canvases, and More.

#### Comparison:
1. **Content**:
   - The first image focuses on a natural landscape.
   - The second image shows a digital interface from an app.

2. **Purpose**:
   - The first image could be used for showcasing nature, design elements in graphic work, or as a background.
   - The second image represents the functionality and layout of the Canva app's navigation system.

3. **Visual Style**:
   - The first image has vibrant colors and realistic textures typical of outdoor photography.
   - The second image uses flat design icons with a simple color palette suited for user interface design.

4. **Context**:
   - The first image is likely intended for artistic or environmental contexts.
   - The second image is relevant to digital design and app usability discussions.
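
When the number of images varies, the content list can be assembled programmatically rather than written out by hand. The sketch below mixes hosted URLs and local files in one prompt. The helper names image_block and build_content are hypothetical, and the fallback to image/jpeg when the extension is unrecognized is an assumption of this sketch, not documented API behavior.

```python
import base64
import mimetypes
from pathlib import Path


def image_block(source: str) -> dict:
    """Build one image_url content block from a URL or a local file path."""
    if source.startswith(("http://", "https://", "data:")):
        url = source  # already a URL; pass it through unchanged
    else:
        # Local file: infer the MIME type from the extension and base64-encode.
        mime_type, _ = mimetypes.guess_type(source)
        encoded = base64.b64encode(Path(source).read_bytes()).decode("utf-8")
        # Assumption: fall back to image/jpeg if the type cannot be guessed.
        url = f"data:{mime_type or 'image/jpeg'};base64,{encoded}"
    return {"type": "image_url", "image_url": {"url": url}}


def build_content(prompt: str, sources: list[str]) -> list[dict]:
    """Return a content list: the text prompt, then one block per image."""
    return [{"type": "text", "text": prompt}] + [image_block(s) for s in sources]
```

The result would be passed as the "content" value of the user message, for example messages=[{"role": "user", "content": build_content("Compare these two images.", sources)}].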