Vision input modes

Beyond a single hosted image URL, vision models accept local files (base64-encoded), video URLs, and multiple images in one prompt. For the basic URL example and supported models, see the Vision overview.

Local images

To query a vision model with a local image, base64-encode the file and pass it as a data URL in an image_url block:

Python

from together import Together
import base64

client = Together()

prompt = "what is in the image"

# Path to the local image file to send.
image_path = "/home/Desktop/dog.jpeg"


def encode_image(path):
    """Read a file and return its contents as a base64-encoded string."""
    with open(path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


base64_image = encode_image(image_path)

stream = client.chat.completions.create(
    model="moonshotai/Kimi-K2.6",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    },
                },
            ],
        }
    ],
    stream=True,
)

for chunk in stream:
    # A chunk may arrive with an empty choices list; print nothing for it.
    text = chunk.choices[0].delta.content if chunk.choices else None
    print(text or "", end="", flush=True)
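
If you need the full reply as a single string rather than printing each chunk as it arrives, the streaming loop above can be wrapped in a small helper. This is a minimal sketch: collect_text is a hypothetical name, not part of the Together SDK, and it assumes only the chunk shape used above (chunk.choices[0].delta.content, with choices possibly empty).

```python
def collect_text(stream):
    """Join the text deltas of a streamed chat completion into one string.

    Hypothetical helper: assumes each chunk exposes
    chunk.choices[0].delta.content, as in the loop above.
    """
    parts = []
    for chunk in stream:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        content = getattr(chunk.choices[0].delta, "content", None)
        if content:  # skip None and empty deltas
            parts.append(content)
    return "".join(parts)
```

Call it as collect_text(stream) in place of the printing loop. A stream can only be consumed once, so use one or the other.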

TypeScript

import Together from "together-ai";
import fs from "fs/promises";

const together = new Together();

const getDescriptionPrompt = "what is in the image";

const imagePath = "./dog.jpeg";

async function main() {
  // Read the file and base64-encode it in one step.
  const base64Image = await fs.readFile(imagePath, { encoding: "base64" });

  const stream = await together.chat.completions.create({
    model: "moonshotai/Kimi-K2.6",
    stream: true,
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: getDescriptionPrompt },
          {
            type: "image_url",
            image_url: {
              url: `data:image/jpeg;base64,${base64Image}`,
            },
          },
        ],
      },
    ],
  });

  for await (const chunk of stream) {
    process.stdout.write(chunk.choices[0]?.delta?.content || "");
  }
}

main();

cURL

# Replace <BASE64_IMAGE> with your base64-encoded image data.
curl -X POST "https://api.together.ai/v1/chat/completions" \
     -H "Authorization: Bearer $TOGETHER_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{
       "model": "moonshotai/Kimi-K2.6",
       "messages": [
         {
           "role": "user",
           "content": [
             {
               "type": "text",
               "text": "what is in the image"
             },
             {
               "type": "image_url",
               "image_url": {
                 "url": "data:image/jpeg;base64,<BASE64_IMAGE>"
               }
             }
           ]
         }
       ]
     }'

Output

The image contains two dogs sitting close to each other
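
The examples above hard-code image/jpeg in the data URL, which only matches JPEG files. The following is a minimal sketch of a helper that infers the media type from the file extension using Python's standard mimetypes module. The name to_data_url is hypothetical and not part of the Together SDK, and the sketch does not check which image formats a given model accepts.

```python
import base64
import mimetypes


def to_data_url(image_path):
    """Return a data URL (data:<mime>;base64,<payload>) for a local image.

    Hypothetical helper, not part of the Together SDK.
    """
    # Guess the media type from the file extension, e.g. ".png" -> "image/png".
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image type from {image_path!r}")
    with open(image_path, "rb") as image_file:
        payload = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{payload}"
```

The returned string can be used directly as the url value of an image_url block, for example {"type": "image_url", "image_url": {"url": to_data_url("/home/Desktop/dog.jpeg")}}.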

Video input

Video understanding (passing a video_url content block to a chat completion) is supported on select vision-language models (VLMs) that are available only as dedicated endpoints, for example Qwen/Qwen3-VL-8B-Instruct. Create a dedicated endpoint, then pass the endpoint name as model and include a video_url block alongside the text block:

Python

from together import Together

client = Together()

response = client.chat.completions.create(
    model="<ACCOUNT>/Qwen/Qwen3-VL-8B-Instruct-<ENDPOINT_HASH>",  # your dedicated endpoint name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's happening in this video?"},
                {
                    "type": "video_url",
                    "video_url": {
                        "url": "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerFun.mp4"
                    },
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
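
Because the dedicated endpoint name is specific to your account, it can be convenient to keep it out of the source code. The following is a minimal sketch that builds the same message shape as the request above. Both build_video_message and the TOGETHER_VIDEO_ENDPOINT environment variable are hypothetical names chosen for illustration; the SDK does not read that variable itself.

```python
import os


def build_video_message(prompt, video_url):
    """Build a user message with one text block and one video_url block.

    Hypothetical helper mirroring the request shape shown above.
    """
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "video_url", "video_url": {"url": video_url}},
        ],
    }


# Assumption: you export the endpoint name yourself under this variable.
endpoint_name = os.environ.get("TOGETHER_VIDEO_ENDPOINT", "")

messages = [
    build_video_message(
        "What's happening in this video?",
        "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerFun.mp4",
    )
]
```

The request then becomes client.chat.completions.create(model=endpoint_name, messages=messages).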

For text-to-video and image-to-video generation (separate from video understanding), see Video generation.

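Multiple images

To send several images in one request, add one image_url block per image to the same content array:

Python

from together import Together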

client = Together()

# Multi-modal message with multiple images
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.6",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these two images."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png"
                    },
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/slack.png"
                    },
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)

TypeScript

import Together from "together-ai";

const together = new Together();

// Multi-modal message with multiple images

async function main() {
  const response = await together.chat.completions.create({
    model: "moonshotai/Kimi-K2.6",
    messages: [
      {
        role: "user",
        content: [
          { type: "text", text: "Compare these two images." },
          {
            type: "image_url",
            image_url: {
              url: "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png",
            },
          },
          {
            type: "image_url",
            image_url: {
              url: "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/slack.png",
            },
          },
        ],
      },
    ],
  });
  
  process.stdout.write(response.choices[0]?.message?.content || "");
}

main();

cURL

curl -X POST "https://api.together.ai/v1/chat/completions" \
     -H "Authorization: Bearer $TOGETHER_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{
       "model": "moonshotai/Kimi-K2.6",
       "messages": [
         {
           "role": "user",
           "content": [
             {
               "type": "text",
               "text": "Compare these two images."
             },
             {
               "type": "image_url",
               "image_url": {
                 "url": "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png"
               }
             },
             {
               "type": "image_url",
               "image_url": {
                 "url": "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/slack.png"
               }
             }
           ]
         }
       ]
     }'

Sample model output

The first image is a collage of multiple identical landscape photos showing a natural scene with rocks, trees, and a stream under a blue sky. The second image is a screenshot of a mobile app interface, specifically the navigation menu of the Canva app, which includes icons for Home, DMs (Direct Messages), Activity, Later, Canvases, and More.

#### Comparison:
1. **Content**:
   - The first image focuses on a natural landscape.
   - The second image shows a digital interface from an app.

2. **Purpose**:
   - The first image could be used for showcasing nature, design elements in graphic work, or as a background.
   - The second image represents the functionality and layout of the Canva app's navigation system.

3. **Visual Style**:
   - The first image has vibrant colors and realistic textures typical of outdoor photography.
   - The second image uses flat design icons with a simple color palette suited for user interface design.

4. **Context**:
   - The first image is likely intended for artistic or environmental contexts.
   - The second image is relevant to digital design and app usability discussions.
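
When the number of images varies, the content array can be built programmatically instead of written out by hand. The following is a minimal sketch, assuming the same block shapes as the requests in this section. The name build_image_content is hypothetical, and any per-request image limit is model-specific and not enforced here.

```python
def build_image_content(prompt, image_urls):
    """Return a content list: one text block, then one image_url block per URL.

    Hypothetical helper. Each entry in image_urls may be a hosted URL
    or a base64 data URL, as in the examples on this page.
    """
    if not image_urls:
        raise ValueError("Provide at least one image URL")
    content = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return content


content = build_image_content(
    "Compare these two images.",
    [
        "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png",
        "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/slack.png",
    ],
)
```

Pass the result as the content of a user message: messages=[{"role": "user", "content": content}].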
