Llama 4 Quickstart
How to get the most out of the new Llama 4 models.
Together AI offers day-one support for the new Llama 4 multilingual vision models, which can analyze multiple images and answer questions about them.
Register for a Together AI account to get an API key. New accounts come with free credits to start.
Install the Together AI library for your preferred language.
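The Python SDK is published on PyPI as `together` and the TypeScript SDK on npm as `together-ai`, so `pip install together` or `npm install together-ai` installs the client used in the examples below.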
How to use Llama 4 Models
from together import Together

client = Together()  # API key via api_key param or TOGETHER_API_KEY env var

# Query image with Llama 4 Maverick model
response = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What can you see in this image?"},
            {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png"}}
        ]
    }]
)

print(response.choices[0].message.content)
import Together from "together-ai";

const together = new Together(); // API key via apiKey param or TOGETHER_API_KEY env var

async function main() {
  const response = await together.chat.completions.create({
    model: "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    messages: [{
      role: "user",
      content: [
        { type: "text", text: "What can you see in this image?" },
        { type: "image_url", image_url: { url: "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png" } }
      ]
    }]
  });

  console.log(response.choices[0].message.content);
}

main();
Output
The image depicts a serene landscape of Yosemite National Park, featuring a river flowing through a valley surrounded by towering cliffs and lush greenery.
* **River:**
* The river is calm and peaceful, with clear water that reflects the surrounding scenery.
* It flows gently from the bottom-left corner to the center-right of the image.
* The riverbank is lined with rocks and grasses, adding to the natural beauty of the scene.
* **Cliffs:**
* The cliffs are massive and imposing, rising steeply from the valley floor.
* They are composed of light-colored rock, possibly granite, and feature vertical striations.
* The cliffs are covered in trees and shrubs, which adds to their rugged charm.
* **Trees and Vegetation:**
* The valley is densely forested, with tall trees growing along the riverbanks and on the cliffsides.
* The trees are a mix of evergreen and deciduous species, with some displaying vibrant green foliage.
* Grasses and shrubs grow in the foreground, adding texture and color to the scene.
* **Sky:**
* The sky is a brilliant blue, with only a few white clouds scattered across it.
* The sun appears to be shining from the right side of the image, casting a warm glow over the scene.
In summary, the image presents a breathtaking view of Yosemite National Park, showcasing the natural beauty of the valley and its surroundings. The calm river, towering cliffs, and lush vegetation all contribute to a sense of serenity and wonder.
Llama 4 Model Details
Llama 4 Maverick
- Model String: meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
- Specs:
  - 17B active parameters (400B total)
  - 128-expert MoE architecture
  - 524,288-token context length (will be increased to 1M)
  - Support for 12 languages: Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese
  - Multimodal capabilities (text + images)
  - Supports function calling
- Best for: Enterprise applications, multilingual support, advanced document intelligence
- Knowledge Cutoff: August 2024
Llama 4 Scout
- Model String: meta-llama/Llama-4-Scout-17B-16E-Instruct
- Specs:
  - 17B active parameters (109B total)
  - 16-expert MoE architecture
  - 327,680-token context length (will be increased to 10M)
  - Support for 12 languages: Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese
  - Multimodal capabilities (text + images)
  - Supports function calling
- Best for: Multi-document analysis, codebase reasoning, and personalized tasks
- Knowledge Cutoff: August 2024
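As a minimal sketch of a plain text request (no images) against the Scout model string above, reusing the same Together client as in the earlier examples:

from together import Together

client = Together()  # API key via api_key param or TOGETHER_API_KEY env var

# Text-only request against Llama 4 Scout
response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Give me a one-paragraph overview of Yosemite National Park."}],
)
print(response.choices[0].message.content)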
Function Calling
import os
import json
import openai

# Together's endpoint is OpenAI-compatible, so the openai client can be pointed at it
client = openai.OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g. San Francisco, CA"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"]
                    }
                }
            }
        }
    }
]

messages = [
    {"role": "system", "content": "You are a helpful assistant that can access external functions. The responses from these function calls will be appended to this dialogue. Please provide responses based on the information from these function calls."},
    {"role": "user", "content": "What is the current temperature of New York, San Francisco and Chicago?"}
]

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)

print(json.dumps(response.choices[0].message.model_dump()["tool_calls"], indent=2))
Output
[
{
"id": "call_1p75qwks0etzfy1g6noxvsgs",
"function": {
"arguments": "{\"location\":\"New York, NY\",\"unit\":\"fahrenheit\"}",
"name": "get_current_weather"
},
"type": "function"
},
{
"id": "call_aqjfgn65d0c280fjd3pbzpc6",
"function": {
"arguments": "{\"location\":\"San Francisco, CA\",\"unit\":\"fahrenheit\"}",
"name": "get_current_weather"
},
"type": "function"
},
{
"id": "call_rsg8muko8hymb4brkycu3dm5",
"function": {
"arguments": "{\"location\":\"Chicago, IL\",\"unit\":\"fahrenheit\"}",
"name": "get_current_weather"
},
"type": "function"
}
]
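The system prompt above tells the model that function results will be appended to the dialogue. A minimal sketch of that round trip, assuming a local get_current_weather stub (a real implementation would call a weather service) and that the endpoint accepts OpenAI-style tool messages:

# Stub implementation; a real version would query a weather API
def get_current_weather(location, unit="fahrenheit"):
    return json.dumps({"location": location, "temperature": "72", "unit": unit})

# Append the assistant turn that contains the tool calls, then one "tool" message per call
messages.append(response.choices[0].message.model_dump(exclude_none=True))
for tool_call in response.choices[0].message.tool_calls:
    args = json.loads(tool_call.function.arguments)
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": get_current_weather(**args),
    })

# Ask the model to turn the tool results into a final natural-language answer
final_response = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    messages=messages,
)
print(final_response.choices[0].message.content)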
Query models with multiple images
Currently, this model supports up to 5 images as input.
# Multi-modal message with multiple images
response = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Compare these two images."
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png"
                }
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/slack.png"
                }
            }
        ]
    }]
)

print(response.choices[0].message.content)
Output
The first image is a collage of multiple identical landscape photos showing a natural scene with rocks, trees, and a stream under a blue sky. The second image is a screenshot of a mobile app interface, specifically the navigation menu of the Canva app, which includes icons for Home, DMs (Direct Messages), Activity, Later, Canvases, and More.
### Comparison:
1. **Content**:
- The first image focuses on a natural landscape.
- The second image shows a digital interface from an app.
2. **Purpose**:
- The first image could be used for showcasing nature, design elements in graphic work, or as a background.
- The second image represents the functionality and layout of the Canva app's navigation system.
3. **Visual Style**:
- The first image has vibrant colors and realistic textures typical of outdoor photography.
- The second image uses flat design icons with a simple color palette suited for user interface design.
4. **Context**:
- The first image is likely intended for artistic or environmental contexts.
- The second image is relevant to digital design and app usability discussions.
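Images don't have to be remote URLs. As a sketch, assuming the endpoint also accepts base64-encoded data URLs and that photo.png is a placeholder for a local file:

import base64

# Encode a local image as a data URL (photo.png is a placeholder path)
with open("photo.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What can you see in this image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
        ]
    }]
)
print(response.choices[0].message.content)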
Llama 4 Use-cases
Llama 4 Maverick:
- Instruction following and Long context ICL: Very consistent in following precise instructions with in-context learning across very long contexts
- Multilingual customer support: Process support tickets with screenshots in 12 languages to quickly diagnose technical issues
- Multimodal capabilities: Particularly strong at OCR and chart/graph interpretation
- Agent/tool calling work: Designed for agentic workflows with consistent tool calling capabilities
Llama 4 Scout:
- Summarization: Excels at condensing information effectively
- Function calling: Performs well in executing predefined functions
- Long context ICL recall: Shows strong ability to recall information from long contexts using in-context learning
- Long Context RAG: Serves as a workhorse model for coding flows and RAG (Retrieval-Augmented Generation) applications
- Cost-efficient: Provides good performance as an affordable long-context model
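As an illustration of the summarization and long-context use cases above, here is a minimal sketch with Scout, assuming report.txt is a placeholder for a long local document that fits within the context window:

# Summarize a long document with Llama 4 Scout (report.txt is a placeholder path)
with open("report.txt", "r", encoding="utf-8") as f:
    document = f.read()

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[
        {"role": "system", "content": "You summarize documents into concise bullet points."},
        {"role": "user", "content": f"Summarize the following document:\n\n{document}"}
    ],
)
print(response.choices[0].message.content)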