Chat

Learn how to query our open-source chat models.

You can use Together's APIs to send individual queries or have long-running conversations with chat models. You can also configure a system prompt to customize how a model should respond.

Queries run against a model of your choice. For most use cases, we recommend using Meta Llama 3.

Running a single query

Use chat.completions.create to send a single query to a chat model:

from together import Together

# The client reads your API key from the TOGETHER_API_KEY environment variable.
client = Together()

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct-Turbo",
    messages=[{"role": "user", "content": "What are some fun things to do in New York?"}],
)

print(response.choices[0].message.content)

import Together from "together-ai";

const together = new Together();

const response = await together.chat.completions.create({
  model: "meta-llama/Meta-Llama-3-8B-Instruct-Turbo",
  messages: [{ role: "user", content: "What are some fun things to do in New York?" }],
});

console.log(response.choices[0].message.content);

curl -X POST "https://api.together.xyz/v1/chat/completions" \
     -H "Authorization: Bearer $TOGETHER_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{
     	"model": "meta-llama/Meta-Llama-3-8B-Instruct-Turbo",
     	"messages": [
     		{"role": "user", "content": "What are some fun things to do in New York?"}
     	]
     }'

The create method takes in a model name and a messages array. Each message is an object that has the content of the query, as well as a role for the message's author.

The example above uses "user" for the role. The "user" role tells the model that the message comes from the end user of your system – for example, a customer using your chatbot app.

The other two roles are "assistant" and "system", which we'll talk about next.

Having a long-running conversation

Every query to a chat model is self-contained. This means that new queries won't automatically have access to any queries that may have come before them. This is exactly why the "assistant" role exists.

The "assistant" role is used to provide historical context for how a model has responded to prior queries. This makes it perfect for building apps that have long-running conversations, like chatbots.

To provide a chat history for a new query, include the previous messages in the messages array, ahead of the new query and in the order they occurred. Mark the user-provided queries with the "user" role and the model's responses with the "assistant" role:

from together import Together

client = Together()

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct-Turbo",
    messages=[
      {"role": "user", "content": "What are some fun things to do in New York?"},
      {"role": "assistant", "content": "You could go to the Empire State Building!"},
      {"role": "user", "content": "That sounds fun! Where is it?"},
    ],
)

print(response.choices[0].message.content)

import Together from "together-ai";

const together = new Together();

const response = await together.chat.completions.create({
  model: "meta-llama/Meta-Llama-3-8B-Instruct-Turbo",
  messages: [
    { role: "user", content: "What are some fun things to do in New York?" },
    { role: "assistant", content: "You could go to the Empire State Building!"},
    { role: "user", content: "That sounds fun! Where is it?" },
  ],
});

console.log(response.choices[0].message.content);

curl -X POST "https://api.together.xyz/v1/chat/completions" \
     -H "Authorization: Bearer $TOGETHER_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{
     	"model": "meta-llama/Meta-Llama-3-8B-Instruct-Turbo",
     	"messages": [
     		{"role": "user", "content": "What are some fun things to do in New York?"},
     		{"role": "assistant", "content": "You could go to the Empire State Building!"},
     		{"role": "user", "content": "That sounds fun! Where is it?"}
     	]
     }'

How your app stores historical messages is up to you.
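
One simple option is a plain Python list that you append to after every turn. The sketch below assumes a hypothetical `ask` helper and a `complete` callable that you supply; neither is part of the Together library. In a real app, `complete` would wrap `client.chat.completions.create` and return `response.choices[0].message.content`. Here a stand-in is used so the example runs offline.

```python
from typing import Callable

Message = dict[str, str]

def ask(history: list[Message], query: str,
        complete: Callable[[list[Message]], str]) -> str:
    """Send `query` along with all prior turns, then record both sides.

    `complete` is a hypothetical callable: it receives the messages array
    and returns the model's reply as text.
    """
    history.append({"role": "user", "content": query})
    reply = complete(history)
    history.append({"role": "assistant", "content": reply})
    return reply

# Stand-in for a real model call (assumption, for illustration only).
def fake_complete(messages: list[Message]) -> str:
    return f"(reply to: {messages[-1]['content']})"

history: list[Message] = []
ask(history, "What are some fun things to do in New York?", fake_complete)
ask(history, "That sounds fun! Where is it?", fake_complete)

print([m["role"] for m in history])
# ['user', 'assistant', 'user', 'assistant']
```

Because every query is self-contained, the whole list is sent each time. For long conversations you may want to trim or summarize older turns before sending.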

Customizing how the model responds

While you can query a model just by providing a user message, typically you'll want to give your model some context for how you'd like it to respond. For example, if you're building a chatbot to help your customers with travel plans, you might want to tell your model that it should act like a helpful travel guide.

To do this, provide an initial message that uses the "system" role:

from together import Together

client = Together()

response = client.chat.completions.create(
    model="meta-llama/Llama-3-8b-chat-hf",
    messages=[
      {"role": "system", "content": "You are a helpful travel guide."},
      {"role": "user", "content": "What are some fun things to do in New York?"},
    ],
)

print(response.choices[0].message.content)

import Together from "together-ai";

const together = new Together();

const response = await together.chat.completions.create({
  model: "meta-llama/Llama-3-8b-chat-hf",
  messages: [
    { role: "system", content: "You are a helpful travel guide." },
    { role: "user", content: "What are some fun things to do in New York?" },
  ],
});

console.log(response.choices[0].message.content);

curl -X POST "https://api.together.xyz/v1/chat/completions" \
     -H "Authorization: Bearer $TOGETHER_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{
     	"model": "meta-llama/Llama-3-8b-chat-hf",
     	"messages": [
     		{"role": "system", "content": "You are a helpful travel guide."},
     		{"role": "user", "content": "What are some fun things to do in New York?"}
     	]
     }'
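
If the same system prompt applies to every request, it can help to assemble the messages array in one place. A minimal sketch, assuming a hypothetical `build_messages` helper (not part of the Together library) that puts the system message first, followed by any earlier turns and the new query:

```python
def build_messages(system_prompt, history, query):
    """Return a messages array: system prompt, prior turns, then the new query."""
    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(history)  # earlier "user" / "assistant" turns, oldest first
    messages.append({"role": "user", "content": query})
    return messages

messages = build_messages(
    "You are a helpful travel guide.",
    [],  # no earlier turns yet
    "What are some fun things to do in New York?",
)

print(messages[0]["role"], messages[-1]["role"])
# system user
```

The resulting list can be passed as the `messages` argument of `chat.completions.create`.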

Streaming responses

Since models can take some time to respond to a query, Together's APIs support streaming back responses in chunks. This lets you display results from each chunk while the model is still running, instead of having to wait for the entire response to finish.

To return a stream, set the stream option to true. (If using HTTP, the option name is stream_tokens.)

from together import Together

client = Together()

stream = client.chat.completions.create(
  model="meta-llama/Llama-3-8b-chat-hf",
  messages=[{"role": "user", "content": "What are some fun things to do in New York?"}],
  stream=True,
)

for chunk in stream:
  print(chunk.choices[0].delta.content or "", end="", flush=True)

import Together from 'together-ai';

const together = new Together();

const stream = await together.chat.completions.create({
  model: 'meta-llama/Llama-3-8b-chat-hf',
  messages: [
    { role: 'user', content: 'What are some fun things to do in New York?' },
  ],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || '');
}

curl -X POST "https://api.together.xyz/v1/chat/completions" \
     -H "Authorization: Bearer $TOGETHER_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{
     	"model": "meta-llama/Llama-3-8b-chat-hf",
     	"messages": [
     		{"role": "user", "content": "What are some fun things to do in New York?"}
     	],
     	"stream_tokens": true
     }'
     
# Response will be a stream of Server-Sent Events with JSON-encoded payloads. For example:
# 
# data: {"choices":[{"index":0,"delta":{"content":" A"}}],"id":"85ffbb8a6d2c4340-EWR","token":{"id":330,"text":" A","logprob":1,"special":false},"finish_reason":null,"generated_text":null,"stats":null,"usage":null,"created":1709700707,"object":"chat.completion.chunk"}
# data: {"choices":[{"index":0,"delta":{"content":":"}}],"id":"85ffbb8a6d2c4340-EWR","token":{"id":28747,"text":":","logprob":0,"special":false},"finish_reason":null,"generated_text":null,"stats":null,"usage":null,"created":1709700707,"object":"chat.completion.chunk"}
# data: {"choices":[{"index":0,"delta":{"content":" Sure"}}],"id":"85ffbb8a6d2c4340-EWR","token":{"id":12875,"text":" Sure","logprob":-0.00724411,"special":false},"finish_reason":null,"generated_text":null,"stats":null,"usage":null,"created":1709700707,"object":"chat.completion.chunk"}
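
If you read the HTTP stream yourself rather than through a client library, each event line has to be decoded before its text can be displayed. The sketch below assumes the `data: {...}` line shape shown in the example above. `collect_text` is a hypothetical helper, and the `[DONE]` end-of-stream marker it stops on is an assumption borrowed from OpenAI-style streams, so check what your stream actually sends.

```python
import json

def collect_text(lines):
    """Join the delta content from an iterable of Server-Sent Event lines."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data field
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream marker
            break
        event = json.loads(payload)
        for choice in event.get("choices", []):
            # content may be missing or null on some chunks
            parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts)

# Shortened versions of the sample events above.
sample = [
    'data: {"choices":[{"index":0,"delta":{"content":" A"}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":":"}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":" Sure"}}]}',
    "data: [DONE]",
]

print(repr(collect_text(sample)))
# ' A: Sure'
```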

A note on async support in Python

The default Together client is synchronous: each call blocks until its response arrives, so multiple queries execute one after another in sequence, even if they are independent.

If you have multiple independent calls that you want to run concurrently, you can use our Python library's AsyncTogether client:

import asyncio
import os

from together import AsyncTogether

messages = [
    "What are the top things to do in San Francisco?",
    "What country is Paris in?",
]

async def async_chat_completion(messages):
    # Create one async client and reuse it for every request.
    async_client = AsyncTogether(api_key=os.environ.get("TOGETHER_API_KEY"))
    # Build one coroutine per message; nothing runs until they are awaited.
    tasks = [
        async_client.chat.completions.create(
            model="mistralai/Mixtral-8x7B-Instruct-v0.1",
            messages=[{"role": "user", "content": message}],
        )
        for message in messages
    ]
    # Run all requests concurrently; results come back in input order.
    responses = await asyncio.gather(*tasks)

    for response in responses:
        print(response.choices[0].message.content)

asyncio.run(async_chat_completion(messages))
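
To see why this helps, the sketch below replaces the model call with a hypothetical `fake_query` coroutine that just sleeps, so it runs without network access or an API key. It is an illustration of `asyncio.gather`, not a benchmark of the Together API.

```python
import asyncio
import time

async def fake_query(prompt: str, delay: float = 0.2) -> str:
    """Stand-in for an awaitable model call; sleeps instead of using the network."""
    await asyncio.sleep(delay)
    return f"answer to {prompt!r}"

async def run_concurrently(prompts):
    # gather starts every coroutine and returns the results in input order.
    return await asyncio.gather(*(fake_query(p) for p in prompts))

prompts = ["first", "second", "third"]

start = time.perf_counter()
answers = asyncio.run(run_concurrently(prompts))
elapsed = time.perf_counter() - start

print(answers)
# Elapsed time should be close to one delay, not three delays added together.
print(f"{elapsed:.2f}s for {len(prompts)} queries")
```

The waits overlap, so total time is governed by the slowest call rather than the sum of all calls.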