Streaming tokens

Reference this guide to learn how to enable streaming tokens in the Together API.

Displaying tokens as the model generates them often provides a better user experience, because users start receiving feedback much sooner than if they had to wait for the complete response.

Enable streaming responses

Use the stream_tokens parameter to enable streaming responses.

When stream_tokens is true, the API returns events as it generates the response instead of waiting for the entire response first.

For example, the curl command below sets "stream_tokens" to true in the request payload.

curl 'https://api.together.xyz/inference' -X POST \
  -H 'Authorization: Bearer API_KEY' \
  -H 'Content-Type: application/json' \
  -d @- <<'EOF'
{ 
  "model": "togethercomputer/RedPajama-INCITE-7B-Instruct",
  "prompt": "Alan Turing was",
  "max_tokens": 128,
  "stop":["\n\n"],
  "temperature":0.7,
  "top_p":0.7,
  "top_k":50,
  "repetition_penalty": 1,
  "stream_tokens": true
}
EOF
  • Replace API_KEY with your API key, which you can find in your user settings.

Parse the Server-Sent Events from the API

Instead of a single JSON response, the API returns a series of Server-Sent Events, each with a JSON-encoded payload. For example, the curl command above might produce the following events:

data: {"choices":[{"text":" a"}],"result_type":"language-model-inference","id":"e235a724408a86a5f408f437ca26239ddadf509e9dfdf359a645db08ee9a8682"}
data: {"choices":[{"text":" brilliant"}],"result_type":"language-model-inference","id":"e235a724408a86a5f408f437ca26239ddadf509e9dfdf359a645db08ee9a8682"}
data: {"choices":[{"text":" mathematic"}],"result_type":"language-model-inference","id":"e235a724408a86a5f408f437ca26239ddadf509e9dfdf359a645db08ee9a8682"}

Parse the final message

The final message's data is the literal string [DONE], which is not JSON-encoded. Be sure to check for it before attempting to decode the payload, rather than treating it like the other events.
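
One way to handle this is to check for the sentinel before decoding each payload. A minimal sketch (decode_event is a hypothetical helper name, not part of any package):

import json

def decode_event(data: str):
    """Return the decoded payload, or None once the stream has finished."""
    # The final message is the literal string "[DONE]", not JSON.
    if data == "[DONE]":
        return None
    return json.loads(data)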

Sample code

Putting it all together, the following Python script demonstrates how you might process streaming results using the Python requests and sseclient-py packages.

  1. Install dependencies:
pip install requests sseclient-py
  2. Run the following sample code:
import json
import os

import requests
import sseclient

url = "https://api.together.xyz/inference"
model = "togethercomputer/RedPajama-INCITE-7B-Chat"
prompt = "Tell me a story\n\n"

print(f"Model: {model}")
print(f"Prompt: {repr(prompt)}")
print("Repsonse:")
print()

payload = {
    "model": model,
    "prompt": prompt,
    "max_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.7,
    "top_k": 50,
    "repetition_penalty": 1,
    "stream_tokens": True,
}
headers = {
    "accept": "application/json",
    "content-type": "application/json",
    "Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}",
}

# stream=True keeps the connection open so tokens can be read as they arrive.
response = requests.post(url, json=payload, headers=headers, stream=True)
response.raise_for_status()

client = sseclient.SSEClient(response)
for event in client.events():
    # The final message's data is the literal string "[DONE]", not JSON.
    if event.data == "[DONE]":
        break

    partial_result = json.loads(event.data)
    token = partial_result["choices"][0]["text"]
    print(token, end="", flush=True)
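
Before running the script, set the TOGETHER_API_KEY environment variable to your API key; the script reads it with os.environ. The sseclient-py package takes care of the SSE framing, so the loop only needs to watch for the [DONE] sentinel and decode each JSON payload as it arrives.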