Together AI provides comprehensive text-to-speech capabilities with multiple models and delivery methods. This guide covers everything from basic audio generation to real-time streaming via WebSockets.

Quick Start

Here’s how to get started with basic text-to-speech:
from together import Together

client = Together()

speech_file_path = "speech.mp3"

response = client.audio.speech.create(
    model="canopylabs/orpheus-3b-0.1-ft",
    input="Today is a wonderful day to build something people love!",
    voice="tara",
    response_format="mp3",  # default is wav; request mp3 to match the file name
)

response.stream_to_file(speech_file_path)
This will output a speech.mp3 file.

Available Models

Together AI supports multiple text-to-speech models:
| Organization | Model Name | Model String for API | API Endpoint Support |
| --- | --- | --- | --- |
| Canopy Labs | Orpheus 3B | canopylabs/orpheus-3b-0.1-ft | REST, Streaming, WebSocket |
| Kokoro | Kokoro | hexgrad/Kokoro-82M | REST, Streaming, WebSocket |
| Cartesia | Cartesia Sonic 2 | cartesia/sonic-2 | REST |
| Cartesia | Cartesia Sonic | cartesia/sonic | REST |
  • Orpheus and Kokoro models support real-time WebSocket streaming for the lowest-latency applications.
  • To use Cartesia models, you need to be at Build Tier 2 or higher.

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | The TTS model to use |
| input | string | Yes | The text to generate audio for |
| voice | string | Yes | The voice to use for generation. See the Voices section |
| response_format | string | No | Output format: mp3, wav, or raw (PCM). Default: wav |
For the full set of parameters, refer to the API reference for /audio/speech.
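As an illustration of the required/optional split above, a small payload builder (a hypothetical helper, not part of the SDK) might look like this:

```python
def build_speech_request(
    model: str, input_text: str, voice: str, response_format: str = "wav"
) -> dict:
    """Assemble a /audio/speech request body; model, input, and voice are required."""
    if response_format not in {"mp3", "wav", "raw"}:
        raise ValueError(f"unsupported response_format: {response_format}")
    return {
        "model": model,
        "input": input_text,
        "voice": voice,
        "response_format": response_format,
    }
```

The dict it returns is the JSON body you would POST to the endpoint or pass via the SDK.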

Streaming Audio

For real-time applications where Time-To-First-Byte (TTFB) is critical, use streaming mode:
from together import Together

client = Together()

response = client.audio.speech.create(
    model="canopylabs/orpheus-3b-0.1-ft",
    input="The quick brown fox jumps over the lazy dog",
    voice="tara",
    stream=True,
    response_format="raw",  # Required for streaming
    response_encoding="pcm_s16le",  # 16-bit PCM for clean audio
)

# Save the streamed audio to a file
response.stream_to_file("speech_streaming.wav", response_format="wav")
Streaming Response Format

When stream: true, the API returns a stream of events.

Delta Event:
{
  "type": "delta",
  "audio": "base64-encoded-audio-data"
}
Completion Event:
{
  "type": "done"
}
Note: When streaming is enabled, only raw (PCM) format is supported. For non-streaming, you can use mp3, wav, or raw.
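As a sketch of consuming these events without the SDK — assuming the HTTP stream frames each event as a server-sent "data:" line, which is not specified above — you could decode and accumulate the delta payloads yourself:

```python
import base64
import json


def collect_audio(lines):
    """Accumulate raw PCM bytes from a stream of 'data: {...}' event lines."""
    pcm = bytearray()
    for line in lines:
        if not line.startswith(b"data:"):
            continue  # skip blank lines and keep-alives
        event = json.loads(line[len(b"data:"):])
        if event["type"] == "delta":
            pcm.extend(base64.b64decode(event["audio"]))
        elif event["type"] == "done":
            break
    return bytes(pcm)
```

With requests, you would feed `resp.iter_lines()` from a `stream=True` POST into `collect_audio` and write the result to a .pcm file.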

WebSocket API

For the lowest latency and most interactive applications, use the WebSocket API. This allows you to stream text input and receive audio chunks in real-time.
The WebSocket API is currently only available via raw WebSocket connections; SDK support is coming soon.
Establishing a Connection

Connect to: wss://api.together.xyz/v1/audio/speech/websocket

Authentication:
  • Include your API key as a query parameter: ?api_key=YOUR_API_KEY
  • Or use the Authorization header when establishing the WebSocket connection
Client → Server Messages

1. Append Text to Buffer
{
  "type": "input_text_buffer.append",
  "text": "Hello, this is a test sentence."
}
Appends text to the input buffer. Text is buffered until a sentence completes or the maximum length is reached.

2. Commit Buffer
{
  "type": "input_text_buffer.commit"
}
Forces processing of all buffered text. Use this at the end of your input stream.

3. Clear Buffer
{
  "type": "input_text_buffer.clear"
}
Clears all buffered text without processing (except text already being processed by the model).

4. Update Session Parameters
{
  "type": "tts_session.updated",
  "session": {
    "voice": "new_voice_id"
  }
}
Updates TTS session settings, such as the voice, in real time.

Server → Client Messages

Session Created
{
  "event_id": "uuid-string",
  "type": "session.created",
  "session": {
    "id": "session-uuid",
    "object": "realtime.tts.session",
    "modalities": ["text", "audio"],
    "model": "canopylabs/orpheus-3b-0.1-ft",
    "voice": "tara"
  }
}
Text Received Acknowledgment
{
  "type": "conversation.item.input_text.received",
  "text": "Hello, this is a test sentence."
}
Audio Delta (Streaming Chunks)
{
  "type": "conversation.item.audio_output.delta",
  "item_id": "tts_1",
  "delta": "base64-encoded-audio-chunk"
}
Audio Complete
{
  "type": "conversation.item.audio_output.done",
  "item_id": "tts_1"
}
TTS Error
{
  "type": "conversation.item.tts.failed",
  "error": {
    "message": "Error description",
    "type": "error_type",
    "code": "error_code"
  }
}
WebSocket Example
import asyncio
import websockets
import json
import base64
import os


async def generate_speech():
    api_key = os.environ.get("TOGETHER_API_KEY")
    url = "wss://api.together.xyz/v1/audio/speech/websocket?model=hexgrad/Kokoro-82M&voice=af_alloy"

    headers = {"Authorization": f"Bearer {api_key}"}

    async with websockets.connect(url, additional_headers=headers) as ws:
        # Wait for session created
        session_msg = await ws.recv()
        session_data = json.loads(session_msg)
        print(f"Session created: {session_data['session']['id']}")

        # Send text for TTS
        text_chunks = [
            "Hello, this is a test.",
            "This is the second sentence.",
            "And this is the final one.",
        ]

        async def send_text():
            for chunk in text_chunks:
                await ws.send(
                    json.dumps(
                        {"type": "input_text_buffer.append", "text": chunk}
                    )
                )
                await asyncio.sleep(0.5)  # Simulate typing

            # Commit to process any remaining text
            await ws.send(json.dumps({"type": "input_text_buffer.commit"}))

        async def receive_audio():
            audio_data = bytearray()
            async for message in ws:
                data = json.loads(message)

                if data["type"] == "conversation.item.input_text.received":
                    print(f"Text received: {data['text']}")
                elif data["type"] == "conversation.item.audio_output.delta":
                    # Decode base64 audio chunk
                    audio_chunk = base64.b64decode(data["delta"])
                    audio_data.extend(audio_chunk)
                    print(f"Received audio chunk for item {data['item_id']}")
                elif data["type"] == "conversation.item.audio_output.done":
                    print(
                        f"Audio generation complete for item {data['item_id']}"
                    )
                elif data["type"] == "conversation.item.tts.failed":
                    error = data.get("error", {})
                    print(f"Error: {error.get('message')}")
                    break

            # Save the audio to a file
            with open("output.wav", "wb") as f:
                f.write(audio_data)
            print("Audio saved to output.wav")

        # Run send and receive concurrently
        await asyncio.gather(send_text(), receive_audio())


asyncio.run(generate_speech())
WebSocket Parameters

When establishing a WebSocket connection, you can configure:

| Parameter | Type | Description |
| --- | --- | --- |
| model_id | string | The TTS model to use |
| voice | string | The voice for generation |
| response_format | string | Audio format: mp3, opus, aac, flac, wav, or pcm |
| speed | float | Playback speed (default: 1.0) |
| max_partial_length | integer | Character buffer length before triggering TTS generation |
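The example above passes model and voice in the connection URL's query string; assuming the remaining parameters are accepted the same way (an assumption, not confirmed here), a connection URL could be assembled like this:

```python
from urllib.parse import urlencode

params = {
    "model": "hexgrad/Kokoro-82M",  # listed as model_id in the table
    "voice": "af_alloy",
    "response_format": "pcm",
    "speed": 1.0,
    "max_partial_length": 200,  # flush the buffer after ~200 characters
}
ws_url = "wss://api.together.xyz/v1/audio/speech/websocket?" + urlencode(params)
```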

Output Raw Bytes

To extract raw audio bytes, use the settings below:
import requests
import os

url = "https://api.together.xyz/v1/audio/speech"
api_key = os.environ.get("TOGETHER_API_KEY")

headers = {"Authorization": f"Bearer {api_key}"}

data = {
    "input": "This is a test of raw PCM audio output.",
    "voice": "tara",
    "response_format": "raw",
    "response_encoding": "pcm_f32le",
    "sample_rate": 44100,
    "stream": False,
    "model": "canopylabs/orpheus-3b-0.1-ft",
}

response = requests.post(url, headers=headers, json=data)

with open("output_raw.pcm", "wb") as f:
    f.write(response.content)

print(f"✅ Raw PCM audio saved to output_raw.pcm")
print(f"   Size: {len(response.content)} bytes")
This will output a raw PCM file, output_raw.pcm.
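Raw PCM has no header, so most players need to be told the sample format. A sketch that wraps the pcm_f32le bytes from the request above in a WAV container using only the standard library (the wave module writes integer PCM, so float32 samples are converted to 16-bit first):

```python
import struct
import wave


def f32le_to_wav(pcm_bytes: bytes, path: str, sample_rate: int = 44100,
                 channels: int = 1) -> None:
    """Wrap little-endian float32 PCM in a 16-bit WAV container."""
    n = len(pcm_bytes) // 4
    samples = struct.unpack(f"<{n}f", pcm_bytes[: n * 4])
    # Clamp to [-1, 1] and scale to signed 16-bit integers
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(2)
        w.setframerate(sample_rate)
        w.writeframes(struct.pack(f"<{n}h", *ints))
```

For example, `f32le_to_wav(response.content, "output.wav")` after the request above.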

Response Formats

Together AI supports multiple audio formats:
| Format | Extension | Description | Streaming Support |
| --- | --- | --- | --- |
| wav | .wav | Uncompressed audio (larger file size) | No |
| mp3 | .mp3 | Compressed audio (smaller file size) | No |
| raw | .pcm | Raw PCM audio data | Yes |

Best Practices

Choosing the Right Delivery Method
  • Basic HTTP API: Best for batch processing or when you need complete audio files
  • Streaming HTTP API: Best for real-time applications where TTFB matters
  • WebSocket API: Best for interactive applications requiring lowest latency (chatbots, live assistants)
Performance Tips
  • Use streaming when you need the fastest time-to-first-byte
  • Use WebSocket API for conversational applications
  • Buffer text appropriately - sentence boundaries work best for natural speech
  • Use the max_partial_length parameter in WebSocket to control buffer behavior
  • Consider using raw (PCM) format for lowest latency, then encode client-side if needed
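For instance, a naive way to chunk text at sentence boundaries before appending it to the WebSocket buffer (a regex sketch for illustration, not how the server itself segments text):

```python
import re


def sentence_chunks(text: str) -> list[str]:
    """Split on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
```

Each chunk can then be sent as its own input_text_buffer.append message.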
Voice Selection
  • Test different voices to find the best match for your application
  • Some voices are better suited for specific content types (narration vs conversation)
  • Use the Voices API to discover all available options

Supported Voices

Different models support different voices. Use the Voices API to discover available voices for each model.

Voices API
from together import Together

client = Together()

# List all available voices
response = client.audio.voices.list()

for model_voices in response.data:
    print(f"Model: {model_voices.model}")
    for voice in model_voices.voices:
        print(f"  - Voice: {voice['name']}")
Available Voices

Orpheus Model: Sample voices include:
`tara`
`leah`
`jess`
`leo`
`dan`
`mia`
`zac`
`zoe`
For a complete list, query the /v1/voices endpoint or see the Kokoro voice documentation.

Kokoro Model:
af_heart
af_alloy
af_aoede
af_bella
af_jessica
af_kore
af_nicole
af_nova
af_river
af_sarah
af_sky
am_adam
am_echo
am_eric
am_fenrir
am_liam
am_michael
am_onyx
am_puck
am_santa
bf_alice
bf_emma
bf_isabella
bf_lily
bm_daniel
bm_fable
bm_george
bm_lewis
jf_alpha
jf_gongitsune
jf_nezumi
jf_tebukuro
jm_kumo
zf_xiaobei
zf_xiaoni
zf_xiaoxiao
zf_xiaoyi
zm_yunjian
zm_yunxi
zm_yunxia
zm_yunyang
ef_dora
em_alex
em_santa
ff_siwis
hf_alpha
hf_beta
hm_omega
hm_psi
if_sara
im_nicola
pf_dora
pm_alex
pm_santa
Cartesia Models: All valid voice strings:
'german conversational woman',
'nonfiction man',
'friendly sidekick',
'french conversational lady',
'french narrator lady',
'german reporter woman',
'indian lady',
'british reading lady',
'british narration lady',
'japanese children book',
'japanese woman conversational',
'japanese male conversational',
'reading lady',
'newsman',
'child',
'meditation lady',
'maria',
"1920's radioman",
'newslady',
'calm lady',
'helpful woman',
'mexican woman',
'korean narrator woman',
'russian calm lady',
'russian narrator man 1',
'russian narrator man 2',
'russian narrator woman',
'hinglish speaking lady',
'italian narrator woman',
'polish narrator woman',
'chinese female conversational',
'pilot over intercom',
'chinese commercial man',
'french narrator man',
'spanish narrator man',
'reading man',
'new york man',
'friendly french man',
'barbershop man',
'indian man',
'australian customer support man',
'friendly australian man',
'wise man',
'friendly reading man',
'customer support man',
'dutch confident man',
'dutch man',
'hindi reporter man',
'italian calm man',
'italian narrator man',
'swedish narrator man',
'polish confident man',
'spanish-speaking storyteller man',
'kentucky woman',
'chinese commercial woman',
'middle eastern woman',
'hindi narrator woman',
'sarah',
'sarah curious',
'laidback woman',
'reflective woman',
'helpful french lady',
'pleasant brazilian lady',
'customer support lady',
'british lady',
'wise lady',
'australian narrator lady',
'indian customer support lady',
'swedish calm lady',
'spanish narrator lady',
'salesman',
'yogaman',
'movieman',
'wizardman',
'australian woman',
'korean calm woman',
'friendly german man',
'announcer man',
'wise guide man',
'midwestern man',
'kentucky man',
'brazilian young man',
'chinese call center man',
'german reporter man',
'confident british man',
'southern man',
'classy british man',
'polite man',
'mexican man',
'korean narrator man',
'turkish narrator man',
'turkish calm man',
'hindi calm man',
'hindi narrator man',
'polish narrator man',
'polish young man',
'alabama male',
'australian male',
'anime girl',
'japanese man book',
'sweet lady',
'commercial lady',
'teacher lady',
'princess',
'commercial man',
'asmr lady',
'professional woman',
'tutorial man',
'calm french woman',
'new york woman',
'spanish-speaking lady',
'midwestern woman',
'sportsman',
'storyteller lady',
'spanish-speaking man',
'doctor mischief',
'spanish-speaking reporter man',
'young spanish-speaking woman',
'the merchant',
'stern french man',
'madame mischief',
'german storyteller man',
'female nurse',
'german conversation man',
'friendly brazilian man',
'german woman',
'southern woman',
'british customer support lady',
'chinese woman narrator',
'pleasant man',
'california girl',
'john',
'anna'

Pricing

| Model | Price |
| --- | --- |
| Orpheus 3B | TBD |
| Kokoro | TBD |
| Cartesia Sonic 2 | $65 per 1 million characters |