Together AI provides comprehensive text-to-speech capabilities with multiple models and delivery methods. This guide covers everything from basic audio generation to real-time streaming via WebSockets.
Want to hear it in action? Call (847) 851-4323 to talk to a live voice agent powered by Together AI’s real-time STT and TTS pipeline. Then read the end-to-end guide to build your own.

Quick Start

Here’s how to get started with basic text-to-speech:
from together import Together

client = Together()

speech_file_path = "speech.mp3"

response = client.audio.speech.create(
    model="canopylabs/orpheus-3b-0.1-ft",
    input="Today is a wonderful day to build something people love!",
    voice="tara",
    response_format="mp3",
)

response.stream_to_file(speech_file_path)
This will output a speech.mp3 file.

Available Models

Together AI supports multiple text-to-speech models:
| Organization | Model Name | Model String for API | API Endpoint Support |
| --- | --- | --- | --- |
| Canopy Labs | Orpheus 3B | canopylabs/orpheus-3b-0.1-ft | REST, Streaming, WebSocket |
| Kokoro | Kokoro | hexgrad/Kokoro-82M | REST, Streaming, WebSocket |
| Cartesia | Cartesia Sonic 3 | cartesia/sonic-3 | REST, Streaming, WebSocket |
| Cartesia | Cartesia Sonic 2 | cartesia/sonic-2 | REST, Streaming, WebSocket |
| Cartesia | Cartesia Sonic | cartesia/sonic | REST, Streaming, WebSocket |
| Deepgram | Aura 2 (Dedicated Endpoint only) | deepgram/deepgram-aura-2 | REST, Streaming, WebSocket |
| Rime | Arcana v3 Turbo (Dedicated Endpoint only) | rime-labs/rime-arcana-v3-turbo | REST, Streaming, WebSocket |
| Rime | Arcana v3 (Dedicated Endpoint only) | rime-labs/rime-arcana-v3 | REST, Streaming, WebSocket |
| Rime | Arcana v2 (Dedicated Endpoint only) | rime-labs/rime-arcana-v2 | REST, Streaming, WebSocket |
| Rime | Mist v3 (Beta) (Dedicated Endpoint only) | rime-labs/rime-mist-v3 | REST, Streaming, WebSocket |
| Rime | Mist v2 (Dedicated Endpoint only) | rime-labs/rime-mist-v2 | REST, Streaming, WebSocket |
| Minimax | Speech 2.6 Turbo (Dedicated Endpoint only) | minimax/speech-2.6-turbo | REST, Streaming, WebSocket |
  • Orpheus, Kokoro, and Cartesia models support real-time WebSocket streaming for the lowest-latency applications.
  • To use Cartesia models, you need to be at Build Tier 2 or higher. Cartesia Sonic 2 and Sonic 3 are also available on Dedicated and Reserved Endpoints.
  • Deepgram Aura 2 is available on Dedicated and Reserved Endpoints only.

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| model | string | Yes | The TTS model to use |
| input | string | Yes | The text to generate audio for |
| voice | string | Yes | The voice to use for generation. See the Voices section |
| response_format | string | No | Output format: mp3, wav, raw (PCM), or mulaw (μ-law). The Minimax model also supports opus, aac, and flac. Default: wav |
| sample_rate | integer | No | The sample rate of the output audio in Hz (e.g., 24000, 44100) |
| language | string | No | The language code for speech synthesis (e.g., en, fr, es) |
| alignment | string | No | Controls word-level timestamp generation. Set to word to receive word timestamps, or none to disable (default: none) |
| segment | string | No | Controls how text is segmented before synthesis. Options: sentence (default), immediate, never |
Word alignment (alignment=word) is only supported for streaming requests.
For the full set of parameters, refer to the API reference for /audio/speech.

Streaming Audio

For real-time applications where Time-To-First-Byte (TTFB) is critical, use streaming mode:
from together import Together

client = Together()

response = client.audio.speech.create(
    model="canopylabs/orpheus-3b-0.1-ft",
    input="The quick brown fox jumps over the lazy dog",
    voice="tara",
    stream=True,
    response_format="raw",  # Required for streaming
    response_encoding="pcm_s16le",  # 16-bit PCM for clean audio
)

# Save the streamed audio to a file
response.stream_to_file("speech_streaming.wav", response_format="wav")
Streaming Response Format: When stream: true, the API returns a stream of server-sent events.

Audio chunk:
data: {"type":"conversation.item.audio_output.delta","item_id":"tts_1","delta":"<base64-encoded-audio>"}

Word timestamps (when alignment=word):
data: {"type":"conversation.item.word_timestamps","words":["Hello","world"],"start_seconds":[0.0,0.4],"end_seconds":[0.4,0.8]}

Stream end:
data: [DONE]
Note: When streaming is enabled, only raw (PCM) format is supported. For non-streaming, you can use mp3, wav, or raw.
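As an illustration, these events can be parsed with a few lines of standard-library Python. This sketch collects audio deltas from a sequence of SSE lines until the [DONE] sentinel (the `parse_sse_audio` helper and the sample lines are ours, not part of the SDK):

```python
import base64
import json

def parse_sse_audio(lines):
    """Collect raw audio bytes from a sequence of SSE 'data:' lines.

    Returns the concatenated decoded audio, stopping at the '[DONE]' sentinel.
    """
    audio = bytearray()
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        if event.get("type") == "conversation.item.audio_output.delta":
            audio.extend(base64.b64decode(event["delta"]))
    return bytes(audio)

# Example input using the event shape shown above
sample = [
    'data: {"type":"conversation.item.audio_output.delta","item_id":"tts_1","delta":"'
    + base64.b64encode(b"\x00\x01\x02\x03").decode()
    + '"}',
    "data: [DONE]",
]
print(parse_sse_audio(sample))  # b'\x00\x01\x02\x03'
```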

WebSocket API

For the lowest latency and most interactive applications, use the WebSocket API. This allows you to stream text input and receive audio chunks in real-time.
The WebSocket API is currently only available via raw WebSocket connections. SDK support coming soon.
Establishing a Connection

Connect to: wss://api.together.ai/v1/audio/speech/websocket

Authentication:
  • Include your API key as a query parameter: ?api_key=YOUR_API_KEY
  • Or use the Authorization header when establishing the WebSocket connection
Client → Server Messages

1. Append Text to Buffer
{
  "type": "input_text_buffer.append",
  "text": "Hello, this is a test sentence."
}
Appends text to the input buffer. Text is buffered until a sentence completes or the maximum length is reached.

2. Commit Buffer
{
  "type": "input_text_buffer.commit"
}
Forces processing of all buffered text. Use this at the end of your input stream.

3. Clear Buffer
{
  "type": "input_text_buffer.clear"
}
Clears all buffered text without processing (except text already being processed by the model).

4. Update Session Parameters
{
  "type": "tts_session.updated",
  "session": {
    "voice": "new_voice_id"
  }
}
Updates TTS session settings, such as the voice, in real time.

Server → Client Messages

Session Created
{
  "event_id": "uuid-string",
  "type": "session.created",
  "session": {
    "id": "session-uuid",
    "object": "realtime.tts.session",
    "modalities": ["text", "audio"],
    "model": "canopylabs/orpheus-3b-0.1-ft",
    "voice": "tara"
  }
}
Text Received Acknowledgment
{
  "type": "conversation.item.input_text.received",
  "text": "Hello, this is a test sentence."
}
Audio Delta (Streaming Chunks)
{
  "type": "conversation.item.audio_output.delta",
  "item_id": "tts_1",
  "delta": "base64-encoded-audio-chunk"
}
Audio Complete
{
  "type": "conversation.item.audio_output.done",
  "item_id": "tts_1"
}
Word Timestamps

Sent when alignment=word is set. Contains word-level timing information for the generated audio.
{
  "type": "conversation.item.word_timestamps",
  "item_id": "tts_1",
  "words": ["Hello", "world"],
  "start_seconds": [0.0, 0.4],
  "end_seconds": [0.4, 0.8]
}
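The parallel words, start_seconds, and end_seconds arrays can be zipped into per-word durations, which is handy for captioning or highlighting. A minimal sketch (the `word_durations` helper is ours):

```python
def word_durations(event):
    """Turn a word_timestamps event into (word, start, duration) triples."""
    return [
        (w, s, round(e - s, 3))
        for w, s, e in zip(
            event["words"], event["start_seconds"], event["end_seconds"]
        )
    ]

event = {
    "type": "conversation.item.word_timestamps",
    "words": ["Hello", "world"],
    "start_seconds": [0.0, 0.4],
    "end_seconds": [0.4, 0.8],
}
print(word_durations(event))  # [('Hello', 0.0, 0.4), ('world', 0.4, 0.4)]
```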
TTS Error
{
  "type": "conversation.item.tts.failed",
  "error": {
    "message": "Error description",
    "type": "error_type",
    "code": "error_code"
  }
}
WebSocket Example
import asyncio
import aiohttp
import json
import base64
import os


async def generate_speech():
    api_key = os.environ.get("TOGETHER_API_KEY")
    url = (
        "wss://api.together.ai/v1/audio/speech"
        "/websocket?model=hexgrad/Kokoro-82M"
        "&voice=af_alloy"
        "&response_format=pcm"
        "&sample_rate=24000"
    )

    headers = {"Authorization": f"Bearer {api_key}"}

    text_chunks = [
        "Hello, this is a test.",
        "This is the second sentence.",
        "And this is the final one.",
    ]

    audio_chunks = []

    async with aiohttp.ClientSession(headers=headers) as session:
        async with session.ws_connect(url) as ws:
            # Wait for session.created
            msg = await ws.receive()
            session_data = json.loads(msg.data)
            print(f"Session created: {session_data['session']['id']}")

            async def send_text():
                for chunk in text_chunks:
                    await ws.send_json(
                        {
                            "type": "input_text_buffer.append",
                            "text": chunk,
                        }
                    )
                    print(f"Sent: {chunk}")
                    await asyncio.sleep(0.5)
                await ws.send_json({"type": "input_text_buffer.commit"})
                print("Committed")

            async def receive_audio():
                async for msg in ws:
                    if msg.type == aiohttp.WSMsgType.TEXT:
                        data = json.loads(msg.data)
                        mtype = data.get("type", "")

                        if mtype == "conversation.item.audio_output.delta":
                            chunk = base64.b64decode(data.get("delta", ""))
                            audio_chunks.append(chunk)

                        elif mtype == "conversation.item.word_timestamps":
                            words = data.get("words", [])
                            starts = data.get("start_seconds", [])
                            stamps = list(
                                zip(words, [f"{s:.2f}s" for s in starts])
                            )
                            print(f"  timestamps: {stamps}")

                        elif mtype in (
                            "error",
                            "conversation.item.tts.failed",
                        ):
                            err = data.get(
                                "error",
                                data.get("message"),
                            )
                            print(f"Error: {err}")
                            return

                    elif msg.type in (
                        aiohttp.WSMsgType.CLOSE,
                        aiohttp.WSMsgType.CLOSED,
                    ):
                        break

            send_task = asyncio.create_task(send_text())
            recv_task = asyncio.create_task(receive_audio())

            await send_task

            # Wait up to 10s for audio to stop arriving
            loop = asyncio.get_running_loop()
            deadline = loop.time() + 10
            while loop.time() < deadline:
                await asyncio.sleep(0.1)
                n = len(audio_chunks)
                await asyncio.sleep(0.3)
                if len(audio_chunks) == n:
                    break

            recv_task.cancel()
            try:
                await recv_task
            except asyncio.CancelledError:
                pass

    if audio_chunks:
        pcm = b"".join(audio_chunks)
        with open("output.pcm", "wb") as f:
            f.write(pcm)
        print(
            f"\nAudio saved to output.pcm ({len(pcm):,} bytes, "
            f"{len(pcm)/48000:.1f}s at 24kHz)"
        )
        print("Play with: ffplay -f s16le -ar 24000 output.pcm")
    else:
        print("No audio received")


asyncio.run(generate_speech())
WebSocket Parameters

When establishing a WebSocket connection, you can configure:
| Parameter | Type | Description |
| --- | --- | --- |
| model | string | The TTS model to use |
| voice | string | The voice for generation |
| response_format | string | Audio format: mp3, opus, aac, flac, wav, or pcm |
| speed | float | Playback speed (default: 1.0) |
| max_partial_length | integer | Character buffer length before triggering TTS generation |
| sample_rate | integer | The sample rate of the output audio in Hz (e.g., 24000, 44100) |
| language | string | The language code for speech synthesis (e.g., en, fr, es) |
| alignment | string | Controls word-level timestamp generation. Set to word to receive conversation.item.word_timestamps events, or none to disable (default: none) |
| segment | string | Controls how text is segmented before synthesis. Options: sentence (default) splits on sentence boundaries, immediate processes text as soon as it arrives, never waits until the buffer is committed |
You can pass these query parameters either in the WebSocket URL (e.g., wss://api.together.ai/v1/audio/speech/websocket?model=hexgrad/Kokoro-82M&voice=af_alloy&sample_rate=24000&alignment=word) or dynamically via the tts_session.updated event after the connection is established.
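As a convenience, the query string can be assembled with the standard library's urllib.parse (the `build_tts_ws_url` helper is ours; note that urlencode percent-escapes the slash in the model name, which servers generally accept as equivalent to a literal slash):

```python
from urllib.parse import urlencode

def build_tts_ws_url(model, voice, **params):
    """Assemble the TTS WebSocket URL with query parameters."""
    base = "wss://api.together.ai/v1/audio/speech/websocket"
    query = {"model": model, "voice": voice, **params}
    return f"{base}?{urlencode(query)}"

url = build_tts_ws_url(
    "hexgrad/Kokoro-82M", "af_alloy", sample_rate=24000, alignment="word"
)
print(url)
```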

Output Raw Bytes

If you want to extract raw audio bytes, use the settings below:
import requests
import os

url = "https://api.together.ai/v1/audio/speech"
api_key = os.environ.get("TOGETHER_API_KEY")

headers = {"Authorization": f"Bearer {api_key}"}

data = {
    "input": "This is a test of raw PCM audio output.",
    "voice": "tara",
    "response_format": "raw",
    "response_encoding": "pcm_f32le",
    "sample_rate": 44100,
    "stream": False,
    "model": "canopylabs/orpheus-3b-0.1-ft",
}

response = requests.post(url, headers=headers, json=data)

with open("output_raw.pcm", "wb") as f:
    f.write(response.content)

print(f"✅ Raw PCM audio saved to output_raw.pcm")
print(f"   Size: {len(response.content)} bytes")
This will output a raw PCM file, output_raw.pcm.
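Because the output is headerless float32 PCM, players need the format spelled out before they can play it. Here is a minimal standard-library sketch that converts pcm_f32le bytes into a playable 16-bit WAV file; it assumes a little-endian host, mono audio, and samples in [-1, 1] (the `pcm_f32le_to_wav` helper is ours):

```python
import array
import wave

def pcm_f32le_to_wav(pcm_bytes, path, sample_rate=44100, channels=1):
    """Convert little-endian float32 PCM (range [-1, 1]) to a 16-bit WAV file."""
    floats = array.array("f")  # assumes a little-endian host
    floats.frombytes(pcm_bytes)
    # Clamp and scale each float sample to a signed 16-bit integer.
    ints = array.array(
        "h", (int(max(-1.0, min(1.0, s)) * 32767) for s in floats)
    )
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(ints.tobytes())

# Hypothetical usage with the file written above:
# with open("output_raw.pcm", "rb") as f:
#     pcm_f32le_to_wav(f.read(), "output.wav", sample_rate=44100)
```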

Response Formats

Together AI supports multiple audio formats:
| Format | Extension | Description | Streaming Support |
| --- | --- | --- | --- |
| wav | .wav | Uncompressed audio (larger file size) | No |
| mp3 | .mp3 | Compressed audio (smaller file size) | No |
| raw | .pcm | Raw PCM audio data | Yes |
| mulaw | .ulaw | Uses logarithmic compression to optimize speech quality for telephony | Yes |
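For telephony clients that receive mulaw audio, decoding back to linear 16-bit PCM follows the standard G.711 μ-law expansion rule. A minimal pure-Python sketch, useful since the stdlib audioop module was removed in Python 3.13; a vectorized or lookup-table decoder would be preferable for real workloads:

```python
def mulaw_decode_byte(b):
    """Decode one G.711 mu-law byte to a signed 16-bit PCM sample."""
    b = ~b & 0xFF                  # mu-law bytes are stored bit-complemented
    sign = b & 0x80
    exponent = (b >> 4) & 0x07
    mantissa = b & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample

def mulaw_decode(data):
    """Decode a mu-law byte string to a list of 16-bit PCM samples."""
    return [mulaw_decode_byte(b) for b in data]

# 0xFF encodes silence; 0x00 encodes the most negative sample.
print(mulaw_decode(b"\xff\x00"))  # [0, -32124]
```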

Best Practices

Choosing the Right Delivery Method
  • Basic HTTP API: Best for batch processing or when you need complete audio files
  • Streaming HTTP API: Best for real-time applications where TTFB matters
  • WebSocket API: Best for interactive applications requiring lowest latency (chatbots, live assistants)
Performance Tips
  • Use streaming when you need the fastest time-to-first-byte
  • Use WebSocket API for conversational applications
  • Buffer text appropriately - sentence boundaries work best for natural speech
  • Use the max_partial_length parameter in WebSocket to control buffer behavior
  • Consider using raw (PCM) format for lowest latency, then encode client-side if needed
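The sentence-boundary advice above can be sketched as a tiny client-side buffer that accumulates streamed text and releases only complete sentences, e.g. before forwarding them via input_text_buffer.append. This `SentenceBuffer` class is our own simplified stand-in for the server's sentence segmentation, with a naive regex boundary rule:

```python
import re

class SentenceBuffer:
    """Accumulate streamed text and release it on sentence boundaries."""

    def __init__(self):
        self._buf = ""

    def feed(self, text):
        """Add text; return a list of complete sentences now ready to send."""
        self._buf += text
        # Split after terminal punctuation followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", self._buf)
        self._buf = parts.pop()  # last piece may be an incomplete sentence
        return parts

    def flush(self):
        """Return whatever remains (call before committing the buffer)."""
        out, self._buf = self._buf, ""
        return out

buf = SentenceBuffer()
print(buf.feed("Hello there. How are"))  # ['Hello there.']
print(buf.feed(" you? I'm fine."))       # ['How are you?']
print(buf.flush())                       # I'm fine.
```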
Voice Selection
  • Test different voices to find the best match for your application
  • Some voices are better suited for specific content types (narration vs conversation)
  • Use the Voices API to discover all available options

Supported Voices

Some of the supported voices for each model are shown below. For the full list of available voices, query the /v1/voices endpoint.

Voices API
from together import Together

client = Together()

# List all available voices
response = client.audio.voices.list()

for model_voices in response.data:
    print(f"Model: {model_voices.model}")
    for voice in model_voices.voices:
        print(f"  - Voice: {voice['name']}")
Available Voices

Orpheus Model: Sample voices include:
`tara`
`leah`
`jess`
`leo`
`dan`
`mia`
`zac`
`zoe`
Kokoro Model:
af_heart
af_alloy
af_aoede
af_bella
af_jessica
af_kore
af_nicole
af_nova
af_river
af_sarah
af_sky
am_adam
am_echo
am_eric
am_fenrir
am_liam
am_michael
am_onyx
am_puck
am_santa
bf_alice
bf_emma
bf_isabella
bf_lily
bm_daniel
bm_fable
bm_george
bm_lewis
jf_alpha
jf_gongitsune
jf_nezumi
jf_tebukuro
jm_kumo
zf_xiaobei
zf_xiaoni
zf_xiaoxiao
zf_xiaoyi
zm_yunjian
zm_yunxi
zm_yunxia
zm_yunyang
ef_dora
em_alex
em_santa
ff_siwis
hf_alpha
hf_beta
hm_omega
hm_psi
if_sara
im_nicola
pf_dora
pm_alex
pm_santa
Cartesia Models: All valid voice model strings:
'german conversational woman',
'nonfiction man',
'friendly sidekick',
'french conversational lady',
'french narrator lady',
'german reporter woman',
'indian lady',
'british reading lady',
'british narration lady',
'japanese children book',
'japanese woman conversational',
'japanese male conversational',
'reading lady',
'newsman',
'child',
'meditation lady',
'maria',
"1920's radioman",
'newslady',
'calm lady',
'helpful woman',
'mexican woman',
'korean narrator woman',
'russian calm lady',
'russian narrator man 1',
'russian narrator man 2',
'russian narrator woman',
'hinglish speaking lady',
'italian narrator woman',
'polish narrator woman',
'chinese female conversational',
'pilot over intercom',
'chinese commercial man',
'french narrator man',
'spanish narrator man',
'reading man',
'new york man',
'friendly french man',
'barbershop man',
'indian man',
'australian customer support man',
'friendly australian man',
'wise man',
'friendly reading man',
'customer support man',
'dutch confident man',
'dutch man',
'hindi reporter man',
'italian calm man',
'italian narrator man',
'swedish narrator man',
'polish confident man',
'spanish-speaking storyteller man',
'kentucky woman',
'chinese commercial woman',
'middle eastern woman',
'hindi narrator woman',
'sarah',
'sarah curious',
'laidback woman',
'reflective woman',
'helpful french lady',
'pleasant brazilian lady',
'customer support lady',
'british lady',
'wise lady',
'australian narrator lady',
'indian customer support lady',
'swedish calm lady',
'spanish narrator lady',
'salesman',
'yogaman',
'movieman',
'wizardman',
'australian woman',
'korean calm woman',
'friendly german man',
'announcer man',
'wise guide man',
'midwestern man',
'kentucky man',
'brazilian young man',
'chinese call center man',
'german reporter man',
'confident british man',
'southern man',
'classy british man',
'polite man',
'mexican man',
'korean narrator man',
'turkish narrator man',
'turkish calm man',
'hindi calm man',
'hindi narrator man',
'polish narrator man',
'polish young man',
'alabama male',
'australian male',
'anime girl',
'japanese man book',
'sweet lady',
'commercial lady',
'teacher lady',
'princess',
'commercial man',
'asmr lady',
'professional woman',
'tutorial man',
'calm french woman',
'new york woman',
'spanish-speaking lady',
'midwestern woman',
'sportsman',
'storyteller lady',
'spanish-speaking man',
'doctor mischief',
'spanish-speaking reporter man',
'young spanish-speaking woman',
'the merchant',
'stern french man',
'madame mischief',
'german storyteller man',
'female nurse',
'german conversation man',
'friendly brazilian man',
'german woman',
'southern woman',
'british customer support lady',
'chinese woman narrator',
'pleasant man',
'california girl',
'john',
'anna'
Rime Mist v2, v3 Models:
'cove'
'lagoon'
'mari'
'moon'
'moraine'
'peak'
'summit'
'talon'
'thunder'
'tundra'
'wildflower'
Rime Arcana v2, v3, and v3 Turbo Models:
Rime Arcana v3 and Arcana v3 Turbo are multilingual models.
'albion'
'arcade'
'astra'
'atrium'
'bond'
'cupola'
'eliphas'
'estelle'
'eucalyptus'
'fern'
'lintel'
'luna'
'lyra'
'marlu'
'masonry'
'moss'
'oculus'
'parapet'
'pilaster'
'sirius'
'stucco'
'transom'
'truss'
'vashti'
'vespera'
'walnut'
Minimax Speech 2.6 Turbo Model: Sample voices include:
'English_DeterminedMan'
'English_Diligent_Man'
'English_expressive_narrator'
'English_FriendlyNeighbor'
'English_Graceful_Lady'
'Japanese_GentleButler'

Pricing

| Model | Price |
| --- | --- |
| Orpheus 3B | $15 per 1 million characters |
| Kokoro | $4 per 1 million characters |
| Cartesia Sonic 2 | $65 per 1 million characters |
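Since billing is per input character, estimating cost is a one-liner. A small sketch using the serverless prices above (the `estimate_cost` helper is ours; dedicated-endpoint models are priced differently and are not included):

```python
# Price per 1 million input characters, from the table above.
PRICE_PER_MILLION_CHARS = {
    "canopylabs/orpheus-3b-0.1-ft": 15.00,
    "hexgrad/Kokoro-82M": 4.00,
    "cartesia/sonic-2": 65.00,
}

def estimate_cost(text, model):
    """Estimate the dollar cost of synthesizing `text` with `model`."""
    return len(text) * PRICE_PER_MILLION_CHARS[model] / 1_000_000

# 56-character sentence repeated 1000 times -> 56,000 characters
script = "Today is a wonderful day to build something people love!" * 1000
print(f"${estimate_cost(script, 'hexgrad/Kokoro-82M'):.4f}")  # $0.2240
```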