

Using a coding agent? Install the together-audio skill to let your agent write correct text-to-speech code automatically. See agent skills for details.
Together AI hosts text-to-speech models with multiple delivery methods. Use this page for the basics: making a request, picking a model, and configuring parameters. For real-time delivery, see Streaming and WebSocket.
Want to hear it in action? Call (847) 851-4323 to talk to a live voice agent powered by Together AI’s real-time STT and TTS pipeline. Then read the end-to-end guide to build your own.

Quickstart

Basic text-to-speech request:
from together import Together

client = Together()

speech_file_path = "speech.mp3"

with client.audio.speech.with_streaming_response.create(
    model="canopylabs/orpheus-3b-0.1-ft",
    input="Today is a wonderful day to build something people love!",
    voice="tara",
    response_format="mp3",
) as response:
    response.stream_to_file(speech_file_path)
Outputs a speech.mp3 file.

Available models

For the current list of text-to-speech models, see the serverless catalog or the dedicated endpoint model catalog.

Parameters

model (string, required): The TTS model to use.
input (string, required): The text to generate audio for.
voice (string, required): The voice to use for generation. See the Supported voices section.
response_format (string, optional): Output format: mp3, wav, raw (PCM), or mulaw (μ-law). The Minimax model also supports opus, aac, and flac. Default: wav.
sample_rate (integer, optional): The sample rate of the output audio in Hz (e.g., 24000, 44100).
bit_rate (integer, optional): MP3 bitrate in bits per second; applies only when response_format is mp3. Valid values: 32000, 64000, 96000, 128000, 192000. Default: 128000. Currently supported on Cartesia models.
language (string, optional): The language or locale code for speech synthesis (e.g., en, fr, es). Locale codes must be lowercase (e.g., zh-hk for Cantonese).
alignment (string, optional): Controls word-level timestamp generation. Set to word to receive word timestamps, or none to disable. Default: none.
segment (string, optional): Controls how text is segmented before synthesis. Options: sentence (default), immediate, never.
extra_params (object, optional): Additional model-specific parameters. Supported fields:
  pronunciation_dict (array, optional): A list of pronunciation rules for specific characters or symbols. Each entry uses the format "<source>/<replacement>" (e.g., ["omg/oh my god"]) to override how the model pronounces matching tokens.
Word alignment (alignment=word) is only supported for streaming requests.
For the full set of parameters refer to the API reference for /audio/speech.
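Taken together, these parameters form the JSON body of a POST to /v1/audio/speech. A minimal sketch of how the fields combine (values are illustrative; per-model support for the optional fields varies, as noted in the table above):

```python
# Sketch: assembling the documented /v1/audio/speech parameters
# into a request body. Field names mirror the parameter table.
payload = {
    "model": "canopylabs/orpheus-3b-0.1-ft",  # required
    "input": "Hello from Together AI!",       # required
    "voice": "tara",                          # required
    "response_format": "mp3",                 # optional, default "wav"
    "sample_rate": 24000,                     # optional, in Hz
    "language": "en",                         # optional, lowercase code
    "extra_params": {
        # Pronunciation overrides use "<source>/<replacement>" entries.
        "pronunciation_dict": ["omg/oh my god"],
    },
}

# bit_rate applies only when response_format is "mp3".
if payload["response_format"] == "mp3":
    payload["bit_rate"] = 128000

print(payload["response_format"])  # mp3
```

The same body can be sent with the Python SDK (as in the quickstart) or as raw JSON with cURL.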

Response formats

Together AI supports multiple audio formats:
wav (.wav): Uncompressed audio (larger file size). Streaming: not supported.
mp3 (.mp3): Compressed audio (smaller file size). Streaming: not supported.
raw (.pcm): Raw PCM audio data. Streaming: supported.
mulaw (.ulaw): Logarithmically compressed audio optimized for telephony speech. Streaming: supported.

Best practices

Choose the right delivery method

  • Basic HTTP API: Best for batch processing or when you need complete audio files.
  • Streaming HTTP API: Best for real-time applications where TTFB matters. See Streaming.
  • WebSocket API: Best for interactive applications requiring lowest latency (chatbots, live assistants). See WebSocket.

Performance tips

  • Use streaming when you need the fastest time-to-first-byte.
  • Use the WebSocket API for conversational applications.
  • Buffer text appropriately. Sentence boundaries work best for natural speech.
  • Use the max_partial_length parameter in WebSocket to control buffer behavior.
  • Consider using raw (PCM) format for lowest latency, then encode client-side if needed.
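As an illustration of the last tip, raw PCM returned by the API can be wrapped in a WAV container client-side using Python's standard library, with no re-encoding. The 24000 Hz, mono, 16-bit layout below is an assumption; match it to the sample_rate and sample width your model actually returns:

```python
import io
import wave

def pcm_to_wav(pcm_bytes: bytes, sample_rate: int = 24000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw 16-bit PCM bytes in a WAV container without re-encoding."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)
    return buf.getvalue()

# One second of silence at 24 kHz, 16-bit mono, stands in for API output.
wav_bytes = pcm_to_wav(b"\x00\x00" * 24000)
print(wav_bytes[:4])  # b'RIFF'
```

This keeps the latency-critical path (streaming raw PCM) separate from the packaging step, which can run after playback has already started.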

Voice selection

  • Test different voices to find the best match for your application.
  • Some voices are better suited for specific content types (narration vs conversation).
  • Use the Voices API to discover all available options.

Supported voices

Some of the supported voices for each model are shown below. For the full list of available voices, query the /v1/voices endpoint.

Voices API

from together import Together

client = Together()

# List all available voices
response = client.audio.voices.list()

for model_voices in response.data:
    print(f"Model: {model_voices.model}")
    for voice in model_voices.voices:
        print(f"  - Voice: {voice.name}")

Available voices

Orpheus model

Sample voices include:
`tara`
`leah`
`jess`
`leo`
`dan`
`mia`
`zac`
`zoe`

Kokoro model

af_heart
af_alloy
af_aoede
af_bella
af_jessica
af_kore
af_nicole
af_nova
af_river
af_sarah
af_sky
am_adam
am_echo
am_eric
am_fenrir
am_liam
am_michael
am_onyx
am_puck
am_santa
bf_alice
bf_emma
bf_isabella
bf_lily
bm_daniel
bm_fable
bm_george
bm_lewis
jf_alpha
jf_gongitsune
jf_nezumi
jf_tebukuro
jm_kumo
zf_xiaobei
zf_xiaoni
zf_xiaoxiao
zf_xiaoyi
zm_yunjian
zm_yunxi
zm_yunxia
zm_yunyang
ef_dora
em_alex
em_santa
ff_siwis
hf_alpha
hf_beta
hm_omega
hm_psi
if_sara
im_nicola
pf_dora
pm_alex
pm_santa
Voice mixing (Kokoro only)
Kokoro supports combining two or more voices into a single blended voice by joining their names with +. This can be useful for creating custom voice characteristics that aren’t available from any single voice on its own.
  • Equal weights: af_bella+af_heart blends the two voices in equal proportion.
  • Custom weights: af_bella(2)+af_heart(1) weights af_bella twice as heavily as af_heart. Weights can be integers or decimals.
  • More than two voices: af_bella(1)+af_heart(1)+am_adam(0.5). Any number of components is supported.
Voice mixing is only supported for hexgrad/Kokoro-82M. Other TTS models require a single voice name.
Example:
cURL
curl -X POST "https://api.together.ai/v1/audio/speech" \
     -H "Authorization: Bearer $TOGETHER_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{
       "model": "hexgrad/Kokoro-82M",
       "input": "The quick brown fox jumps over the lazy dog",
       "voice": "af_bella+af_heart",
       "response_format": "mp3"
     }' \
     --output speech.mp3
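The weighted mixing syntax can also be built programmatically. A minimal sketch (the mix_voices helper is ours, not part of the SDK):

```python
def mix_voices(*components: tuple) -> str:
    """Join (voice, weight) pairs into Kokoro's "name(w)+name(w)" syntax.

    Integral weights are printed without a decimal point; decimals are
    kept as-is.
    """
    parts = []
    for name, weight in components:
        w = int(weight) if float(weight).is_integer() else weight
        parts.append(f"{name}({w})")
    return "+".join(parts)

voice = mix_voices(("af_bella", 1), ("af_heart", 1), ("am_adam", 0.5))
print(voice)  # af_bella(1)+af_heart(1)+am_adam(0.5)
```

The resulting string goes straight into the voice field of the request, exactly as in the cURL example above.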

Cartesia models

All valid voice strings:
'german conversational woman',
'nonfiction man',
'friendly sidekick',
'french conversational lady',
'french narrator lady',
'german reporter woman',
'indian lady',
'british reading lady',
'british narration lady',
'japanese children book',
'japanese woman conversational',
'japanese male conversational',
'reading lady',
'newsman',
'child',
'meditation lady',
'maria',
"1920's radioman",
'newslady',
'calm lady',
'helpful woman',
'mexican woman',
'korean narrator woman',
'russian calm lady',
'russian narrator man 1',
'russian narrator man 2',
'russian narrator woman',
'hinglish speaking lady',
'italian narrator woman',
'polish narrator woman',
'chinese female conversational',
'pilot over intercom',
'chinese commercial man',
'french narrator man',
'spanish narrator man',
'reading man',
'new york man',
'friendly french man',
'barbershop man',
'indian man',
'australian customer support man',
'friendly australian man',
'wise man',
'friendly reading man',
'customer support man',
'dutch confident man',
'dutch man',
'hindi reporter man',
'italian calm man',
'italian narrator man',
'swedish narrator man',
'polish confident man',
'spanish-speaking storyteller man',
'kentucky woman',
'chinese commercial woman',
'middle eastern woman',
'hindi narrator woman',
'sarah',
'sarah curious',
'laidback woman',
'reflective woman',
'helpful french lady',
'pleasant brazilian lady',
'customer support lady',
'british lady',
'wise lady',
'australian narrator lady',
'indian customer support lady',
'swedish calm lady',
'spanish narrator lady',
'salesman',
'yogaman',
'movieman',
'wizardman',
'australian woman',
'korean calm woman',
'friendly german man',
'announcer man',
'wise guide man',
'midwestern man',
'kentucky man',
'brazilian young man',
'chinese call center man',
'german reporter man',
'confident british man',
'southern man',
'classy british man',
'polite man',
'mexican man',
'korean narrator man',
'turkish narrator man',
'turkish calm man',
'hindi calm man',
'hindi narrator man',
'polish narrator man',
'polish young man',
'alabama male',
'australian male',
'anime girl',
'japanese man book',
'sweet lady',
'commercial lady',
'teacher lady',
'princess',
'commercial man',
'asmr lady',
'professional woman',
'tutorial man',
'calm french woman',
'new york woman',
'spanish-speaking lady',
'midwestern woman',
'sportsman',
'storyteller lady',
'spanish-speaking man',
'doctor mischief',
'spanish-speaking reporter man',
'young spanish-speaking woman',
'the merchant',
'stern french man',
'madame mischief',
'german storyteller man',
'female nurse',
'german conversation man',
'friendly brazilian man',
'german woman',
'southern woman',
'british customer support lady',
'chinese woman narrator',
'pleasant man',
'california girl',
'john',
'anna'

Rime Mist v2, v3 models

'cove'
'lagoon'
'mari'
'moon'
'moraine'
'peak'
'summit'
'talon'
'thunder'
'tundra'
'wildflower'

Rime Arcana v2, v3, and v3 Turbo models

Rime Arcana v3 and Arcana v3 Turbo are multilingual models.
'albion'
'arcade'
'astra'
'atrium'
'bond'
'cupola'
'eliphas'
'estelle'
'eucalyptus'
'fern'
'lintel'
'luna'
'lyra'
'marlu'
'masonry'
'moss'
'oculus'
'parapet'
'pilaster'
'sirius'
'stucco'
'transom'
'truss'
'vashti'
'vespera'
'walnut'

Minimax Speech 2.6 Turbo model

Sample voices include:
'English_DeterminedMan'
'English_Diligent_Man'
'English_expressive_narrator'
'English_FriendlyNeighbor'
'English_Graceful_Lady'
'Japanese_GentleButler'

Pricing

Orpheus 3B: $15 per 1M characters
Kokoro: $4 per 1M characters
Cartesia Sonic 2: $65 per 1M characters
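Per-character pricing makes cost estimation straightforward; a quick sketch using the rates above (billing may differ in detail, e.g. rounding or minimums, so treat this as an estimate only):

```python
# USD per 1M input characters, from the pricing table above.
PRICE_PER_1M_CHARS = {
    "Orpheus 3B": 15.0,
    "Kokoro": 4.0,
    "Cartesia Sonic 2": 65.0,
}

def estimate_cost(model: str, text: str) -> float:
    """Estimated synthesis cost in USD for a given input text."""
    return len(text) * PRICE_PER_1M_CHARS[model] / 1_000_000

# A 500-character script on Kokoro costs roughly $0.002.
print(round(estimate_cost("Kokoro", "x" * 500), 6))  # 0.002
```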

Next steps

  • Streaming: stream audio over HTTP for low time-to-first-byte, plus how to extract raw PCM bytes.
  • WebSocket API: stream text in and audio out over a single WebSocket for the lowest interactive latency, including multi-context support.
  • API reference for detailed parameter documentation.
  • Speech-to-text for the reverse operation.
  • PDF to Podcast guide for a complete example.