Together AI hosts text-to-speech models with multiple delivery methods. Use this page for the basics: making a request, picking a model, and configuring parameters. For real-time delivery, see Streaming and WebSocket.
Quickstart
A basic text-to-speech request sends text to the API and saves the returned audio to a speech.mp3 file.
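A minimal sketch of such a request using Python's standard library, assuming the OpenAI-compatible `/v1/audio/speech` endpoint path and a `TOGETHER_API_KEY` environment variable; the model and voice names passed to `synthesize` are placeholders:

```python
import json
import os
import urllib.request

API_URL = "https://api.together.xyz/v1/audio/speech"  # assumed endpoint path

def build_payload(text: str, model: str, voice: str) -> dict:
    """Assemble the JSON body for a basic TTS request."""
    return {
        "model": model,            # required
        "input": text,             # required
        "voice": voice,            # required; see the Voices section
        "response_format": "mp3",
    }

def synthesize(text: str, model: str, voice: str, out_path: str = "speech.mp3") -> None:
    """POST the request and write the returned audio bytes to disk."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(text, model, voice)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=60) as resp, open(out_path, "wb") as f:
        f.write(resp.read())  # MP3 bytes
```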
Available models
For the current list of text-to-speech models, see the serverless catalog or the dedicated endpoint model catalog.
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | The TTS model to use |
| input | string | Yes | The text to generate audio for |
| voice | string | Yes | The voice to use for generation. See Voices section |
| response_format | string | No | Output format: mp3, wav, raw (PCM), mulaw (μ-law). Minimax model also supports opus, aac, and flac. Default: wav |
| sample_rate | integer | No | The sample rate of the output audio in Hz (e.g., 24000, 44100) |
| bit_rate | integer | No | MP3 bitrate in bits per second. Only applies when response_format is mp3. Valid values: 32000, 64000, 96000, 128000, 192000. Default: 128000. Currently supported on Cartesia models. |
| language | string | No | The language or locale code for speech synthesis (e.g., en, fr, es). Locales are supported and must be lowercase (e.g., zh-hk for Cantonese) |
| alignment | string | No | Controls word-level timestamp generation. Set to word to receive word timestamps, or none to disable (default: none) |
| segment | string | No | Controls how text is segmented before synthesis. Options: sentence (default), immediate, never |
| extra_params | object | No | Additional model-specific parameters. Supported fields: |
| extra_params.pronunciation_dict | array | No | A list of pronunciation rules for specific characters or symbols. Each entry uses the format "<source>/<replacement>" (e.g., ["omg/oh my god"]) to override how the model pronounces matching tokens. |
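As an illustration of how these parameters combine, here is a hypothetical request body exercising the optional fields above (the model and voice names are placeholders, not real identifiers):

```python
def build_tts_body() -> dict:
    """Example request body using the optional TTS parameters."""
    return {
        "model": "example/tts-model",   # placeholder model name
        "input": "OMG, this is synthesized speech.",
        "voice": "example_voice",       # placeholder voice name
        "response_format": "mp3",
        "sample_rate": 44100,
        "bit_rate": 128000,             # only applies to mp3 output
        "language": "en",
        "segment": "sentence",          # default segmentation
        "extra_params": {
            # Pronounce "OMG" as "oh my god"
            "pronunciation_dict": ["OMG/oh my god"],
        },
    }
```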
Word alignment (`alignment=word`) is only supported for streaming requests.
Response formats
Together AI supports multiple audio formats:

| Format | Extension | Description | Streaming Support |
|---|---|---|---|
| wav | .wav | Uncompressed audio (larger file size) | No |
| mp3 | .mp3 | Compressed audio (smaller file size) | No |
| raw | .pcm | Raw PCM audio data | Yes |
| mulaw | .ulaw | Uses logarithmic compression to optimize speech quality for telephony | Yes |
Best practices
Choose the right delivery method
- Basic HTTP API: Best for batch processing or when you need complete audio files.
- Streaming HTTP API: Best for real-time applications where TTFB matters. See Streaming.
- WebSocket API: Best for interactive applications requiring lowest latency (chatbots, live assistants). See WebSocket.
Performance tips
- Use streaming when you need the fastest time-to-first-byte.
- Use the WebSocket API for conversational applications.
- Buffer text appropriately. Sentence boundaries work best for natural speech.
- Use the `max_partial_length` parameter in WebSocket to control buffer behavior.
- Consider using `raw` (PCM) format for lowest latency, then encode client-side if needed.
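As a sketch of that "encode client-side" step: raw PCM bytes can be wrapped in a WAV container with Python's standard wave module. This assumes 16-bit mono PCM; match the sample rate to whatever you requested:

```python
import wave

def pcm_to_wav(pcm_bytes: bytes, out_path: str,
               sample_rate: int = 24000, channels: int = 1,
               sample_width: int = 2) -> None:
    """Wrap raw 16-bit PCM bytes in a WAV container."""
    with wave.open(out_path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)   # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)
```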
Voice selection
- Test different voices to find the best match for your application.
- Some voices are better suited for specific content types (narration vs conversation).
- Use the Voices API to discover all available options.
Supported voices
Some of the supported voices for each model are shown below. For the full list of available voices, query the `/v1/voices` endpoint.
Voices API
Available voices
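A sketch of fetching and filtering the voice list, assuming the endpoint returns JSON with a top-level `data` array whose entries carry `name` and `model` fields (the field names are assumptions about the schema):

```python
import json
import os
import urllib.request

VOICES_URL = "https://api.together.xyz/v1/voices"  # endpoint named on this page

def voices_for_model(voices: list, model: str) -> list:
    """Filter a parsed voices list down to one model's voice names.
    The "name" and "model" field names are schema assumptions."""
    return [v["name"] for v in voices if v.get("model") == model]

def fetch_voices() -> list:
    """GET the full voice catalog from the API."""
    req = urllib.request.Request(
        VOICES_URL,
        headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())["data"]  # assumed top-level "data" key
```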
Orpheus model
Sample Orpheus voices are returned by the `/v1/voices` endpoint.
Kokoro model
Sample Kokoro voices include `af_bella`, `af_heart`, and `am_adam`; query the `/v1/voices` endpoint for the full set.
Voice mixing (Kokoro only)
Kokoro supports combining two or more voices into a single blended voice by joining their names with `+`. This can be useful for creating custom voice characteristics that aren't available from any single voice on its own.
- Equal weights: `af_bella+af_heart` blends the two voices in equal proportion.
- Custom weights: `af_bella(2)+af_heart(1)` weights `af_bella` twice as heavily as `af_heart`. Weights can be integers or decimals.
- More than two voices: `af_bella(1)+af_heart(1)+am_adam(0.5)`. Any number of components is supported.
Voice mixing is only supported for `hexgrad/Kokoro-82M`. Other TTS models require a single voice name.
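The mixing syntax lends itself to a small helper. A sketch (the helper itself is hypothetical; only the `name(weight)` format comes from this page):

```python
def kokoro_mix(*components) -> str:
    """Build a Kokoro blended-voice string from (name, weight) pairs.
    A weight of 1 is emitted bare, matching the equal-weights form."""
    parts = []
    for name, weight in components:
        if weight == 1:
            parts.append(name)
        else:
            # Integer weights print without a trailing ".0"
            w = int(weight) if float(weight).is_integer() else weight
            parts.append(f"{name}({w})")
    return "+".join(parts)
```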
Cartesia models
All valid Cartesia voice strings are returned by the `/v1/voices` endpoint.
Rime Mist v2, v3 models
Query the `/v1/voices` endpoint for Rime Mist voice names.
Rime Arcana v2, v3, and v3 Turbo models
Rime Arcana v3 and Arcana v3 Turbo are multilingual models.
Query the `/v1/voices` endpoint for Arcana voice names.
Minimax Speech 2.6 Turbo model
Sample Minimax voices are returned by the `/v1/voices` endpoint.
Pricing
| Model | Price |
|---|---|
| Orpheus 3B | $15 per 1M characters |
| Kokoro | $4 per 1M characters |
| Cartesia Sonic 2 | $65 per 1M characters |
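Because pricing is per character, cost estimation is a one-line calculation. A sketch using the rates above (the dictionary keys are shorthand, not real model identifiers):

```python
# Price per 1M characters in USD, from the pricing table
PRICE_PER_MILLION = {
    "orpheus-3b": 15.0,
    "kokoro": 4.0,
    "cartesia-sonic-2": 65.0,
}

def estimate_cost(text: str, model_key: str) -> float:
    """Estimated synthesis cost in USD for one request."""
    rate = PRICE_PER_MILLION[model_key]
    return len(text) * rate / 1_000_000
```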
Next steps
- Streaming: stream audio over HTTP for low time-to-first-byte, plus how to extract raw PCM bytes.
- WebSocket API: stream text in and audio out over a single WebSocket for the lowest interactive latency, including multi-context support.
- API reference for detailed parameter documentation.
- Speech-to-text for the reverse operation.
- PDF to Podcast guide for a complete example.