Quick Start
Here’s how to get started with basic text-to-speech. A basic request returns audio that you can save to a speech.mp3 file.
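As a minimal sketch of that first request, the code below POSTs to the REST endpoint using only the Python standard library. The endpoint path https://api.together.ai/v1/audio/speech is an assumption (it mirrors the WebSocket host used elsewhere in this guide); the model and voice names are taken from this guide's tables and examples. Check the API Reference for the authoritative request shape.

```python
import json
import os
import urllib.request

# Assumed REST endpoint path, mirroring the WebSocket host in this guide.
API_URL = "https://api.together.ai/v1/audio/speech"

def build_tts_payload(text: str) -> dict:
    """Build the JSON body for a basic TTS request."""
    return {
        "model": "hexgrad/Kokoro-82M",  # see the Available Models table
        "input": text,
        "voice": "af_alloy",            # voice name from this guide's WebSocket URL example
        "response_format": "mp3",
    }

def synthesize(text: str, out_path: str = "speech.mp3") -> None:
    """POST the request and save the returned audio bytes to disk."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_tts_payload(text)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())

# synthesize("Hello from Together AI!")  # writes speech.mp3
```

Set TOGETHER_API_KEY in your environment before calling synthesize.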
Available Models
Together AI supports multiple text-to-speech models:

| Organization | Model Name | Model String for API | API Endpoint Support |
|---|---|---|---|
| Canopy Labs | Orpheus 3B | canopylabs/orpheus-3b-0.1-ft | REST, Streaming, WebSocket |
| Kokoro | Kokoro | hexgrad/Kokoro-82M | REST, Streaming, WebSocket |
| Cartesia | Cartesia Sonic 3 | cartesia/sonic-3 | REST, Streaming, WebSocket |
| Cartesia | Cartesia Sonic 2 | cartesia/sonic-2 | REST, Streaming, WebSocket |
| Cartesia | Cartesia Sonic | cartesia/sonic | REST, Streaming, WebSocket |
| Deepgram | Aura 2 (Dedicated Endpoint only) | deepgram/deepgram-aura-2 | REST, Streaming, WebSocket |
| Rime | Arcana v3 Turbo (Dedicated Endpoint only) | rime-labs/rime-arcana-v3-turbo | REST, Streaming, WebSocket |
| Rime | Arcana v3 (Dedicated Endpoint only) | rime-labs/rime-arcana-v3 | REST, Streaming, WebSocket |
| Rime | Arcana v2 (Dedicated Endpoint only) | rime-labs/rime-arcana-v2 | REST, Streaming, WebSocket |
| Rime | Mist v3 (Beta) (Dedicated Endpoint only) | rime-labs/rime-mist-v3 | REST, Streaming, WebSocket |
| Rime | Mist v2 (Dedicated Endpoint only) | rime-labs/rime-mist-v2 | REST, Streaming, WebSocket |
| Minimax | Speech 2.6 Turbo (Dedicated Endpoint only) | minimax/speech-2.6-turbo | REST, Streaming, WebSocket |
- Orpheus, Kokoro, and Cartesia models support real-time WebSocket streaming for the lowest-latency applications.
- To use Cartesia models, you need to be at Build Tier 2 or higher. Cartesia Sonic 2 and Sonic 3 are also available on Dedicated and Reserved Endpoints.
- Deepgram Aura 2 is available on Dedicated and Reserved Endpoints only.
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | The TTS model to use |
| input | string | Yes | The text to generate audio for |
| voice | string | Yes | The voice to use for generation. See Voices section |
| response_format | string | No | Output format: mp3, wav, raw (PCM), mulaw (μ-law). The Minimax model also supports opus, aac, and flac. Default: wav |
| sample_rate | integer | No | The sample rate of the output audio in Hz (e.g., 24000, 44100) |
| language | string | No | The language code for speech synthesis (e.g., en, fr, es) |
| alignment | string | No | Controls word-level timestamp generation. Set to word to receive word timestamps, or none to disable (default: none) |
| segment | string | No | Controls how text is segmented before synthesis. Options: sentence (default), immediate, never |
Word alignment (alignment=word) is only supported for streaming requests.
Streaming Audio
For real-time applications where Time-To-First-Byte (TTFB) is critical, use streaming mode. With stream: true, the API returns a stream of server-sent events:
Each server-sent event carries either an audio chunk or, when alignment=word is set, word-level timestamps.
For streaming, the raw (PCM) format is supported. For non-streaming, you can use mp3, wav, or raw.
WebSocket API
For the lowest latency and most interactive applications, use the WebSocket API. This allows you to stream text input and receive audio chunks in real time.
Establishing a Connection
Connect to: wss://api.together.ai/v1/audio/speech/websocket
Authentication:
- Include your API key as a query parameter: ?api_key=YOUR_API_KEY
- Or use the Authorization header when establishing the WebSocket connection
conversation.item.word_timestamps: sent when alignment=word is set. Contains word-level timing information for the generated audio.
| Parameter | Type | Description |
|---|---|---|
| model | string | The TTS model to use |
| voice | string | The voice for generation |
| response_format | string | Audio format: mp3, opus, aac, flac, wav, or pcm |
| speed | float | Playback speed (default: 1.0) |
| max_partial_length | integer | Character buffer length before triggering TTS generation |
| sample_rate | integer | The sample rate of the output audio in Hz (e.g., 24000, 44100) |
| language | string | The language code for speech synthesis (e.g., en, fr, es) |
| alignment | string | Controls word-level timestamp generation. Set to word to receive conversation.item.word_timestamps events, or none to disable (default: none) |
| segment | string | Controls how text is segmented before synthesis. Options: sentence (default) splits on sentence boundaries, immediate processes text as soon as it arrives, never waits until buffer is committed |
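A small sketch of assembling the connection URL from those session parameters. Only URL construction is shown, since the message protocol after connecting is not fully detailed here; actually connecting would use a WebSocket client library of your choice (an assumption, not a requirement of the API).

```python
import urllib.parse

# WebSocket endpoint from this guide.
WS_BASE = "wss://api.together.ai/v1/audio/speech/websocket"

def build_ws_url(api_key: str, **session_params: str) -> str:
    """Assemble the WebSocket URL with api_key and session parameters
    (model, voice, sample_rate, alignment, ...) as query arguments."""
    query = urllib.parse.urlencode({"api_key": api_key, **session_params})
    return f"{WS_BASE}?{query}"

# Example:
# url = build_ws_url("YOUR_API_KEY", model="hexgrad/Kokoro-82M",
#                    voice="af_alloy", sample_rate="24000", alignment="word")
# ...then open the connection with your WebSocket client of choice.
```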
You can pass these query parameters either in the WebSocket URL (e.g., wss://api.together.ai/v1/audio/speech/websocket?model=hexgrad/Kokoro-82M&voice=af_alloy&sample_rate=24000&alignment=word) or dynamically via the tts_session.updated event after the connection is established.
Output Raw Bytes
If you want to extract raw audio bytes, use the settings below; the output is saved to a test2.pcm file.
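A hedged sketch of that raw-bytes request, again using only the standard library: the endpoint path is assumed as elsewhere in this guide, and the 16-bit sample-width used by the duration helper is our assumption, so verify it against the model's output spec.

```python
import json
import os
import urllib.request

API_URL = "https://api.together.ai/v1/audio/speech"  # assumed REST path

def fetch_raw_pcm(text: str, out_path: str = "test2.pcm") -> None:
    """Request raw (headerless) PCM audio and write the bytes to disk."""
    body = {
        "model": "hexgrad/Kokoro-82M",
        "input": text,
        "voice": "af_alloy",
        "response_format": "raw",   # raw PCM, no container
        "sample_rate": 24000,
    }
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())

def pcm_duration_seconds(num_bytes: int, sample_rate: int = 24000,
                         sample_width: int = 2) -> float:
    """Duration of mono PCM audio, assuming 16-bit (2-byte) samples."""
    return num_bytes / (sample_rate * sample_width)
```

pcm_duration_seconds is handy for sanity-checking that the saved file matches the expected length of the spoken text.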
Response Formats
Together AI supports multiple audio formats:

| Format | Extension | Description | Streaming Support |
|---|---|---|---|
| wav | .wav | Uncompressed audio (larger file size) | No |
| mp3 | .mp3 | Compressed audio (smaller file size) | No |
| raw | .pcm | Raw PCM audio data | Yes |
| mulaw | .ulaw | Uses logarithmic compression to optimize speech quality for telephony | Yes |
Best Practices
Choosing the Right Delivery Method
- Basic HTTP API: Best for batch processing or when you need complete audio files
- Streaming HTTP API: Best for real-time applications where TTFB matters
- WebSocket API: Best for interactive applications requiring the lowest latency (chatbots, live assistants)
- Use streaming when you need the fastest time-to-first-byte
- Use WebSocket API for conversational applications
- Buffer text appropriately - sentence boundaries work best for natural speech
- Use the max_partial_length parameter in WebSocket to control buffer behavior
- Consider using raw (PCM) format for lowest latency, then encode client-side if needed
- Test different voices to find the best match for your application
- Some voices are better suited for specific content types (narration vs conversation)
- Use the Voices API to discover all available options
- Explore our API Reference for detailed parameter documentation
- Learn about Speech-to-Text for the reverse operation
- Check out our PDF to Podcast guide for a complete example
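The "use raw (PCM), then encode client-side" tip above can be sketched with the standard-library wave module: wrap the headerless PCM bytes in a WAV container for playback. Mono 16-bit samples are our assumption here; check the model's output spec.

```python
import io
import wave

def pcm_to_wav(pcm: bytes, sample_rate: int = 24000) -> bytes:
    """Wrap raw 16-bit mono PCM bytes in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)           # mono (assumed)
        w.setsampwidth(2)           # 16-bit samples (assumed)
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()

# wav_bytes = pcm_to_wav(open("test2.pcm", "rb").read())
# open("test2.wav", "wb").write(wav_bytes)
```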
Supported Voices
Some of the supported voices for each model are shown below. For the full list of available voices, query the /v1/voices endpoint.
Voices API
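A hedged sketch of querying the /v1/voices endpoint: the host is assumed to match the rest of this guide, and since the response schema isn't shown here, the filter helper assumes a hypothetical per-voice model_id field for illustration.

```python
import json
import os
import urllib.request

VOICES_URL = "https://api.together.ai/v1/voices"  # host assumed from this guide

def list_voices() -> list:
    """Fetch the full voice list from the Voices API."""
    req = urllib.request.Request(
        VOICES_URL,
        headers={"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def voices_for_model(voices: list, model: str) -> list:
    """Filter voice entries by a hypothetical 'model_id' field."""
    return [v for v in voices if v.get("model_id") == model]
```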
Rime Arcana v3 and Arcana v3 Turbo are multilingual models.
Pricing
| Model | Price |
|---|---|
| Orpheus 3B | $15 per 1 Million characters |
| Kokoro | $4 per 1 Million characters |
| Cartesia Sonic 2 | $65 per 1 Million characters |