Quick Start
Here’s how to get started with basic text-to-speech. The generated audio is saved to a speech.mp3 file.
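The following is a minimal sketch of such a request using only the Python standard library. Several details are assumptions rather than confirmed values: the REST URL is inferred from the WebSocket URL given later in this guide, the Bearer authorization scheme is a common convention, and the voice name is a placeholder. Check the API Reference before relying on any of them.

```python
import json
import urllib.request

# Assumed REST endpoint, inferred from the WebSocket URL in this guide.
SPEECH_URL = "https://api.together.xyz/v1/audio/speech"


def build_speech_request(api_key, text, voice, model="canopylabs/orpheus-3b-0.1-ft"):
    """Build (but do not send) a POST request for basic text-to-speech."""
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": "mp3",
    }
    return urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


def save_speech(request, path="speech.mp3"):
    """Send the request and write the returned audio bytes to a file."""
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())


# Example usage (performs a network call, so it is left commented out):
# req = build_speech_request("YOUR_API_KEY", "Hello there.", "YOUR_VOICE")
# save_speech(req)
```

Keeping request construction separate from sending makes it easy to inspect the payload before any audio is generated.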
Available Models
Together AI supports multiple text-to-speech models:

| Organization | Model Name | Model String for API | API Endpoint Support |
|---|---|---|---|
| Canopy Labs | Orpheus 3B | canopylabs/orpheus-3b-0.1-ft | Rest, Streaming, WebSocket |
| Kokoro | Kokoro | hexgrad/Kokoro-82M | Rest, Streaming, WebSocket |
| Cartesia | Cartesia Sonic 2 | cartesia/sonic-2 | Rest |
| Cartesia | Cartesia Sonic | cartesia/sonic | Rest |
| Rime | Arcana v2 (Dedicated Endpoint only) | rime-labs/rime-arcana-v2 | Rest, Streaming, WebSocket |
| Minimax | Speech 2.6 Turbo (Dedicated Endpoint only) | minimax/speech-2.6-turbo | Rest, Streaming, WebSocket |
| Rime | Mist v2 (Dedicated Endpoint only) | rime-labs/rime-mist-v2 | Rest, Streaming, WebSocket |
- Orpheus and Kokoro models support real-time WebSocket streaming for the lowest-latency applications.
- To use Cartesia models, you need to be at Build Tier 2 or higher.
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | The TTS model to use |
| input | string | Yes | The text to generate audio for |
| voice | string | Yes | The voice to use for generation. See Voices section |
| response_format | string | No | Output format: mp3, wav, raw (PCM), or mulaw (μ-law). The Minimax model also supports opus, aac, and flac. Default: wav |
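As an illustration of the table above, here is a small client-side helper that checks the required parameters and the response format before a request is sent. The function name is hypothetical, and detecting Minimax models by the "minimax/" prefix of the model string is an assumption made for this sketch.

```python
BASE_FORMATS = {"mp3", "wav", "raw", "mulaw"}
MINIMAX_EXTRA_FORMATS = {"opus", "aac", "flac"}


def build_payload(model, input_text, voice, response_format="wav"):
    """Return a request body, raising ValueError on invalid parameters."""
    # model, input, and voice are all required strings.
    for name, value in (("model", model), ("input", input_text), ("voice", voice)):
        if not isinstance(value, str) or not value:
            raise ValueError(f"{name} is required and must be a non-empty string")

    allowed = set(BASE_FORMATS)
    # Assumption: Minimax models are recognized by their "minimax/" prefix.
    if model.startswith("minimax/"):
        allowed |= MINIMAX_EXTRA_FORMATS
    if response_format not in allowed:
        raise ValueError(
            f"unsupported response_format {response_format!r} for {model}"
        )

    return {
        "model": model,
        "input": input_text,
        "voice": voice,
        "response_format": response_format,
    }
```

Validating locally surfaces a missing voice or an unsupported format before a network round trip is spent on it.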
Streaming Audio
For real-time applications where Time-To-First-Byte (TTFB) is critical, use streaming mode. When you set stream: true, the API returns a stream of events:
Delta Event:
For streaming, raw (PCM) format is supported. For non-streaming, you can use mp3, wav, or raw.
WebSocket API
For the lowest latency and most interactive applications, use the WebSocket API. This allows you to stream text input and receive audio chunks in real time.
Establishing a Connection
Connect to: wss://api.together.xyz/v1/audio/speech/websocket
Authentication:
- Include your API key as a query parameter: ?api_key=YOUR_API_KEY
- Or use the Authorization header when establishing the WebSocket connection
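The two authentication options can be sketched as follows. The helper names are hypothetical, and the Bearer scheme in the header variant is an assumption; the guide only states that the Authorization header is used.

```python
from urllib.parse import urlencode

WS_URL = "wss://api.together.xyz/v1/audio/speech/websocket"


def ws_url_with_key(api_key):
    """Option 1: pass the API key as the api_key query parameter."""
    # urlencode escapes any characters that are unsafe in a query string.
    return f"{WS_URL}?{urlencode({'api_key': api_key})}"


def ws_auth_headers(api_key):
    """Option 2: send the key in the Authorization header (Bearer assumed)."""
    return {"Authorization": f"Bearer {api_key}"}
```

Either result can be passed to the WebSocket client library of your choice. The header variant keeps the key out of URLs, which tend to end up in logs.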
| Parameter | Type | Description |
|---|---|---|
| model_id | string | The TTS model to use |
| voice | string | The voice for generation |
| response_format | string | Audio format: mp3, opus, aac, flac, wav, or pcm |
| speed | float | Playback speed (default: 1.0) |
| max_partial_length | integer | Character buffer length before triggering TTS generation |
Output Raw Bytes
To extract the raw audio bytes, use the settings below. The output is written to a test2.pcm file.
Response Formats
Together AI supports multiple audio formats:

| Format | Extension | Description | Streaming Support |
|---|---|---|---|
| wav | .wav | Uncompressed audio (larger file size) | No |
| mp3 | .mp3 | Compressed audio (smaller file size) | No |
| raw | .pcm | Raw PCM audio data | Yes |
| mulaw | .ulaw | Uses logarithmic compression to optimize speech quality for telephony | Yes |
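Raw PCM has no header, so most players cannot open a .pcm file directly. One way to make it playable is to wrap the bytes in a WAV container on the client, sketched below with Python's standard wave module. The sample rate, channel count, and sample width are assumptions (24 kHz, mono, 16-bit); use the values that apply to the model you call.

```python
import io
import wave


def pcm_to_wav(pcm_bytes, sample_rate=24000, channels=1, sample_width=2):
    """Wrap headerless PCM samples in a WAV container and return the bytes.

    The defaults are illustrative assumptions, not documented values.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)  # bytes per sample (2 = 16-bit)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)
    return buffer.getvalue()
```

For example, reading a saved .pcm file in binary mode, passing its contents through pcm_to_wav, and writing the result to a .wav file gives a file that standard audio tools can open.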
Best Practices
Choosing the Right Delivery Method
- Basic HTTP API: Best for batch processing or when you need complete audio files
- Streaming HTTP API: Best for real-time applications where TTFB matters
- WebSocket API: Best for interactive applications requiring lowest latency (chatbots, live assistants)
- Use streaming when you need the fastest time-to-first-byte
- Use WebSocket API for conversational applications
- Buffer text appropriately - sentence boundaries work best for natural speech
- Use the max_partial_length parameter in WebSocket to control buffer behavior
- Consider using raw (PCM) format for lowest latency, then encode client-side if needed
- Test different voices to find the best match for your application
- Some voices are better suited for specific content types (narration vs conversation)
- Use the Voices API to discover all available options
- Explore our API Reference for detailed parameter documentation
- Learn about Speech-to-Text for the reverse operation
- Check out our PDF to Podcast guide for a complete example
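The tip above about buffering text at sentence boundaries can be sketched as a small client-side buffer. It collects incoming text fragments (for example, tokens from a language model) and releases only complete sentences, which you would then send for synthesis. The class is hypothetical and the splitting rule is deliberately simple: it also splits after abbreviations such as "Dr.".

```python
import re

# A sentence ends at ., !, or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


class SentenceBuffer:
    """Accumulate streamed text and release only complete sentences."""

    def __init__(self):
        self._pending = ""

    def feed(self, fragment):
        """Add a text fragment and return any sentences it completed."""
        self._pending += fragment
        parts = _SENTENCE_END.split(self._pending)
        # The last part may be an unfinished sentence, so keep it buffered.
        self._pending = parts.pop()
        return [p for p in parts if p]

    def flush(self):
        """Return whatever text remains once the input has ended."""
        rest, self._pending = self._pending.strip(), ""
        return [rest] if rest else []
```

A sentence is released only once whitespace follows its closing punctuation, so call flush() at the end of the input to send the final sentence.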
Supported Voices
Different models support different voices. Use the Voices API to discover available voices for each model.
Voices API
For the available voices, use the /v1/voices endpoint or see the Kokoro voice documentation.
Kokoro Model:
To list the voices for a specific model, query the /v1/voices endpoint with the model parameter:
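A minimal sketch of such a query with the Python standard library. The host name and the Bearer authorization scheme are assumptions, and the shape of the JSON response is not described here, so the helper returns the decoded body as-is.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Assumed host; only the /v1/voices path is given in this guide.
VOICES_URL = "https://api.together.xyz/v1/voices"


def build_voices_request(api_key, model=None):
    """Build a GET request for /v1/voices, optionally filtered by model."""
    url = VOICES_URL
    if model:
        # urlencode escapes the "/" in model strings such as hexgrad/Kokoro-82M.
        url += "?" + urlencode({"model": model})
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
    )


def list_voices(request):
    """Send the request and return the decoded JSON body unchanged."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example usage (performs a network call, so it is left commented out):
# voices = list_voices(build_voices_request("YOUR_API_KEY", "hexgrad/Kokoro-82M"))
```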
Pricing
| Model | Price |
|---|---|
| Orpheus 3B | $15 per 1 Million characters |
| Kokoro | $4 per 1 Million characters |
| Cartesia Sonic 2 | $65 per 1 Million characters |
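To estimate what a job will cost, multiply the character count by the per-million rate from the table. The sketch below assumes that billing counts characters the way Python's len() does, which may differ from how the service counts them.

```python
from decimal import Decimal

# Prices in US dollars per 1 million characters, from the table above.
PRICE_PER_MILLION_CHARS = {
    "Orpheus 3B": Decimal("15"),
    "Kokoro": Decimal("4"),
    "Cartesia Sonic 2": Decimal("65"),
}


def estimate_cost(model_name, text):
    """Estimate the cost in dollars of synthesizing the given text."""
    price = PRICE_PER_MILLION_CHARS[model_name]
    # Assumption: one billed character per Python string character.
    return price * len(text) / Decimal(1_000_000)
```

For instance, 250,000 characters with Kokoro comes to 4 × 250,000 / 1,000,000 = $1.00, while the same text with Cartesia Sonic 2 comes to $16.25.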