Quick Start
Here’s how to get started with basic transcription and translation.
Available Models
Together AI supports multiple speech-to-text models:
| Organization | Model Name | Model String for API | Capabilities |
|---|---|---|---|
| OpenAI | Whisper Large v3 | openai/whisper-large-v3 | Real-time, Translation, Diarization |
| Mistral AI | Voxtral Mini 3B | mistralai/Voxtral-Mini-3B-2507 | |
Audio Transcription
Audio transcription converts speech to text in the same language as the source audio. Supported input formats:
- .wav (audio/wav)
- .mp3 (audio/mpeg)
- .m4a (audio/mp4)
- .webm (audio/webm)
- .flac (audio/flac)
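As a minimal sketch using only the Python standard library, a transcription request uploads the audio file and model name as multipart form data. The endpoint path and field names below follow the common OpenAI-compatible audio API shape and are assumptions; check the API Reference for the exact contract.

```python
import io
import urllib.request
import uuid

# Assumed endpoint path, mirroring the OpenAI-compatible audio API.
API_URL = "https://api.together.xyz/v1/audio/transcriptions"

def build_transcription_request(audio_bytes, filename, model, api_key):
    """Build a multipart/form-data POST for the transcription endpoint."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()
    # "model" form field
    body.write((
        "--%s\r\n"
        "Content-Disposition: form-data; name=\"model\"\r\n\r\n"
        "%s\r\n" % (boundary, model)
    ).encode())
    # "file" form field carrying the raw audio bytes
    body.write((
        "--%s\r\n"
        "Content-Disposition: form-data; name=\"file\"; filename=\"%s\"\r\n"
        "Content-Type: audio/wav\r\n\r\n" % (boundary, filename)
    ).encode())
    body.write(audio_bytes)
    body.write(("\r\n--%s--\r\n" % boundary).encode())
    return urllib.request.Request(
        API_URL,
        data=body.getvalue(),
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "multipart/form-data; boundary=" + boundary,
        },
        method="POST",
    )

req = build_transcription_request(
    b"RIFF....WAVE", "speech.wav", "openai/whisper-large-v3", "YOUR_API_KEY")
# urllib.request.urlopen(req) sends the request; the JSON response
# carries the transcript in its "text" field.
```

In practice the official SDK wraps all of this; the sketch only makes the wire format concrete.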
The optional language parameter accepts codes such as:
- “en” - English
- “es” - Spanish
- “fr” - French
- “de” - German
- “ja” - Japanese
- “zh” - Chinese
- “auto” - Auto-detect (default)
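These codes would be passed alongside the model in the request. A small sketch, assuming the field is named `language` and treating the list above as the allowed values (it may not be exhaustive):

```python
# "language" is the assumed request-field name; the codes below are the
# ones listed above and may not be the full set the API accepts.
SUPPORTED_LANGUAGES = {"en", "es", "fr", "de", "ja", "zh", "auto"}

def language_params(model, language="auto"):
    """Build the request fields for a transcription in a given language."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError("unsupported language code: " + language)
    return {"model": model, "language": language}

params = language_params("openai/whisper-large-v3", "es")  # force Spanish
```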
Real-time Streaming Transcription
For applications requiring the lowest latency, use the real-time WebSocket API. This provides streaming transcription with incremental results.
Establishing a Connection
Connect to: wss://api.together.ai/v1/realtime?model={model}&input_audio_format=pcm_s16le_16000
Query parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model to use (e.g., openai/whisper-large-v3) |
| input_audio_format | string | Yes | Audio format: pcm_s16le_16000 |
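Building the connection URL from these parameters can be done with the standard library; the actual connection then needs a WebSocket client library (not shown here). Note that the model string must be URL-encoded:

```python
from urllib.parse import urlencode

def realtime_url(model, input_audio_format="pcm_s16le_16000"):
    """Build the real-time WebSocket URL from the parameters above."""
    query = urlencode({"model": model,
                       "input_audio_format": input_audio_format})
    return "wss://api.together.ai/v1/realtime?" + query

url = realtime_url("openai/whisper-large-v3")
# Connect with any WebSocket client and stream 16 kHz signed 16-bit
# little-endian PCM frames; incremental transcripts come back as messages.
```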
Audio Translation
Audio translation converts speech from any language to English text.
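The translation call mirrors transcription: the same multipart upload of an audio file plus a model field, against a different endpoint. The path below is an assumption modeled on the transcription endpoint:

```python
# Assumed endpoint path, mirroring the transcription endpoint; the
# response JSON "text" field is always English, regardless of source.
TRANSLATION_URL = "https://api.together.xyz/v1/audio/translations"

# Request fields match transcription; no "language" field is needed
# because the output language is fixed to English.
params = {"model": "openai/whisper-large-v3"}
```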
Speaker Diarization
Enable diarization to identify who is speaking when. If the number of speakers is known, you can also set min_speakers and max_speakers to improve diarization accuracy.
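As a sketch of the request fields, with the caveat that the flag name for enabling diarization is an assumption (only min_speakers and max_speakers are named on this page):

```python
# "diarize" is an assumed flag name; min_speakers/max_speakers bound the
# expected speaker count when it is known up front.
params = {
    "model": "openai/whisper-large-v3",
    "diarize": True,
    "min_speakers": 2,   # at least two people expected in the audio
    "max_speakers": 4,   # and no more than four
}
```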
Word-level Timestamps
Get word-level timing information for each transcribed word.
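A sketch of the request option and the per-word entries it produces; the option name `timestamp_granularities` follows the OpenAI-style audio API and is an assumption here:

```python
# "timestamp_granularities" is an assumed option name (OpenAI-style);
# verbose_json responses then include per-word start/end offsets in seconds.
params = {
    "model": "openai/whisper-large-v3",
    "response_format": "verbose_json",
    "timestamp_granularities": ["word"],
}

# Shape of the per-word entries in such a response:
sample_words = [
    {"word": "hello", "start": 0.00, "end": 0.42},
    {"word": "world", "start": 0.48, "end": 0.90},
]
timeline = ["%.2f-%.2f %s" % (w["start"], w["end"], w["word"])
            for w in sample_words]
```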
Response Formats
JSON Format (Default)
Returns only the transcribed/translated text.
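Assuming the default response is a JSON object with a single text field, extracting the transcript is one line:

```python
import json

# Assumed default response shape: a JSON object with a "text" field.
raw = '{"text": "Hello, world."}'
transcript = json.loads(raw)["text"]
```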
Advanced Features
Temperature Control
Adjust randomness in the output (0.0 = deterministic, 1.0 = creative).
Async Support
All transcription and translation operations support async/await.
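If your client does not expose native async methods, a blocking call can still be made awaitable by running it in a worker thread. A stdlib-only sketch with the HTTP call stubbed out (the stub also shows where the temperature parameter described above would go):

```python
import asyncio

def transcribe_sync(path, temperature=0.0):
    # Placeholder for the blocking HTTP call from the earlier sketch;
    # temperature=0.0 keeps the output deterministic, as described above.
    return "transcript of " + path

async def transcribe_async(path):
    # Run the blocking call in a worker thread so the event loop stays free.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, transcribe_sync, path)

result = asyncio.run(transcribe_async("meeting.wav"))
```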
Best Practices
Choosing the Right Method
- Batch Transcription: Best for pre-recorded audio files, podcasts, or any non-real-time use case
- Real-time Streaming: Best for live conversations, voice assistants, or applications requiring immediate feedback
- Use high-quality audio files for better transcription accuracy
- Minimize background noise
- Ensure clear speech with good volume levels
- Use appropriate sample rates (16kHz or higher recommended)
- For WebSocket streaming, use PCM format: pcm_s16le_16000
- Consider file size limits for uploads
- For long audio files, consider splitting into smaller chunks
- Use streaming for real-time applications when available
- Works best with clear audio and distinct speakers
- Speakers are labeled as SPEAKER_00, SPEAKER_01, etc.
- Use the verbose_json format to get segment-level speaker information
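Putting the last two points together, a verbose_json diarized response can be rendered as one line per speaker turn. The segment field names below are assumptions about the verbose_json shape:

```python
# Assumed verbose_json shape: a "segments" list whose entries carry
# SPEAKER_00-style labels plus start/end offsets in seconds.
sample = {
    "text": "Hello. Hi there.",
    "segments": [
        {"start": 0.0, "end": 1.2, "speaker": "SPEAKER_00", "text": "Hello."},
        {"start": 1.4, "end": 2.5, "speaker": "SPEAKER_01", "text": "Hi there."},
    ],
}

def lines_by_speaker(response):
    """Render one line per segment, prefixed with its speaker label."""
    return ["[%s %.1f-%.1f] %s" % (s["speaker"], s["start"], s["end"], s["text"])
            for s in response.get("segments", [])]

lines = lines_by_speaker(sample)
```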
- Explore our API Reference for detailed parameter documentation
- Learn about Text-to-Speech for the reverse operation
- Check out our Real-time Audio Transcription App guide