Together AI hosts speech recognition models, including OpenAI’s Whisper and Voxtral, for batch transcription and real-time streaming.

Documentation Index
Fetch the complete documentation index at: https://docs.together.ai/llms.txt
Use this file to discover all available pages before exploring further.
Quickstart
Basic transcription and translation are covered in the sections below.

Available models
For the current list of speech-to-text models, see the serverless catalog or the dedicated endpoint model catalog.

Audio transcription
Audio transcription is speech-to-text in the same language as the source audio. Supported file formats:
- .wav (audio/wav)
- .mp3 (audio/mpeg)
- .m4a (audio/mp4)
- .webm (audio/webm)
- .flac (audio/flac)
- .ogg (audio/ogg)
- .opus (audio/opus)
- .aac (audio/aac)
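A basic transcription call can be sketched as follows. This is a hedged sketch, not the official example: it assumes the Python SDK exposes an OpenAI-style `client.audio.transcriptions.create` method that reads `TOGETHER_API_KEY` from the environment, uses the `openai/whisper-large-v3` model named later on this page, and wraps the call in a hypothetical `transcribe` helper; the format table mirrors the list above.

```python
from pathlib import Path

# MIME types for the supported containers listed above.
SUPPORTED_FORMATS = {
    ".wav": "audio/wav", ".mp3": "audio/mpeg", ".m4a": "audio/mp4",
    ".webm": "audio/webm", ".flac": "audio/flac", ".ogg": "audio/ogg",
    ".opus": "audio/opus", ".aac": "audio/aac",
}

def transcribe(path: str) -> str:
    """Transcribe a local audio file and return the recognized text."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported audio format: {suffix}")
    # Imported lazily so the format check above works without the SDK installed.
    from together import Together  # assumed SDK client class
    client = Together()  # reads TOGETHER_API_KEY from the environment
    result = client.audio.transcriptions.create(
        file=Path(path),
        model="openai/whisper-large-v3",
    )
    return result.text
```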
Input methods
Path object
Pass a local file path (a pathlib.Path object) directly on file=.
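A sketch of the Path-object input method, under stated assumptions: the SDK's `Together` client and an OpenAI-style `audio.transcriptions.create` endpoint; `transcribe_path` and `meeting.wav` are hypothetical names.

```python
from pathlib import Path

def transcribe_path(audio: Path) -> str:
    from together import Together  # assumed SDK client; imported lazily
    client = Together()
    response = client.audio.transcriptions.create(
        file=audio,  # the Path object is passed directly on file=
        model="openai/whisper-large-v3",
    )
    return response.text

audio = Path("meeting.wav")  # hypothetical local recording
# text = transcribe_path(audio)  # requires TOGETHER_API_KEY and the file
```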
File-like object
Pass an open binary file handle (for example, the result of open(path, "rb")) on file=.
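A sketch of the file-like-object input method, under the same assumptions as above (`transcribe_fileobj` is a hypothetical helper):

```python
from typing import BinaryIO

def transcribe_fileobj(fh: BinaryIO) -> str:
    from together import Together  # assumed SDK client; imported lazily
    client = Together()
    response = client.audio.transcriptions.create(
        file=fh,  # any open binary file-like object
        model="openai/whisper-large-v3",
    )
    return response.text

# Typical usage with a real recording:
# with open("meeting.wav", "rb") as fh:
#     print(transcribe_fileobj(fh))
```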
Remote URL
The Python SDK doesn’t accept a string URL on file=. To transcribe a remote file, download it first or use the CLI.
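The download-first path can be sketched in Python like this. `url_suffix` and `transcribe_url` are hypothetical helpers, the `.mp3` fallback extension is an assumption, and the SDK call is the same assumed OpenAI-style endpoint as above:

```python
import tempfile
import urllib.parse
import urllib.request
from pathlib import Path

def url_suffix(url: str) -> str:
    """File extension from the URL's path, used to name the temp file."""
    # Fall back to .mp3 when the URL has no extension (assumption).
    return Path(urllib.parse.urlparse(url).path).suffix or ".mp3"

def transcribe_url(url: str) -> str:
    """Download a remote audio file, then transcribe the local copy."""
    with tempfile.NamedTemporaryFile(suffix=url_suffix(url), delete=False) as tmp:
        with urllib.request.urlopen(url) as resp:
            tmp.write(resp.read())
    from together import Together  # assumed SDK client; imported lazily
    client = Together()
    result = client.audio.transcriptions.create(
        file=Path(tmp.name),
        model="openai/whisper-large-v3",
    )
    return result.text
```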
Language support
Specify the audio language using ISO 639-1 language codes:
- "en": English
- "es": Spanish
- "fr": French
- "de": German
- "ja": Japanese
- "zh": Chinese
- "auto": Auto-detect (default)
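Passing a code on the language parameter might look like this (a sketch: `transcribe_in` is a hypothetical helper, the SDK surface is assumed as above, and the code table mirrors the list on this page):

```python
from pathlib import Path

# ISO 639-1 codes documented above; the service may accept others.
LANGUAGE_CODES = {
    "en": "English", "es": "Spanish", "fr": "French",
    "de": "German", "ja": "Japanese", "zh": "Chinese",
    "auto": "Auto-detect (default)",
}

def transcribe_in(path: str, language: str = "auto") -> str:
    from together import Together  # assumed SDK client; imported lazily
    client = Together()
    result = client.audio.transcriptions.create(
        file=Path(path),
        model="openai/whisper-large-v3",
        language=language,  # e.g. "es" for Spanish source audio
    )
    return result.text
```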
Custom prompts
Use prompts to improve transcription accuracy for specific contexts. Prompts are supported only on Whisper-family models (for example, openai/whisper-large-v3). Other STT models (for example, nvidia/parakeet-tdt-0.6b-v3) accept the field for API compatibility but ignore it.

Next steps
- Streaming transcription: real-time WebSocket transcription for low-latency applications.
- Audio translation: translate speech in any language to English text.
- Transcription features: speaker diarization, word-level timestamps, response formats, async support, and best practices.