Speaker diarization
Enable diarization to identify who is speaking when. If you know the expected speaker count, pass min_speakers and max_speakers to improve accuracy.
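As an illustration of the speaker-count hints, here is a minimal sketch that assembles and sanity-checks the diarization part of a request. The helper `diarization_params` and the `diarize` flag name are assumptions made for this example; only `min_speakers` and `max_speakers` come from the text above, so check the API reference for the exact request shape.

```python
from typing import Any, Dict, Optional


def diarization_params(
    min_speakers: Optional[int] = None,
    max_speakers: Optional[int] = None,
) -> Dict[str, Any]:
    """Build diarization options for a transcription request (hypothetical helper)."""
    for name, value in (("min_speakers", min_speakers), ("max_speakers", max_speakers)):
        if value is not None and value < 1:
            raise ValueError(f"{name} must be at least 1, got {value}")
    if (
        min_speakers is not None
        and max_speakers is not None
        and min_speakers > max_speakers
    ):
        raise ValueError("min_speakers cannot exceed max_speakers")

    params: Dict[str, Any] = {"diarize": True}  # assumed flag name
    if min_speakers is not None:
        params["min_speakers"] = min_speakers
    if max_speakers is not None:
        params["max_speakers"] = max_speakers
    return params


# A two-person interview: pin both bounds to the known speaker count.
print(diarization_params(min_speakers=2, max_speakers=2))
```

Validating the bounds locally catches an inverted range before the request is sent.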
Word-level timestamps
Get word-level timing information.
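Word timings are often post-processed into captions. The sketch below assumes each word entry is a dictionary with `word`, `start`, and `end` keys (times in seconds); those field names are an assumption for illustration, not a documented response shape.

```python
from typing import Dict, List, Union

Word = Dict[str, Union[str, float]]


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def words_to_captions(words: List[Word], max_words: int = 5) -> List[str]:
    """Group word timings into caption lines of at most `max_words` words."""
    lines = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        start = format_timestamp(float(group[0]["start"]))
        end = format_timestamp(float(group[-1]["end"]))
        text = " ".join(str(w["word"]) for w in group)
        lines.append(f"{start} --> {end}  {text}")
    return lines


# Hypothetical word entries, shaped as assumed above.
sample = [
    {"word": "Hello", "start": 0.0, "end": 0.42},
    {"word": "world", "start": 0.5, "end": 0.9},
    {"word": "again", "start": 1.1, "end": 1.6},
]
for line in words_to_captions(sample, max_words=2):
    print(line)
```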
Response formats
JSON format (default)
Returns only the transcribed/translated text.
Verbose JSON format
Returns detailed information including timestamps.
Advanced features
Temperature control
Adjust randomness in the output (0.0 = deterministic, 1.0 = creative).
Async support
All transcription and translation operations support async/await.
Async transcription
Async translation
Concurrent processing
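A common pattern is to fan out requests with `asyncio.gather` while capping how many run at once, so a large batch is less likely to hit rate limits. The sketch below is a generic pattern, not the SDK's own API: `transcribe_all` and `fake_transcribe` are hypothetical names, and `fake_transcribe` stands in for whatever async transcription call your client exposes.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def transcribe_all(
    paths: List[str],
    transcribe: Callable[[str], Awaitable[str]],
    max_concurrency: int = 4,
) -> Dict[str, str]:
    """Run `transcribe` over every path, at most `max_concurrency` at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(path: str) -> str:
        async with semaphore:
            return await transcribe(path)

    # gather returns results in input order, so they line up with `paths`.
    results = await asyncio.gather(*(worker(p) for p in paths))
    return dict(zip(paths, results))


async def fake_transcribe(path: str) -> str:
    """Stand-in for a real async transcription call (hypothetical)."""
    await asyncio.sleep(0.01)
    return f"transcript of {path}"


files = ["a.wav", "b.wav", "c.wav"]
print(asyncio.run(transcribe_all(files, fake_transcribe, max_concurrency=2)))
```

To use it with a real client, pass a coroutine function that takes a file path and returns the transcript text in place of `fake_transcribe`.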
Process multiple audio files concurrently.
Best practices
Choose the right method
- Batch transcription: Best for pre-recorded audio files, podcasts, or any non-real-time use case.
- Real-time streaming: Best for live conversations, voice assistants, or applications requiring immediate feedback.
Audio quality tips
- Use high-quality audio files for better transcription accuracy.
- Minimize background noise.
- Ensure clear speech with good volume levels.
- Use appropriate sample rates (16kHz or higher recommended).
- For WebSocket streaming, use PCM format: pcm_s16le_16000.
- Direct uploads are capped at 500 MB and 4 hours of audio per request; see Limits.
- For long audio files (over 4 hours), split into ≤ 4 h chunks before submitting.
- Use streaming for real-time applications when available.
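The advice to split long recordings into chunks of at most 4 hours can be sketched as follows. This example only computes chunk boundaries and builds `ffmpeg` command lines; it does not run them. The helper names and the `chunk_NNN` output naming are assumptions for illustration, and `ffmpeg` must be installed separately. Fixed offsets may cut mid-word, so you may prefer to cut at silences.

```python
from pathlib import Path
from typing import List, Tuple

MAX_CHUNK_SECONDS = 4 * 60 * 60  # the 4 hour per-request cap mentioned above


def chunk_ranges(
    duration_s: float, max_chunk_s: float = MAX_CHUNK_SECONDS
) -> List[Tuple[float, float]]:
    """Return (start, end) offsets in seconds that cover the whole recording."""
    duration_s = float(duration_s)
    if duration_s <= 0 or max_chunk_s <= 0:
        raise ValueError("duration and chunk length must be positive")
    ranges = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_chunk_s, duration_s)
        ranges.append((start, end))
        start = end
    return ranges


def ffmpeg_split_commands(src: str, duration_s: float) -> List[List[str]]:
    """Build one ffmpeg invocation per chunk, copying streams without re-encoding."""
    suffix = Path(src).suffix  # keep the source container for stream copy
    commands = []
    for index, (start, end) in enumerate(chunk_ranges(duration_s)):
        commands.append([
            "ffmpeg", "-ss", str(start), "-i", src,
            "-t", str(end - start), "-c", "copy",
            f"chunk_{index:03d}{suffix}",  # hypothetical output naming
        ])
    return commands


# A 10 hour recording becomes three chunks: 4 h, 4 h, and 2 h.
for cmd in ffmpeg_split_commands("talk.mp3", 10 * 3600):
    print(" ".join(cmd))
```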
Diarization best practices
- Works best with clear audio and distinct speakers.
- Speakers are labeled as SPEAKER_00, SPEAKER_01, etc.
- Use with the verbose_json format to get segment-level speaker information.
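Segment-level speaker labels are usually easier to read once consecutive segments from the same speaker are merged into turns. The sketch below assumes each segment is a dictionary with `speaker`, `start`, `end`, and `text` keys; that layout is an assumption for illustration, and only the SPEAKER_00-style labels come from the text above.

```python
from typing import Dict, List, Union

Segment = Dict[str, Union[str, float]]


def merge_speaker_turns(segments: List[Segment]) -> List[Segment]:
    """Collapse consecutive segments from the same speaker into one turn."""
    turns: List[Segment] = []
    for seg in segments:
        text = str(seg["text"]).strip()
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] = f'{turns[-1]["text"]} {text}'
        else:
            turns.append({
                "speaker": seg["speaker"],
                "start": seg["start"],
                "end": seg["end"],
                "text": text,
            })
    return turns


# Hypothetical segments, shaped as assumed above.
segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.1, "text": "Thanks for joining."},
    {"speaker": "SPEAKER_00", "start": 2.1, "end": 4.0, "text": "Shall we start?"},
    {"speaker": "SPEAKER_01", "start": 4.2, "end": 5.0, "text": "Sure."},
]
for turn in merge_speaker_turns(segments):
    print(f'{turn["speaker"]} [{turn["start"]}-{turn["end"]}]: {turn["text"]}')
```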
Errors and troubleshooting
| Response | Meaning | Recommended action |
|---|---|---|
| 400 audio_too_long | Audio duration exceeds the 4 hour cap. | Split the file into ≤ 4 h segments and submit separately. |
| 400 file_too_large | A URL-fetched audio download exceeded the 1 GB server-side cap. | Compress the source, or split into smaller files. |
| 400 unsupported_format | The audio container or codec could not be decoded. | Re-encode to a supported format. Run ffprobe on the file to confirm it is valid audio. |
| 400 invalid_params | Request parameters failed validation. | Check the API reference. |
| 413 Payload Too Large | A direct upload exceeded the 500 MB edge limit. | Submit the file via an HTTPS URL on the file field instead, or split the file. The 413 response is plain HTML, not JSON. |
| 429 | Rate limit exceeded. | See serverless rate limits. |
| 500 processing_failed | Internal decode failure after the file was accepted. | Verify the file is valid audio with ffprobe. If it is, contact support with the response id. |
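The table above can be turned into a small dispatch helper. This is a sketch under assumptions: the `{"error": {"code": ...}}` body layout and both function names are hypothetical, so adapt the parsing to the actual error payload. It does reflect one documented detail: the 413 response is plain HTML, so the body is never assumed to be JSON.

```python
import json
from typing import Optional

# Error codes and actions taken from the table above.
ACTIONS_BY_CODE = {
    "audio_too_long": "split into segments of at most 4 hours",
    "file_too_large": "compress the source or split into smaller files",
    "unsupported_format": "re-encode to a supported format",
    "invalid_params": "check parameters against the API reference",
    "processing_failed": "verify the file with ffprobe, then contact support",
}


def extract_error_code(body: str) -> Optional[str]:
    """Return the error code from a JSON body, or None if the body is not JSON.

    The {"error": {"code": ...}} layout is an assumption for this sketch.
    """
    try:
        payload = json.loads(body)
    except ValueError:
        return None  # for example, the plain-HTML 413 page
    if not isinstance(payload, dict):
        return None
    error = payload.get("error")
    if isinstance(error, dict) and isinstance(error.get("code"), str):
        return error["code"]
    return None


def recommended_action(status: int, body: str) -> str:
    """Map an HTTP status and response body to the action suggested in the table."""
    if status == 413:
        return "submit via HTTPS URL or split the file"
    if status == 429:
        return "back off and retry later"
    code = extract_error_code(body)
    return ACTIONS_BY_CODE.get(code or "", "inspect the response and check the API reference")


print(recommended_action(400, '{"error": {"code": "audio_too_long"}}'))
print(recommended_action(413, "<html><body>Payload Too Large</body></html>"))
```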
Next steps
- See the API reference for detailed parameter documentation.
- Learn about text-to-speech for the reverse operation.
- Check out the real-time audio transcription app guide.