Quickstart
Basic transcription and translation:
Available models
The following speech-to-text models are available:

| Organization | Model | Model string for API | Serverless | Dedicated |
|---|---|---|---|---|
| OpenAI | Whisper Large v3 | openai/whisper-large-v3 | ✅ | ✅ |
| NVIDIA | Parakeet TDT 0.6B v3 | nvidia/parakeet-tdt-0.6b-v3 | ✅ | ✅ |
| NVIDIA | Nemotron 3 ASR Streaming 0.6B | nvidia/nemotron-3-asr-streaming-0.6b | ✅ | ✅ |
| NVIDIA | Nemotron 3.5 ASR Streaming 0.6B | nvidia/nemotron-3.5-asr-streaming-0.6b | ✅ | ✅ |
| Deepgram | Nova-3 (English) | deepgram/nova-3-en | ❌ | ✅ |
| Deepgram | Nova-3 Multilingual | deepgram/nova-3-multi | ❌ | ✅ |
| Deepgram | Flux | deepgram/flux | ❌ | ✅ |
Limits
| Limit | Value | Notes |
|---|---|---|
| Max request size (direct upload) | 500 MB | Requests above this are rejected at the edge with HTTP 413 Payload Too Large. For anything larger, host the audio at a public HTTPS URL and pass that URL as the file field instead. |
| Max file size (URL fetch) | 1 GB | When you submit an HTTPS URL instead of binary, the server downloads up to 1 GB. Larger downloads fail with 400 file_too_large. |
| Max audio duration | 4 hours per request | Longer audio is rejected with 400 audio_too_long. Split into ≤ 4 h segments and submit separately. |
| Supported formats | .wav, .mp3, .m4a, .webm, .flac, .ogg, .opus, .aac | |

To submit audio larger than 500 MB, pass a public HTTPS URL in the file field instead of a binary upload. The 500 MB edge cap applies only to direct uploads. See Errors and troubleshooting for the full list of error codes.
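
The limits above can be checked on the client before a request is sent. The following is a minimal sketch, not part of any SDK: the function name `plan_submission` is hypothetical, and it assumes decimal units (1 MB = 1,000,000 bytes), which the table does not specify.

```python
# Assumption: decimal megabytes. The limits table does not say MB vs MiB.
MB = 1000 * 1000
GB = 1000 * MB

MAX_UPLOAD_BYTES = 500 * MB       # direct upload cap (HTTP 413 above this)
MAX_URL_FETCH_BYTES = 1 * GB      # URL fetch cap (400 file_too_large above this)
MAX_DURATION_SECONDS = 4 * 3600   # 4 hours (400 audio_too_long above this)


def plan_submission(size_bytes: int, duration_seconds: float) -> str:
    """Return "upload" or "url", or raise ValueError if the audio cannot be sent as-is.

    Hypothetical helper: it mirrors the documented limits but is not an official check.
    The 500 MB cap covers the whole request, so files very close to the cap may
    still be rejected once multipart overhead is added.
    """
    if duration_seconds > MAX_DURATION_SECONDS:
        raise ValueError("audio_too_long: split into segments of 4 hours or less")
    if size_bytes <= MAX_UPLOAD_BYTES:
        return "upload"
    if size_bytes <= MAX_URL_FETCH_BYTES:
        return "url"
    raise ValueError("file_too_large: exceeds the 1 GB URL-fetch cap")
```

A 600 MB file of one hour would come back as `"url"`, meaning it should be hosted at a public HTTPS URL rather than uploaded directly.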
Audio transcription
Audio transcription is speech-to-text in the same language as the source audio.

Supported formats and MIME types:
- .wav (audio/wav)
- .mp3 (audio/mpeg)
- .m4a (audio/mp4)
- .webm (audio/webm)
- .flac (audio/flac)
- .ogg (audio/ogg)
- .opus (audio/opus)
- .aac (audio/aac)
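
When building a request by hand, the extension-to-MIME pairs listed above can be kept in a lookup table. This is a small sketch; `audio_mime_type` is a hypothetical helper name, and the mapping contains only the pairs given in this section.

```python
from pathlib import Path

# Extension to MIME type, copied from the supported formats in this section.
AUDIO_MIME_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mpeg",
    ".m4a": "audio/mp4",
    ".webm": "audio/webm",
    ".flac": "audio/flac",
    ".ogg": "audio/ogg",
    ".opus": "audio/opus",
    ".aac": "audio/aac",
}


def audio_mime_type(filename: str) -> str:
    """Return the MIME type for a supported audio file, or raise ValueError."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match on the extension
    try:
        return AUDIO_MIME_TYPES[suffix]
    except KeyError:
        raise ValueError(f"unsupported audio format: {suffix or filename!r}") from None
```

Rejecting unsupported extensions locally avoids uploading a large file only to have it refused.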
Audio limits
The same limits apply to both /v1/audio/transcriptions and /v1/audio/translations:
- Maximum duration: 4 hours. Longer audio is rejected with an audio_too_long error.
- Binary uploads: Capped at 500 MB. Larger uploads return HTTP 413. Submit the audio via an HTTPS URL on the file field instead.
- URL-fetched audio: Capped at 1 GB and 4 hours when you pass a public HTTPS URL as file.

Send the model form field before the file field in the multipart body so the server can dispatch the request without buffering the full audio payload.
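
To make the field ordering concrete, here is a sketch that assembles a multipart body by hand with the model part ahead of the file part. It uses only the standard library; `build_multipart` is a hypothetical helper, and most HTTP clients will generate an equivalent body for you.

```python
import uuid


def build_multipart(model: str, filename: str, audio: bytes, mime: str) -> tuple[bytes, str]:
    """Return (body, content_type) with the model field placed before the file field.

    Assumes the filename contains no double quotes or line breaks.
    """
    boundary = uuid.uuid4().hex
    delimiter = b"--" + boundary.encode()
    parts = [
        # Part 1: the model field, sent first so the server can route early.
        delimiter,
        b'Content-Disposition: form-data; name="model"',
        b"",
        model.encode(),
        # Part 2: the audio payload.
        delimiter,
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        f"Content-Type: {mime}".encode(),
        b"",
        audio,
        # Closing delimiter, followed by a final CRLF.
        delimiter + b"--",
        b"",
    ]
    body = b"\r\n".join(parts)
    return body, f"multipart/form-data; boundary={boundary}"
```

The returned content type goes in the `Content-Type` request header. If you use an HTTP library instead, check the encoded body or the library's documentation to confirm that it preserves this order.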
Input methods
Path object
Python
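
A minimal sketch of passing a `pathlib.Path` as `file=`. The call shape `client.audio.transcriptions.create(...)` is an assumption modeled on OpenAI-style SDKs and is not confirmed by this page, so check your SDK reference for the exact method. `transcribe_path` is a hypothetical wrapper.

```python
from pathlib import Path


def transcribe_path(client, audio_path, model="openai/whisper-large-v3"):
    """Send a local audio file to the transcription API by path.

    `client` is any object exposing audio.transcriptions.create(file=..., model=...).
    That method name is an assumption, not taken from this page.
    """
    # Normalize to a Path so str and Path inputs behave the same.
    path = Path(audio_path)
    return client.audio.transcriptions.create(file=path, model=model)
```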
File-like object
Python
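
A minimal sketch of passing an open binary file handle as `file=`. As with any example here that calls the SDK, the method `client.audio.transcriptions.create(...)` is an assumed, OpenAI-style call shape rather than something this page confirms. `transcribe_file_object` is a hypothetical wrapper.

```python
def transcribe_file_object(client, audio_path, model="openai/whisper-large-v3"):
    """Send a local audio file to the transcription API as a file-like object.

    `client` is any object exposing audio.transcriptions.create(file=..., model=...).
    That method name is an assumption, not taken from this page.
    """
    # Open in binary mode; the context manager closes the handle once the call returns.
    with open(audio_path, "rb") as audio_file:
        return client.audio.transcriptions.create(file=audio_file, model=model)
```

A file-like object is useful when the audio is already open elsewhere in your program, or when it lives in memory (for example, an `io.BytesIO` buffer) rather than on disk.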
Remote URL
The Python SDK doesn’t accept a string URL on file=. To transcribe a remote file, download it first.
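
One way to download the file first is sketched below using only the standard library. `download_audio` is a hypothetical helper; it saves the remote audio to a temporary file and returns the local path, which can then be passed as `file=`.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def download_audio(url: str) -> Path:
    """Download remote audio to a temporary file and return its local path.

    The caller is responsible for deleting the file when finished.
    """
    # Keep the original extension so the audio format stays recognizable.
    suffix = Path(urlparse(url).path).suffix
    with urllib.request.urlopen(url) as response:
        with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
            # Stream in chunks rather than loading the whole file into memory.
            shutil.copyfileobj(response, tmp)
            return Path(tmp.name)
```

Because the downloaded file is then sent as a binary upload, the 500 MB direct-upload cap applies to it.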
Language support
Specify the audio language using ISO 639-1 language codes:
- "en": English
- "es": Spanish
- "fr": French
- "de": German
- "ja": Japanese
- "zh": Chinese
- "auto": Auto-detect (default)
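
A small sketch of normalizing a user-supplied language value before sending it. `normalize_language` is a hypothetical helper. It only checks the shape of the code (two letters, or "auto") and does not verify that the code is a real ISO 639-1 entry or that a given model supports it.

```python
import re

# Two lowercase ASCII letters: the shape of an ISO 639-1 code.
_ISO_639_1_SHAPE = re.compile(r"[a-z]{2}")


def normalize_language(code=None):
    """Return a lowercase language code, defaulting to "auto" when none is given."""
    if code is None:
        return "auto"  # documented default: auto-detect
    cleaned = code.strip().lower()
    if cleaned == "auto" or _ISO_639_1_SHAPE.fullmatch(cleaned):
        return cleaned
    raise ValueError(f"expected an ISO 639-1 code or 'auto', got {code!r}")
```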
Custom prompts
Use prompts to improve transcription accuracy for specific contexts.

Prompts are supported only on Whisper-family models (for example, openai/whisper-large-v3). Other STT models (for example, nvidia/parakeet-tdt-0.6b-v3) accept the field for API compatibility but ignore it.

Next steps
- Streaming transcription: real-time WebSocket transcription for low-latency applications.
- Audio translation: translate speech in any language to English text.
- Transcription features: speaker diarization, word-level timestamps, response formats, async support, and best practices.