Together AI hosts speech recognition models, including OpenAI’s Whisper and Voxtral, for batch transcription and real-time streaming.

Documentation Index
Fetch the complete documentation index at: https://docs.together.ai/llms.txt
Use this file to discover all available pages before exploring further.
Quickstart
Basic transcription and translation are covered in the sections below.

Available models
For the current list of speech-to-text models, see the serverless catalog or the dedicated endpoint model catalog.

Audio transcription
Audio transcription is speech-to-text in the same language as the source audio. Supported file formats:
- .wav (audio/wav)
- .mp3 (audio/mpeg)
- .m4a (audio/mp4)
- .webm (audio/webm)
- .flac (audio/flac)
- .ogg (audio/ogg)
- .opus (audio/opus)
- .aac (audio/aac)
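A basic transcription call can be sketched as follows. This is a hedged sketch, not the official example: it assumes the Python SDK exposes an OpenAI-style `client.audio.transcriptions.create` method that reads `TOGETHER_API_KEY` from the environment, uses the `openai/whisper-large-v3` model named later on this page, and wraps the call in a hypothetical `transcribe` helper; the format table mirrors the list above.

```python
from pathlib import Path

# MIME types for the supported containers listed above.
SUPPORTED_FORMATS = {
    ".wav": "audio/wav", ".mp3": "audio/mpeg", ".m4a": "audio/mp4",
    ".webm": "audio/webm", ".flac": "audio/flac", ".ogg": "audio/ogg",
    ".opus": "audio/opus", ".aac": "audio/aac",
}

def transcribe(path: str) -> str:
    """Transcribe a local audio file and return the recognized text."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported audio format: {suffix}")
    # Imported lazily so the format check above works without the SDK installed.
    from together import Together  # assumed SDK client class
    client = Together()  # reads TOGETHER_API_KEY from the environment
    result = client.audio.transcriptions.create(
        file=Path(path),
        model="openai/whisper-large-v3",
    )
    return result.text
```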
Input methods
Path object
Pass a local file path (a pathlib.Path object) directly on file=.
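A sketch of the Path-object input method, under stated assumptions: the SDK's `Together` client and an OpenAI-style `audio.transcriptions.create` endpoint; `transcribe_path` and `meeting.wav` are hypothetical names.

```python
from pathlib import Path

def transcribe_path(audio: Path) -> str:
    from together import Together  # assumed SDK client; imported lazily
    client = Together()
    response = client.audio.transcriptions.create(
        file=audio,  # the Path object is passed directly on file=
        model="openai/whisper-large-v3",
    )
    return response.text

audio = Path("meeting.wav")  # hypothetical local recording
# text = transcribe_path(audio)  # requires TOGETHER_API_KEY and the file
```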
File-like object
Pass an open binary file handle (for example, the result of open(path, "rb")) on file=.
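A sketch of the file-like-object input method, under the same assumptions as above (`transcribe_fileobj` is a hypothetical helper):

```python
from typing import BinaryIO

def transcribe_fileobj(fh: BinaryIO) -> str:
    from together import Together  # assumed SDK client; imported lazily
    client = Together()
    response = client.audio.transcriptions.create(
        file=fh,  # any open binary file-like object
        model="openai/whisper-large-v3",
    )
    return response.text

# Typical usage with a real recording:
# with open("meeting.wav", "rb") as fh:
#     print(transcribe_fileobj(fh))
```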
Remote URL
The Python SDK doesn’t accept a string URL on file=. To transcribe a remote file, download it first or use the CLI.
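The download-first path can be sketched in Python like this. `url_suffix` and `transcribe_url` are hypothetical helpers, the `.mp3` fallback extension is an assumption, and the SDK call is the same assumed OpenAI-style endpoint as above:

```python
import tempfile
import urllib.parse
import urllib.request
from pathlib import Path

def url_suffix(url: str) -> str:
    """File extension from the URL's path, used to name the temp file."""
    # Fall back to .mp3 when the URL has no extension (assumption).
    return Path(urllib.parse.urlparse(url).path).suffix or ".mp3"

def transcribe_url(url: str) -> str:
    """Download a remote audio file, then transcribe the local copy."""
    with tempfile.NamedTemporaryFile(suffix=url_suffix(url), delete=False) as tmp:
        with urllib.request.urlopen(url) as resp:
            tmp.write(resp.read())
    from together import Together  # assumed SDK client; imported lazily
    client = Together()
    result = client.audio.transcriptions.create(
        file=Path(tmp.name),
        model="openai/whisper-large-v3",
    )
    return result.text
```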
Language support
Specify the audio language using ISO 639-1 language codes:
- "en": English
- "es": Spanish
- "fr": French
- "de": German
- "ja": Japanese
- "zh": Chinese
- "auto": Auto-detect (default)
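Passing a code on the language parameter might look like this (a sketch: `transcribe_in` is a hypothetical helper, the SDK surface is assumed as above, and the code table mirrors the list on this page):

```python
from pathlib import Path

# ISO 639-1 codes documented above; the service may accept others.
LANGUAGE_CODES = {
    "en": "English", "es": "Spanish", "fr": "French",
    "de": "German", "ja": "Japanese", "zh": "Chinese",
    "auto": "Auto-detect (default)",
}

def transcribe_in(path: str, language: str = "auto") -> str:
    from together import Together  # assumed SDK client; imported lazily
    client = Together()
    result = client.audio.transcriptions.create(
        file=Path(path),
        model="openai/whisper-large-v3",
        language=language,  # e.g. "es" for Spanish source audio
    )
    return result.text
```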
Custom prompts
Use prompts to improve transcription accuracy for specific contexts. Prompts are supported only on Whisper-family models (for example, openai/whisper-large-v3). Other STT models (for example, nvidia/parakeet-tdt-0.6b-v3) accept the field for API compatibility but ignore it.

Next steps
- Streaming transcription: real-time WebSocket transcription for low-latency applications.
- Audio translation: translate speech in any language to English text.
- Transcription features: speaker diarization, word-level timestamps, response formats, async support, and best practices.