import os

from together import Together

client = Together(
    api_key=os.environ.get("TOGETHER_API_KEY"),
)

with open("audio.wav", "rb") as file:
    response = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=file,
    )

print(response.text)
{
  "text": "Hello, world!"
}

Transcribes audio into text.
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
- File: Audio file upload or public HTTP/HTTPS URL. Supported formats: .wav, .mp3, .m4a, .webm, .flac.
- Model: Model to use for transcription. Default: openai/whisper-large-v3
- Language: Optional ISO 639-1 language code. If auto is provided, the language is auto-detected. Default: "en"
- Prompt: Optional text to bias decoding.
- Response format: The format of the response. One of json, verbose_json.
- Temperature: Sampling temperature between 0.0 and 1.0 (0 <= x <= 1). Default: 0
- Timestamp granularities: Controls the level of timestamp detail in verbose_json. Only used when response_format is verbose_json. Can be a single granularity (segment or word) or an array to get multiple levels, e.g. ["word", "segment"].
- Speaker diarization: Whether to enable speaker diarization. When enabled, each entry in the words array includes a speaker_id. The response also includes a speaker_segments array: each segment contains a speaker_id, the start and end time of the segment, and all the words in the segment.
For example:

"speaker_segments": [
  {
    "speaker_id": "SPEAKER_00",
    "start": 0,
    "end": 30.02,
    "words": [
      {
        "id": 0,
        "word": "Tijana",
        "start": 0,
        "end": 11.475,
        "speaker_id": "SPEAKER_00"
      },
      ...
    ]
  }
]
- Minimum speakers: Minimum number of speakers expected in the audio. Used to improve diarization accuracy when the approximate number of speakers is known.
- Maximum speakers: Maximum number of speakers expected in the audio. Used to improve diarization accuracy when the approximate number of speakers is known.
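As a rough illustration of the diarized response shape described above (the sample payload below is hypothetical, built only from the speaker_segments fields shown: speaker_id, start, end, words), the per-word speaker ids can be grouped back into a readable transcript:

```python
# Group a diarized transcription into one line per speaker segment.
# The payload shape is assumed from the speaker_segments example above;
# this is illustrative data, not real API output.
sample = {
    "speaker_segments": [
        {
            "speaker_id": "SPEAKER_00",
            "start": 0,
            "end": 2.1,
            "words": [
                {"id": 0, "word": "Hello,", "start": 0, "end": 0.8, "speaker_id": "SPEAKER_00"},
                {"id": 1, "word": "world!", "start": 0.9, "end": 2.1, "speaker_id": "SPEAKER_00"},
            ],
        },
        {
            "speaker_id": "SPEAKER_01",
            "start": 2.5,
            "end": 3.4,
            "words": [
                {"id": 2, "word": "Hi!", "start": 2.5, "end": 3.4, "speaker_id": "SPEAKER_01"},
            ],
        },
    ],
}

def transcript_by_speaker(payload):
    """Join each segment's words into a 'SPEAKER: text' line."""
    lines = []
    for seg in payload["speaker_segments"]:
        text = " ".join(w["word"] for w in seg["words"])
        lines.append(f'{seg["speaker_id"]}: {text}')
    return lines

print(transcript_by_speaker(sample))
```

Passing minimum/maximum speaker hints narrows the clustering the diarizer performs, which is why they help when the speaker count is roughly known.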
OK

- text: The transcribed text. Example: "Hello, world!"