import os

from together import Together

client = Together(
    api_key=os.environ.get("TOGETHER_API_KEY"),
)

with open("audio.wav", "rb") as file:
    response = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=file,
    )

print(response.text)
{
  "text": "Hello, world!"
}

Transcribes audio into text.
Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
- File: Audio file upload or public HTTP/HTTPS URL. Supported formats: .wav, .mp3, .m4a, .webm, .flac.
- Model: Model to use for transcription. Default: openai/whisper-large-v3
- Language: Optional ISO 639-1 language code. If auto is provided, the language is auto-detected. Default: "en"
- Prompt: Optional text to bias decoding.
- Response format: The format of the response. One of json, verbose_json.
- Temperature: Sampling temperature between 0.0 and 1.0 (0 <= x <= 1). Default: 0
- Timestamp granularities: Controls the level of timestamp detail in verbose_json. Only used when response_format is verbose_json. Can be a single granularity (segment or word) or an array to get multiple levels, e.g. ["word", "segment"].
- Speaker diarization: Whether to enable speaker diarization. When enabled, each entry in the words array includes a speaker_id. The response also includes a speaker_segments array: each segment contains a speaker_id, the start and end time of the segment, and all the words in the segment.
For example:

"speaker_segments": [
  {
    "speaker_id": "SPEAKER_00",
    "start": 0,
    "end": 30.02,
    "words": [
      {
        "id": 0,
        "word": "Tijana",
        "start": 0,
        "end": 11.475,
        "speaker_id": "SPEAKER_00"
      },
      ...
    ]
  }
]
- Minimum speakers: Minimum number of speakers expected in the audio. Used to improve diarization accuracy when the approximate number of speakers is known.
- Maximum speakers: Maximum number of speakers expected in the audio. Used to improve diarization accuracy when the approximate number of speakers is known.
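As a rough illustration of the diarized response shape described above (the sample payload below is hypothetical, built only from the speaker_segments fields shown: speaker_id, start, end, words), the per-word speaker ids can be grouped back into a readable transcript:

```python
# Group a diarized transcription into one line per speaker segment.
# The payload shape is assumed from the speaker_segments example above;
# this is illustrative data, not real API output.
sample = {
    "speaker_segments": [
        {
            "speaker_id": "SPEAKER_00",
            "start": 0,
            "end": 2.1,
            "words": [
                {"id": 0, "word": "Hello,", "start": 0, "end": 0.8, "speaker_id": "SPEAKER_00"},
                {"id": 1, "word": "world!", "start": 0.9, "end": 2.1, "speaker_id": "SPEAKER_00"},
            ],
        },
        {
            "speaker_id": "SPEAKER_01",
            "start": 2.5,
            "end": 3.4,
            "words": [
                {"id": 2, "word": "Hi!", "start": 2.5, "end": 3.4, "speaker_id": "SPEAKER_01"},
            ],
        },
    ],
}

def transcript_by_speaker(payload):
    """Join each segment's words into a 'SPEAKER: text' line."""
    lines = []
    for seg in payload["speaker_segments"]:
        text = " ".join(w["word"] for w in seg["words"])
        lines.append(f'{seg["speaker_id"]}: {text}')
    return lines

print(transcript_by_speaker(sample))
```

Passing minimum/maximum speaker hints narrows the clustering the diarizer performs, which is why they help when the speaker count is roughly known.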
OK

- text: The transcribed text. Example: "Hello, world!"