Speaker diarization
Enable diarization to identify who is speaking when. If you know the expected speaker count, pass min_speakers and max_speakers to improve accuracy.
```python
from pathlib import Path

from together import Together

client = Together()

response = client.audio.transcriptions.create(
    file=Path("meeting.mp3"),
    model="openai/whisper-large-v3",
    response_format="verbose_json",
    diarize="true",  # Enable speaker diarization
    min_speakers=1,
    max_speakers=5,
)

# Access speaker segments
print(response.speaker_segments)
```
Example response with diarization:
```python
AudioSpeakerSegment(
    id=1,
    speaker_id='SPEAKER_01',
    start=6.268,
    end=30.776,
    text=(
        "Hello. Oh, hey, Justin. How are you doing? ..."
    ),
    words=[
        AudioTranscriptionWord(
            word='Hello.',
            start=6.268,
            end=11.314,
            id=0,
            speaker_id='SPEAKER_01'
        ),
        AudioTranscriptionWord(
            word='Oh,',
            start=11.834,
            end=11.894,
            id=1,
            speaker_id='SPEAKER_01'
        ),
        AudioTranscriptionWord(
            word='hey,',
            start=11.914,
            end=11.995,
            id=2,
            speaker_id='SPEAKER_01'
        ),
        ...
    ]
)
```
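A diarized response like the one above can be flattened into a readable, speaker-labelled transcript. A minimal sketch, using plain dicts in place of the SDK's AudioSpeakerSegment objects (an illustrative assumption, not the SDK's own types):

```python
def format_transcript(segments):
    """Render diarized segments as 'SPEAKER_XX [start-end]: text' lines,
    merging consecutive segments from the same speaker."""
    merged = []
    for seg in segments:
        if merged and merged[-1]["speaker_id"] == seg["speaker_id"]:
            # Same speaker continues: extend the previous entry.
            merged[-1]["end"] = seg["end"]
            merged[-1]["text"] += " " + seg["text"]
        else:
            merged.append(dict(seg))
    return "\n".join(
        f"{s['speaker_id']} [{s['start']:.2f}s-{s['end']:.2f}s]: {s['text']}"
        for s in merged
    )

# Hypothetical segments shaped like the diarization output above
segments = [
    {"speaker_id": "SPEAKER_01", "start": 6.27, "end": 11.31, "text": "Hello."},
    {"speaker_id": "SPEAKER_01", "start": 11.83, "end": 12.00, "text": "Oh, hey."},
    {"speaker_id": "SPEAKER_00", "start": 12.50, "end": 14.00, "text": "Hi there."},
]
print(format_transcript(segments))
```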
Word-level timestamps
Get word-level timing information:
```python
from pathlib import Path

from together import Together

client = Together()

response = client.audio.transcriptions.create(
    file=Path("audio.mp3"),
    model="openai/whisper-large-v3",
    response_format="verbose_json",
    timestamp_granularities="word",
)

print(f"Text: {response.text}")
print(f"Language: {response.language}")
print(f"Duration: {response.duration}s")

# Access individual words with timestamps
if response.words:
    for word in response.words:
        print(f"'{word['word']}' [{word['start']:.2f}s - {word['end']:.2f}s]")
```
Example output:
```text
Text: It is certain that Jack Pumpkinhead might have had a much finer house to live in.
Language: en
Duration: 7.2562358276643995s
'It' [0.00s - 0.36s]
'is' [0.42s - 0.47s]
'certain' [0.51s - 0.74s]
'that' [0.79s - 0.86s]
'Jack' [0.90s - 1.11s]
'Pumpkinhead' [1.15s - 1.66s]
'might' [1.81s - 2.00s]
'have' [2.04s - 2.13s]
'had' [2.16s - 2.26s]
'a' [2.30s - 2.32s]
'much' [2.36s - 2.48s]
'finer' [2.54s - 2.74s]
'house' [2.78s - 2.93s]
'to' [2.96s - 3.03s]
'live' [3.07s - 3.21s]
'in.' [3.26s - 7.27s]
```
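Word-level timestamps are enough to build simple captions. A rough sketch of SRT-style subtitle generation, again using plain dicts for the word entries as in the loop above (the grouping size is an arbitrary illustrative choice):

```python
def words_to_srt(words, max_words=8):
    """Group word-level timestamps into numbered SRT subtitle blocks."""

    def ts(seconds):
        # Format seconds as the SRT timestamp HH:MM:SS,mmm
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = " ".join(w["word"] for w in chunk)
        blocks.append(
            f"{len(blocks) + 1}\n"
            f"{ts(chunk[0]['start'])} --> {ts(chunk[-1]['end'])}\n"
            f"{text}"
        )
    return "\n\n".join(blocks)

# Hypothetical word entries shaped like response.words above
words = [
    {"word": "It", "start": 0.00, "end": 0.36},
    {"word": "is", "start": 0.42, "end": 0.47},
    {"word": "certain", "start": 0.51, "end": 0.74},
]
print(words_to_srt(words, max_words=2))
```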
The `json` response format returns only the transcribed or translated text:
```python
from pathlib import Path

from together import Together

client = Together()

response = client.audio.transcriptions.create(
    file=Path("audio.mp3"),
    model="openai/whisper-large-v3",
    response_format="json",
)

print(response.text)  # "Hello, this is a test recording."
```
The `verbose_json` response format returns detailed information, including timestamps:
```python
from pathlib import Path

from together import Together

client = Together()

response = client.audio.transcriptions.create(
    file=Path("audio.mp3"),
    model="openai/whisper-large-v3",
    response_format="verbose_json",
    timestamp_granularities="segment",
)

# Access segments with timestamps
for segment in response.segments:
    print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s]: {segment['text']}")
```
Example output:
```text
[0.11s - 10.85s]: Call is now being recorded. Parker Scarves, how may I help you? Online for my wife, and it turns out they shipped the wrong... Oh, I am so sorry, sir. I got it for her birthday, which is tonight, and now I'm not 100% sure what I need to do. Okay, let me see if I can help. Do you have the item number of the Parker Scarves? I don't think so. Call the New Yorker, I... Excellent. What color do...
[10.88s - 21.73s]: Blue. The one they shipped was light blue. I wanted the darker one. What's the difference? The royal blue is a bit brighter. What zip code are you located in? One nine.
[22.04s - 32.62s]: Karen's Boutique, Termall. Is that close? I'm in my office. Okay, um, what is your name, sir? Charlie. Charlie Johnson. Is that J-O-H-N-S-O-N? And Mr. Johnson, do you have the Parker scarf in light blue with you now? I do. They shipped it to my office. It came in not that long ago. What I will do is make arrangements with Karen's Boutique for...
[32.62s - 41.03s]: you to Parker Scarf at no additional cost. And in addition, I was able to look up your order in our system, and I'm going to send out a special gift to you to make up for the inconvenience. Thank you. You're welcome. And thank you for calling Parker Scarf, and I hope your wife enjoys her birthday gift. Thank you. You're very welcome. Goodbye.
[43.50s - 44.20s]: you
```
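Segment timestamps make it easy to locate where something was said in a recording. A small illustrative helper, with plain dicts standing in for the SDK's segment entries:

```python
def find_in_segments(segments, query):
    """Return (start, end, text) for each segment whose text contains
    the query string, case-insensitively."""
    q = query.lower()
    return [
        (seg["start"], seg["end"], seg["text"])
        for seg in segments
        if q in seg["text"].lower()
    ]

# Hypothetical segments shaped like response.segments above
segments = [
    {"start": 0.11, "end": 10.85, "text": "Parker Scarves, how may I help you?"},
    {"start": 10.88, "end": 21.73, "text": "The royal blue is a bit brighter."},
]
for start, end, text in find_in_segments(segments, "royal blue"):
    print(f"[{start:.2f}s - {end:.2f}s]: {text}")
```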
Advanced features
Temperature control
Adjust randomness in the output (0.0 is most deterministic; higher values up to 1.0 increase variability):
```python
from pathlib import Path

from together import Together

client = Together()

response = client.audio.transcriptions.create(
    file=Path("audio.mp3"),
    model="openai/whisper-large-v3",
    temperature=0.0,  # Most deterministic
)

print(f"Text: {response.text}")
```
Async support
All transcription and translation operations support async/await:
Async transcription
```python
import asyncio
from pathlib import Path

from together import AsyncTogether

async def transcribe_audio():
    client = AsyncTogether()
    response = await client.audio.transcriptions.create(
        file=Path("audio.mp3"),
        model="openai/whisper-large-v3",
        language="en",
    )
    return response.text

# Run the async function
result = asyncio.run(transcribe_audio())
print(result)
```
Async translation
```python
import asyncio
from pathlib import Path

from together import AsyncTogether

async def translate_audio():
    client = AsyncTogether()
    response = await client.audio.translations.create(
        file=Path("foreign_audio.mp3"),
        model="openai/whisper-large-v3",
    )
    return response.text

result = asyncio.run(translate_audio())
print(result)
```
Concurrent processing
Process multiple audio files concurrently:
```python
import asyncio
from pathlib import Path

from together import AsyncTogether

async def process_multiple_files():
    client = AsyncTogether()
    files = [Path("audio1.mp3"), Path("audio2.mp3"), Path("audio3.mp3")]
    tasks = [
        client.audio.transcriptions.create(
            file=file,
            model="openai/whisper-large-v3",
        )
        for file in files
    ]
    responses = await asyncio.gather(*tasks)
    for i, response in enumerate(responses):
        print(f"File {files[i]}: {response.text}")

asyncio.run(process_multiple_files())
```
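When processing many files, you may also want to cap how many requests are in flight at once. A sketch using asyncio.Semaphore; the inner coroutine here is a simulated stand-in for client.audio.transcriptions.create so the example runs without network access:

```python
import asyncio

async def transcribe_bounded(files, limit=3):
    """Run one transcription task per file, with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def transcribe_one(path):
        async with semaphore:
            # Placeholder for: await client.audio.transcriptions.create(...)
            await asyncio.sleep(0.01)
            return f"transcript of {path}"

    return await asyncio.gather(*(transcribe_one(f) for f in files))

results = asyncio.run(transcribe_bounded([f"audio{i}.mp3" for i in range(5)]))
print(results)
```

Results come back in the same order as the input files, since asyncio.gather preserves argument order.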
Best practices
Choose the right method
- Batch transcription: Best for pre-recorded audio files, podcasts, or any non-real-time use case.
- Real-time streaming: Best for live conversations, voice assistants, or applications requiring immediate feedback.
Audio quality tips
- Use high-quality audio files for better transcription accuracy.
- Minimize background noise.
- Ensure clear speech with good volume levels.
- Use appropriate sample rates (16kHz or higher recommended).
- For WebSocket streaming, use PCM format: pcm_s16le_16000.
- Consider file size limits for uploads.
- For long audio files, consider splitting into smaller chunks.
- Use streaming for real-time applications when available.
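For the "split long files into smaller chunks" tip above, chunk boundaries can be computed with a small overlap so words that fall on a cut appear in both chunks. The chunk and overlap lengths below are illustrative defaults, not API requirements:

```python
def chunk_ranges(duration, chunk_seconds=600.0, overlap_seconds=5.0):
    """Compute (start, end) time ranges for splitting a long recording,
    overlapping adjacent chunks so boundary words are not lost."""
    ranges = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_seconds, duration)
        ranges.append((start, end))
        if end >= duration:
            break
        # Step back by the overlap before starting the next chunk
        start = end - overlap_seconds
    return ranges

# A 25-minute recording split into ~10-minute chunks
print(chunk_ranges(1500.0))
```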
Diarization best practices
- Works best with clear audio and distinct speakers.
- Speakers are labeled as SPEAKER_00, SPEAKER_01, etc.
- Use the verbose_json response format to get segment-level speaker information.
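Once segments carry speaker labels, it is straightforward to aggregate per-speaker talk time, e.g. for meeting analytics. A minimal sketch over plain-dict segments (an illustrative shape, not the SDK's own types):

```python
from collections import defaultdict

def talk_time(segments):
    """Return total speaking time per speaker, in seconds."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker_id"]] += seg["end"] - seg["start"]
    return dict(totals)

# Hypothetical diarized segments
segments = [
    {"speaker_id": "SPEAKER_00", "start": 0.0, "end": 4.0},
    {"speaker_id": "SPEAKER_01", "start": 4.5, "end": 10.0},
    {"speaker_id": "SPEAKER_00", "start": 10.5, "end": 12.0},
]
print(talk_time(segments))
```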
Next steps