POST /audio/speech
# Docs for v1 can be found by changing the above selector ^
from together import Together
import os

client = Together(
    api_key=os.environ.get("TOGETHER_API_KEY"),
)

response = client.audio.speech.with_streaming_response.create(
    model="cartesia/sonic-2",
    input="The quick brown fox jumps over the lazy dog.",
    voice="laidback woman",
)

with response as stream:
    stream.stream_to_file("audio.wav")

Authorizations

Authorization
string
header
default:default
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body

application/json
model
required

The name of the model to query.

See all of Together AI's chat models. The currently supported TTS models are:
- cartesia/sonic
- hexgrad/Kokoro-82M
- canopylabs/orpheus-3b-0.1-ft

Available options:
cartesia/sonic,
hexgrad/Kokoro-82M,
canopylabs/orpheus-3b-0.1-ft
Example:

"canopylabs/orpheus-3b-0.1-ft"

input
string
required

Input text to generate the audio for.

voice
string
required

The voice to use for generating the audio. Supported voices differ by model. For example, canopylabs/orpheus-3b-0.1-ft supports tara, hexgrad/Kokoro-82M supports af_alloy, and cartesia/sonic supports "friendly sidekick".

You can list the voices supported by each model by calling the /v1/voices endpoint with the model name as a query parameter. View all supported voices here.
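As a rough illustration of the voice lookup described above, the sketch below builds a request for the /v1/voices endpoint. The base URL `https://api.together.xyz` and the query parameter name `model` are assumptions, not confirmed by this page, and `build_voices_request` is a hypothetical helper name. The request is only constructed here, not sent.

```python
import urllib.request
from urllib.parse import urlencode

# Assumption: this base URL is not stated on this page.
BASE_URL = "https://api.together.xyz"


def build_voices_request(model: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request listing the voices for a model."""
    # Assumption: the query parameter is named "model".
    query = urlencode({"model": model})
    request = urllib.request.Request(f"{BASE_URL}/v1/voices?{query}")
    # Bearer authentication, as described under Authorizations.
    request.add_header("Authorization", f"Bearer {api_key}")
    return request
```

Passing the returned object to `urllib.request.urlopen` would perform the call. The shape of the response body is not described here, so it is left unparsed.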

hexgrad/Kokoro-82M additionally supports voice mixing, where two or more voices are combined into a single blended voice by joining their names with + (e.g. af_bella+af_heart). Optional per-voice weights can be provided in parentheses (e.g. af_bella(2)+af_heart(1)). Other models require a single voice name.
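The mixing syntax above can be assembled programmatically. The following is a minimal sketch: `mix_voices` is a hypothetical helper, not part of any SDK, and it assumes integer weights as in the example above, since other weight formats are not documented here.

```python
from typing import Optional, Sequence, Tuple


def mix_voices(voices: Sequence[Tuple[str, Optional[int]]]) -> str:
    """Join (name, weight) pairs into a blended voice string.

    A weight of None omits the parenthesised weight for that voice.
    """
    if not voices:
        raise ValueError("at least one voice is required")
    parts = []
    for name, weight in voices:
        # Emit "name(weight)" when a weight is given, otherwise just "name".
        parts.append(name if weight is None else f"{name}({weight})")
    return "+".join(parts)


unweighted = mix_voices([("af_bella", None), ("af_heart", None)])
weighted = mix_voices([("af_bella", 2), ("af_heart", 1)])
```

Here `unweighted` is `af_bella+af_heart` and `weighted` is `af_bella(2)+af_heart(1)`, matching the two forms shown above. The result would be passed as the voice value for hexgrad/Kokoro-82M only.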

response_format
enum<string>
default:wav

The format of the audio output. When stream is false, the supported formats are mp3, wav, and raw. When stream is true, the only supported format is raw.

Available options:
mp3,
wav,
raw
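The interaction between response_format and stream can be checked on the client before a request is sent. The sketch below is one way to do that. `build_speech_payload` is a hypothetical helper, and the server may apply additional validation not captured here.

```python
ALLOWED_FORMATS = {"mp3", "wav", "raw"}


def build_speech_payload(model, input_text, voice,
                         response_format="wav", stream=False):
    """Return a JSON-serialisable request body, rejecting invalid combinations."""
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    # When streaming, raw is the only supported format.
    if stream and response_format != "raw":
        raise ValueError("response_format must be 'raw' when stream is true")
    return {
        "model": model,
        "input": input_text,
        "voice": voice,
        "response_format": response_format,
        "stream": stream,
    }
```

Failing early on the client keeps an unsupported combination from costing a round trip to the API.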
language
enum<string>
default:en

Language of input text.

Available options:
en,
de,
fr,
es,
hi,
it,
ja,
ko,
nl,
pl,
pt,
ru,
sv,
tr,
zh
response_encoding
enum<string>
default:pcm_f32le

Audio encoding of response. Only applicable when response_format is raw or pcm. Cartesia models respect this parameter and support all values. Orpheus, Kokoro, and Minimax models always return pcm_s16le regardless of this setting.

Available options:
pcm_f32le,
pcm_s16le,
pcm_mulaw,
pcm_alaw
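Raw output has no container, so a player needs the encoding and sample rate supplied separately. The sketch below wraps raw pcm_s16le bytes in a WAV header using Python's standard wave module. It assumes mono audio (the channel count is not stated on this page) and defaults to 24000 Hz, the rate this page gives for Orpheus and Kokoro output. `pcm_s16le_to_wav` is a hypothetical helper name.

```python
import io
import wave


def pcm_s16le_to_wav(pcm: bytes, sample_rate: int = 24000,
                     channels: int = 1) -> bytes:
    """Wrap raw 16-bit little-endian PCM samples in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)   # assumption: mono
        wav_file.setsampwidth(2)          # 16-bit samples are 2 bytes each
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()
```

This does not apply to pcm_f32le, pcm_mulaw, or pcm_alaw data, which the wave module does not write directly.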
sample_rate
integer
default:44100

Sampling rate in Hz for the output audio. Cartesia and Minimax models respect this parameter. Orpheus and Kokoro models always output at 24000 Hz regardless of this setting.

bit_rate
enum<integer>
default:128000

Bitrate of the MP3 audio output in bits per second. Only applicable when response_format is mp3. Higher values produce better audio quality at larger file sizes. Currently supported on Cartesia models.

Available options:
32000,
64000,
96000,
128000,
192000
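To make the quality versus size trade-off concrete, the approximate size of an MP3 follows from bit rate times duration. The sketch below assumes constant-bitrate encoding and ignores header and metadata overhead, so treat the result as an estimate only. `estimate_mp3_bytes` is a hypothetical helper.

```python
def estimate_mp3_bytes(bit_rate: int, duration_seconds: float) -> float:
    """Approximate MP3 payload size in bytes for a constant bit rate."""
    # bits per second * seconds = total bits; 8 bits per byte.
    return bit_rate * duration_seconds / 8


default_size = estimate_mp3_bytes(128000, 10)   # 160000.0 bytes for 10 s
smallest_size = estimate_mp3_bytes(32000, 10)   # 40000.0 bytes for 10 s
```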
stream
boolean
default:false

If true, output is streamed several characters at a time instead of waiting for the full response. The stream terminates with data: [DONE]. If false, the encoded audio is returned as an octet stream.
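A client consuming the stream has to stop at the data: [DONE] terminator. The sketch below shows one way to do that over an iterable of text lines. It assumes each event arrives as a line prefixed with `data:` in the style of server-sent events. The content of each payload is not specified on this page, so payloads are returned as undecoded strings. `iter_stream_payloads` is a hypothetical helper.

```python
from typing import Iterable, Iterator


def iter_stream_payloads(lines: Iterable[str]) -> Iterator[str]:
    """Yield the payload of each data: line until the [DONE] terminator."""
    for line in lines:
        line = line.strip()
        # Skip blank separator lines and anything that is not a data field.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end of stream; ignore anything after the terminator
        yield payload
```

The function takes any iterable of strings, so it works equally with a list of lines in a test and with a line iterator from an HTTP client.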

Response

OK

The response is of type file.