What models are available for inference on Together?
Together hosts a wide range of open-source models, and you can view the latest inference models here.

What is the maximum context window supported by Together models?
The maximum context window varies significantly by model. Refer to the specific model's documentation or the inference models page for the exact context length supported by each model.

How do I send a request to an inference endpoint?
You can use the OpenAI-compatible API. Example using curl:
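A minimal sketch of a chat completion request against the OpenAI-compatible endpoint; the model name and prompt are illustrative, and `$TOGETHER_API_KEY` is assumed to hold your API key:

```bash
curl https://api.together.xyz/v1/chat/completions \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "messages": [
      {"role": "user", "content": "What are the top 3 things to do in New York?"}
    ]
  }'
```

Because the API is OpenAI-compatible, you can also point an existing OpenAI client library at the Together base URL instead of using curl directly.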
What kind of latency can I expect for inference requests?

Latency depends on the model and prompt length. Smaller models like Mistral may respond in less than 1 second, while larger MoE models like Mixtral may take several seconds. Prompt caching and streaming can help reduce perceived latency.

Is Together suitable for high-throughput workloads?
Yes. Together supports production-scale inference. For high-throughput applications (e.g., over 100 RPS), contact the Together team for dedicated support and infrastructure.

Does Together support streaming responses?
Yes. You can receive streamed tokens by setting "stream": true in your request. This allows you to begin processing output as soon as it is generated.
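As a sketch, the same chat completion request with streaming enabled (model name, prompt, and `$TOGETHER_API_KEY` are assumptions for illustration):

```bash
curl https://api.together.xyz/v1/chat/completions \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "messages": [{"role": "user", "content": "Write a haiku about the ocean."}],
    "stream": true
  }'
```

With streaming enabled, the response arrives as a sequence of server-sent events, each carrying a chunk of generated tokens, rather than a single JSON body.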