What models are available for inference on Together?

Together hosts a wide range of open-source models; you can view the latest inference models here.

What is the maximum context window supported by Together models?

The maximum context window varies significantly by model. Refer to the specific model’s documentation or the inference models page for the exact context length supported by each model.

How do I send a request to an inference endpoint?

You can use the OpenAI-compatible API. Example using curl:
curl https://api.together.xyz/v1/chat/completions \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
More examples in Python and TypeScript are available here.
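
For reference, here is a minimal Python sketch of the same request, assuming the openai Python package pointed at Together's OpenAI-compatible endpoint (the base URL is taken from the curl example above):

import os
from openai import OpenAI
# Point the OpenAI client at Together's OpenAI-compatible endpoint.
client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",
)
response = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)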

What kind of latency can I expect for inference requests?

Latency depends on the model and prompt length. Smaller models like Mistral may respond in less than 1 second, while larger MoE models like Mixtral may take several seconds. Prompt caching and streaming can help reduce perceived latency.

Is Together suitable for high-throughput workloads?

Yes. Together supports production-scale inference. For high-throughput applications (e.g., over 100 RPS), contact the Together team for dedicated support and infrastructure.

Does Together support streaming responses?

Yes. You can receive streamed tokens by setting "stream": true in your request. This allows you to begin processing output as soon as it is generated.
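
As an illustration, a minimal streaming sketch using the same OpenAI-compatible Python client as above; stream=True corresponds to "stream": true in the request body:

import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",
)
# stream=True sends "stream": true; tokens arrive as incremental chunks.
stream = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    # Some chunks (e.g., the final one) may carry no content delta.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()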

Is my data stored or logged?

Together does not store your input or output by default. Temporary caching may be used for performance unless otherwise configured.

Can I run inference in my own VPC or on-premise?

Yes. Together supports private networking and VPC-based deployments for enterprise customers requiring data residency or regulatory compliance. Contact us for more information.

Can I use quantized models for faster inference?

Yes. Together hosts some models with quantized or reduced-precision weights (e.g., FP16, FP8, INT4) for faster and more memory-efficient inference. Support varies by model.

Can I cache prompts or use speculative decoding?

Yes. Together supports optimizations such as prompt caching and speculative decoding on supported models, reducing latency and improving throughput.

Do you support function calling or tool use?

Function calling is natively supported for some models (see here); for other models, structured prompting can simulate function-like behavior.
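
As a sketch of what a tool-calling request can look like through the OpenAI-compatible API: the tool definition and the get_weather name below are hypothetical, and you should pick a model from the linked list that supports native function calling.

import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",
)
# Hypothetical tool definition in the OpenAI-style "tools" format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
response = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # choose a model with native function calling
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
# If the model decides to call a tool, the call appears here instead of plain text.
print(response.choices[0].message.tool_calls)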

Do you support structured outputs or JSON mode?

Yes, you can use JSON mode to get structured outputs from LLMs like DeepSeek V3 & Llama 3.3. See more here.
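
A minimal JSON-mode sketch, assuming the OpenAI-style response_format parameter via the OpenAI-compatible endpoint; the model name is illustrative, and the exact schema options vary by model (see the linked docs):

import os
import json
from openai import OpenAI
client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",
)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",  # illustrative; use a model that supports JSON mode
    messages=[
        {"role": "system", "content": "Reply only with a JSON object with keys 'country' and 'capital'."},
        {"role": "user", "content": "Give me a country and its capital."},
    ],
    response_format={"type": "json_object"},  # OpenAI-style JSON mode
)
data = json.loads(response.choices[0].message.content)
print(data)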

How is inference usage billed?

Inference is billed per input and output token, with rates varying by model. Refer to the pricing page for current pricing details.
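
As a back-of-the-envelope illustration of per-token billing (the rates below are hypothetical placeholders; always take real numbers from the pricing page):

# Hypothetical rates, quoted per 1M tokens; see the pricing page for actual values.
input_rate_per_m = 0.60
output_rate_per_m = 0.60
input_tokens = 2_000
output_tokens = 500
cost = (input_tokens * input_rate_per_m + output_tokens * output_rate_per_m) / 1_000_000
print(f"Estimated cost: ${cost:.6f}")  # $0.001500 with the placeholder rates above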

What happens if I exceed my rate limit or quota?

You will receive a 429 Too Many Requests error. You can request higher limits via the Together dashboard or by contacting support.
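
A minimal retry sketch for handling 429s with exponential backoff, assuming the OpenAI-compatible Python client from the earlier examples (the SDK surfaces a 429 as RateLimitError):

import os
import time
from openai import OpenAI, RateLimitError
client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",
)
def chat_with_backoff(messages, model="mistralai/Mixtral-8x7B-Instruct-v0.1", max_retries=5):
    # Retry on 429 with exponential backoff; give up after max_retries attempts.
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except RateLimitError:
            time.sleep(2 ** attempt)
    raise RuntimeError("Still rate limited after retries")
reply = chat_with_backoff([{"role": "user", "content": "Hello!"}])
print(reply.choices[0].message.content)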

Can I run batched or parallel inference requests?

Yes. Together supports batching and high-concurrency usage. You can send parallel requests from your client and take advantage of backend batching. See Batch Inference for more details.
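
One way to fan out parallel requests from Python, sketched with a thread pool and the OpenAI-compatible client; the concurrency level here is arbitrary, so tune it to your rate limits:

import os
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI
client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",
)
prompts = ["Summarize photosynthesis.", "Explain TCP in one line.", "Name three sorting algorithms."]
def ask(prompt):
    response = client.chat.completions.create(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
# Send requests concurrently; the backend can batch them server-side.
with ThreadPoolExecutor(max_workers=8) as pool:
    for answer in pool.map(ask, prompts):
        print(answer)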

Can I use Together inference with LangChain or LlamaIndex?

Yes. Together is compatible with LangChain via the OpenAI API interface. Set your Together API key and model name in your environment or code. See more about all available integrations: LangChain, LlamaIndex, Hugging Face, Vercel AI SDK.
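
A minimal LangChain sketch routed through the OpenAI-compatible interface, assuming the langchain-openai package (ChatOpenAI accepts a custom base_url and api_key); the dedicated integrations linked above are an alternative:

import os
from langchain_openai import ChatOpenAI
# Route LangChain's OpenAI chat model through Together's OpenAI-compatible endpoint.
llm = ChatOpenAI(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",
)
print(llm.invoke("Hello!").content)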

How does Together ensure the uptime and reliability of its inference endpoints?

Together aims for high reliability and offers a 99.9% uptime SLA for dedicated endpoints.