Documentation Index
Fetch the complete documentation index at: https://docs.together.ai/llms.txt
Use this file to discover all available pages before exploring further.
Create an endpoint
- CLI / SDK
- UI
First, list available hardware options for your model:

You’ll see output similar to this:

Then create the endpoint, using the ID for your preferred hardware option:

A successful create returns the new endpoint object. These are the two fields you’ll use the most:

| Field | Example | What it’s for |
|---|---|---|
| id | endpoint-d23901de-ef8f-44bf-b3e7-de9c1ca8f2d7 | The unique identifier for the endpoint. Pass this as endpoint_id to all management operations: inspect, start, stop, update, and delete. |
| name | devuser/Qwen/Qwen3.5-9B-FP8-a32b82a1 | The model identifier you pass as the model parameter when calling inference APIs. It includes your username, the base model, and a unique suffix, so you can run multiple deployments of the same base model side by side. |

The endpoint starts in PENDING and moves to STARTED once provisioning finishes (typically a few minutes). You can call the inference API as soon as the state is STARTED.

Target an availability zone

If you have latency or geographic constraints, you can target a specific availability zone. Only do this if you need to, since it can limit hardware availability.

Inspect an endpoint

Get the current state and configuration of an endpoint by ID:

List your endpoints
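As a rough sketch of the flow above (create the endpoint, then wait for STARTED before calling inference), the snippet below builds the request pieces in stdlib Python. The base URL, paths, and body field names (`hardware`, `autoscaling`) are assumptions for illustration, not confirmed API shapes — check the API reference for the real ones.

```python
import json
import time
import urllib.request

BASE = "https://api.together.xyz/v1"  # assumed base URL; check the API reference


def request_json(method, path, api_key, payload=None):
    """Minimal helper for authenticated JSON calls (create, inspect, list)."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(
        f"{BASE}{path}", data=data, method=method,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def create_payload(model, hardware_id, min_replicas=1, max_replicas=1):
    """Request body for creating an endpoint (field names are assumptions)."""
    return {
        "model": model,
        "hardware": hardware_id,
        "autoscaling": {"min_replicas": min_replicas,
                        "max_replicas": max_replicas},
    }


def wait_until_started(get_state, poll_seconds=15, max_polls=40):
    """Poll `get_state` (a callable returning the endpoint state string)
    until it reports STARTED. Returns True once the endpoint is up."""
    for _ in range(max_polls):
        if get_state() == "STARTED":
            return True
        time.sleep(poll_seconds)
    return False
```

In practice you would pass `lambda: request_json("GET", f"/endpoints/{endpoint_id}", api_key)["state"]` (or whatever the real state field is) into `wait_until_started`, then start sending inference requests using the endpoint’s `name` as the `model` parameter.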
Start, stop, and restart
Stopping an endpoint pauses billing. Restarting brings it back online without re-provisioning hardware (subject to availability).

- CLI / SDK
- UI
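A minimal sketch of how the management operations address an endpoint by its id. The paths and verbs here are assumptions modeled on common REST conventions (and a restart expressed as stop-then-start), not taken verbatim from the API reference.

```python
BASE = "https://api.together.xyz/v1"  # assumed base URL; check the API reference


def management_request(endpoint_id, action):
    """Return (HTTP method, URL) for a management operation on one endpoint.
    Sub-paths like /stop and /start are hypothetical placeholders."""
    if action in ("stop", "start"):
        return "POST", f"{BASE}/endpoints/{endpoint_id}/{action}"
    if action == "inspect":
        return "GET", f"{BASE}/endpoints/{endpoint_id}"
    if action == "delete":
        return "DELETE", f"{BASE}/endpoints/{endpoint_id}"
    raise ValueError(f"unknown action: {action}")


def restart(endpoint_id, send):
    """Restart = stop, then start. `send(method, url)` performs the call.
    Per the doc, this does not re-provision hardware (subject to availability)."""
    send(*management_request(endpoint_id, "stop"))
    send(*management_request(endpoint_id, "start"))
```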
Update endpoint settings
You can change replica counts on a running endpoint without re-creating it. Both min_replicas and max_replicas must be supplied together.
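A small sketch of building the update body, enforcing the rule above that both replica counts must be supplied together. The `autoscaling` field name is an assumption carried over from the create sketch, not a confirmed API detail.

```python
def update_payload(min_replicas, max_replicas):
    """Validate and build the autoscaling update body.
    Both values must be supplied together, and min must not exceed max."""
    if min_replicas is None or max_replicas is None:
        raise ValueError("min_replicas and max_replicas must be supplied together")
    if not 1 <= min_replicas <= max_replicas:
        raise ValueError("need 1 <= min_replicas <= max_replicas")
    return {"autoscaling": {"min_replicas": min_replicas,
                            "max_replicas": max_replicas}}
```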
Delete an endpoint
Deletion is permanent. Stopped endpoints incur no charges, so prefer stopping unless you want to completely remove the endpoint.

Troubleshooting
Endpoints don’t always deploy immediately. Here are the most common reasons:

- Low availability: Hardware may be available, but only enough for a partial replica count. The endpoint starts but scales to the available count. If your minimum replica count is higher than current capacity, the endpoint stays queued until capacity recovers. To avoid the wait, lower the minimum replica count.
- Hardware unavailable error: If you see “Hardware for endpoint not available now, please try again later”, the hardware you selected is fully claimed. Try a comparable model on different hardware (use whichllm.together.ai to find substitutes), or retry later.
- Model not supported: Not every model is available for dedicated endpoint deployments. For a list of deployable models, see the dedicated endpoint model catalog. A fine-tuned model can only deploy on a dedicated endpoint if its base model is supported.
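For the “Hardware for endpoint not available now” case, retrying later can be automated with simple exponential backoff. This is a generic sketch: `attempt_create` and `HardwareUnavailable` are hypothetical stand-ins for your create call and however your client surfaces that error.

```python
import time


class HardwareUnavailable(Exception):
    """Hypothetical error for 'hardware fully claimed, try again later'."""


def create_with_retry(attempt_create, retries=5, base_delay=30):
    """Call `attempt_create` until it succeeds, doubling the wait between
    attempts. Re-raises HardwareUnavailable after the final attempt."""
    delay = base_delay
    for attempt in range(retries):
        try:
            return attempt_create()
        except HardwareUnavailable:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
            delay *= 2  # exponential backoff
```

If capacity stays claimed for long stretches, switching to comparable hardware (as suggested above) is usually faster than retrying.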
Next steps
Quickstart
Deploy and call your first endpoint in 5 minutes.
Available models
Browse the list of available models for instant deployment.
Endpoint settings
Configure endpoint hardware, autoscaling, decoding, and prompt caching.
Scaling
Learn how endpoints scale and when to use vertical vs. horizontal scaling.