

Create an endpoint

To avoid unexpected charges, you can set an auto-shutdown timer when creating an endpoint. Make sure to review your active deployments periodically in the models dashboard to stop endpoints you’re no longer using.
First, list available hardware options for your model:
together endpoints hardware --model Qwen/Qwen3.5-9B-FP8
You’ll see output similar to this:
Hardware ID              GPU    Memory    Count    Price (per minute)    Availability
1x_nvidia_h100_80gb_sxm  h100   80GB      1        $0.06                 Available
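The per-minute rate makes deployment costs easy to estimate before you create anything. A minimal sketch, using the $0.06/minute rate from the listing above (rates vary by hardware, so treat the numbers as illustrative):

```python
# Estimate dedicated-endpoint cost from a per-minute rate.
# The $0.06/min figure comes from the hardware listing above.

def endpoint_cost(price_per_minute: float, hours: float, replicas: int = 1) -> float:
    """Total cost in dollars for running `replicas` replicas for `hours` hours."""
    return round(price_per_minute * 60 * hours * replicas, 2)

# One H100 replica for an 8-hour workday:
print(endpoint_cost(0.06, 8))                      # 28.8
# Two replicas running continuously for 30 days:
print(endpoint_cost(0.06, 24 * 30, replicas=2))    # 5184.0
```

The month-long figure is why stopping idle endpoints (covered below) matters: billing is per minute of uptime, not per request.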
Then create the endpoint, using the ID for your preferred hardware option:
together endpoints create \
  --model Qwen/Qwen3.5-9B-FP8 \
  --hardware 1x_nvidia_h100_80gb_sxm \
  --display-name "My endpoint" \
  --wait

Output

A successful create returns the new endpoint object:
{
  "object": "endpoint",
  "id": "endpoint-d23901de-ef8f-44bf-b3e7-de9c1ca8f2d7",
  "name": "devuser/Qwen/Qwen3.5-9B-FP8-a32b82a1",
  "display_name": "My endpoint",
  "model": "Qwen/Qwen3.5-9B-FP8",
  "hardware": "1x_nvidia_h100_80gb_sxm",
  "type": "dedicated",
  "owner": "devuser",
  "state": "PENDING",
  "autoscaling": { "min_replicas": 1, "max_replicas": 1 },
  "created_at": "2026-05-04T10:43:55.405Z"
}
These are the two fields you’ll use the most:
  • id (e.g. endpoint-d23901de-ef8f-44bf-b3e7-de9c1ca8f2d7): The unique identifier for the endpoint. Pass this as endpoint_id to all management operations: inspect, start, stop, update, and delete.
  • name (e.g. devuser/Qwen/Qwen3.5-9B-FP8-a32b82a1): The model identifier you pass as the model parameter when calling inference APIs. It includes your username, the base model, and a unique suffix so you can run multiple deployments of the same base model side by side.
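A small sketch of pulling those two fields out of the create response. The JSON shape is the one shown above; splitting name on the last hyphen to recover the suffix is an assumption based on the username/base-model/suffix structure described here:

```python
import json

# The create response shown above, trimmed to the two fields discussed here.
response = json.loads("""
{
  "id": "endpoint-d23901de-ef8f-44bf-b3e7-de9c1ca8f2d7",
  "name": "devuser/Qwen/Qwen3.5-9B-FP8-a32b82a1"
}
""")

endpoint_id = response["id"]     # pass as endpoint_id to stop/start/update/delete
model_name = response["name"]    # pass as `model` when calling inference APIs

# The name decomposes into username / base model / unique suffix:
owner, rest = model_name.split("/", 1)
base_model, suffix = rest.rsplit("-", 1)
print(owner, base_model, suffix)   # devuser Qwen/Qwen3.5-9B-FP8 a32b82a1
```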
The endpoint starts in PENDING and moves to STARTED once provisioning finishes (typically a few minutes). You can call the inference API as soon as the state is STARTED.
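If you don't use --wait, you can script the wait yourself by polling until the state leaves PENDING. A sketch of the polling logic, with the state lookup left as a pluggable callable so it can wrap whatever client you use (the hypothetical fetch_state stands in for a real `together endpoints retrieve` call):

```python
import time
from typing import Callable

def wait_until_started(fetch_state: Callable[[], str],
                       poll_seconds: float = 10.0,
                       timeout_seconds: float = 900.0) -> str:
    """Poll `fetch_state` until the endpoint leaves PENDING or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        state = fetch_state()
        if state != "PENDING":
            return state          # STARTED on success; anything else is worth inspecting
        time.sleep(poll_seconds)
    raise TimeoutError("endpoint still PENDING after timeout")

# Simulated provisioning: two PENDING polls, then STARTED.
states = iter(["PENDING", "PENDING", "STARTED"])
print(wait_until_started(lambda: next(states), poll_seconds=0))   # STARTED
```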

Target an availability zone

If you have latency or geographic constraints, you can target a specific availability zone. Only do this if you need to, since it can limit hardware availability.
together endpoints create \
  --model Qwen/Qwen3.5-9B-FP8 \
  --hardware 1x_nvidia_h100_80gb_sxm \
  --display-name "My endpoint" \
  --availability-zone us-east-1a \
  --wait

# List all availability zones
together endpoints availability-zones

Inspect an endpoint

Get the current state and configuration of an endpoint by ID:
together endpoints retrieve <endpoint_id>
Sample output:
ID:           endpoint-e6c6b82f-90f7-45b7-af39-3ca3b51d08xx
Name:         tester/Qwen/Qwen3.5-9B-FP8-bb04c904
Display Name: My endpoint
Hardware:     1x_nvidia_h100_80gb_sxm
Autoscaling:  Min=1, Max=1
Model:        Qwen/Qwen3.5-9B-FP8
Type:         dedicated
Owner:        tester
State:        READY
Created:      2025-02-18 11:55:50.686000+00:00
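The retrieve output is plain "Key: value" lines, which makes it easy to scrape in scripts. A sketch that parses the sample above into a dict (field names are taken verbatim from that output; a real script would run the CLI and capture stdout instead of hard-coding it):

```python
# Sample `together endpoints retrieve` output, as shown above.
sample = """\
ID:           endpoint-e6c6b82f-90f7-45b7-af39-3ca3b51d08xx
Name:         tester/Qwen/Qwen3.5-9B-FP8-bb04c904
Display Name: My endpoint
Hardware:     1x_nvidia_h100_80gb_sxm
State:        READY
"""

def parse_endpoint(text: str) -> dict:
    """Split each 'Key: value' line on the first colon."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields

info = parse_endpoint(sample)
print(info["State"])     # READY
```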

List your endpoints

# All your endpoints
together endpoints list

# Only on-demand dedicated endpoints
together endpoints list --type dedicated --usage-type on-demand

Start, stop, and restart

Stopping an endpoint pauses billing. Restarting brings it back online without re-provisioning hardware (subject to availability).
# Stop a running endpoint (billing pauses immediately)
together endpoints stop <endpoint_id>

# Start a stopped endpoint
together endpoints start <endpoint_id>

Update endpoint settings

You can change replica counts on a running endpoint without re-creating it. Both min_replicas and max_replicas must be supplied together.
together endpoints update --min-replicas 2 --max-replicas 4 <endpoint_id>
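Since both bounds must be supplied together, a quick pre-flight check avoids a rejected update. A sketch of that validation (the min ≤ max and min ≥ 1 rules are reasonable assumptions, not documented constraints):

```python
def validate_replicas(min_replicas: int, max_replicas: int) -> None:
    """Sanity-check an autoscaling update before sending it."""
    if min_replicas < 1:
        raise ValueError("min_replicas must be at least 1")
    if max_replicas < min_replicas:
        raise ValueError("max_replicas must be >= min_replicas")

validate_replicas(2, 4)    # OK: matches the update command above
try:
    validate_replicas(4, 2)
except ValueError as e:
    print(e)               # max_replicas must be >= min_replicas
```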
For other settings (hardware, decoding optimizations), see Endpoint settings. Some updates require a full redeployment.

Delete an endpoint

Deletion is permanent. Stopped endpoints incur no charges, so prefer stopping unless you want to completely remove the endpoint.
together endpoints delete <endpoint_id>

Troubleshooting

Endpoints don’t always deploy immediately. Here are the most common reasons:
  • Low availability: Hardware may be available but only enough for a partial replica count. The endpoint starts but scales to the available count. If your minimum replica count is higher than current capacity, the endpoint stays queued until capacity recovers. To avoid the wait, lower the minimum replica count.
  • Hardware unavailable error: If you see “Hardware for endpoint not available now, please try again later”, the hardware you selected is fully claimed. Try a comparable model on different hardware (use whichllm.together.ai to find substitutes), or retry later.
  • Model not supported: Not every model is available for dedicated endpoint deployments. For a list of deployable models, see the dedicated endpoint model catalog. A fine-tuned model can only deploy on a dedicated endpoint if its base model is supported.
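When the hardware-unavailable error is transient, retrying with backoff is often enough. A sketch with the create call abstracted as a callable (the hypothetical create_endpoint and HardwareUnavailable names are illustrative; exponential backoff is an assumption, not documented behavior):

```python
import time

class HardwareUnavailable(Exception):
    """Stand-in for the 'Hardware for endpoint not available now' error."""

def create_with_retry(create_endpoint, attempts: int = 5, base_delay: float = 30.0):
    """Retry a create call with exponential backoff while hardware is claimed."""
    for attempt in range(attempts):
        try:
            return create_endpoint()
        except HardwareUnavailable:
            if attempt == attempts - 1:
                raise                          # out of attempts: surface the error
            time.sleep(base_delay * 2 ** attempt)

# Simulated: hardware frees up on the third try.
calls = {"n": 0}
def fake_create():
    calls["n"] += 1
    if calls["n"] < 3:
        raise HardwareUnavailable()
    return "endpoint-created"

print(create_with_retry(fake_create, base_delay=0))   # endpoint-created
```

Lowering the minimum replica count (the first bullet above) remains the better fix when partial capacity is acceptable; retrying only helps when you need the full count.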

Next steps

Quickstart

Deploy and call your first endpoint in 5 minutes.

Available models

Browse the list of available models for instant deployment.

Endpoint settings

Configure endpoint hardware, autoscaling, decoding, prompt caching.

Scaling

Learn how endpoints scale and when to use vertical vs. horizontal scaling.