

Create an endpoint

To avoid unexpected charges, you can set an auto-shutdown timer when creating an endpoint. Make sure to review your active deployments periodically in the models dashboard to stop endpoints you’re no longer using.
First, list available hardware options for your model:
together endpoints hardware --model Qwen/Qwen3.5-9B-FP8
You’ll see output similar to this:
Hardware ID              GPU    Memory    Count    Price (per minute)    Availability
1x_nvidia_h100_80gb_sxm  h100   80GB      1        $0.06                 Available
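The per-minute rate makes deployment costs easy to estimate before you create anything. A minimal sketch, using the $0.06/minute rate from the listing above (rates vary by hardware, so treat the numbers as illustrative):

```python
# Estimate dedicated-endpoint cost from a per-minute rate.
# The $0.06/min figure comes from the hardware listing above.

def endpoint_cost(price_per_minute: float, hours: float, replicas: int = 1) -> float:
    """Total cost in dollars for running `replicas` replicas for `hours` hours."""
    return round(price_per_minute * 60 * hours * replicas, 2)

# One H100 replica for an 8-hour workday:
print(endpoint_cost(0.06, 8))                      # 28.8
# Two replicas running continuously for 30 days:
print(endpoint_cost(0.06, 24 * 30, replicas=2))    # 5184.0
```

The month-long figure is why stopping idle endpoints (covered below) matters: billing is per minute of uptime, not per request.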
Then create the endpoint, using the ID for your preferred hardware option:
together endpoints create \
  --model Qwen/Qwen3.5-9B-FP8 \
  --hardware 1x_nvidia_h100_80gb_sxm \
  --display-name "My endpoint" \
  --wait

Output

A successful create returns the new endpoint object:
{
  "object": "endpoint",
  "id": "endpoint-d23901de-ef8f-44bf-b3e7-de9c1ca8f2d7",
  "name": "devuser/Qwen/Qwen3.5-9B-FP8-a32b82a1",
  "display_name": "My endpoint",
  "model": "Qwen/Qwen3.5-9B-FP8",
  "hardware": "1x_nvidia_h100_80gb_sxm",
  "type": "dedicated",
  "owner": "devuser",
  "state": "PENDING",
  "autoscaling": { "min_replicas": 1, "max_replicas": 1 },
  "created_at": "2026-05-04T10:43:55.405Z"
}
These are the two fields you’ll use the most:
  • id (e.g. endpoint-d23901de-ef8f-44bf-b3e7-de9c1ca8f2d7): The unique identifier for the endpoint. Pass this as endpoint_id to all management operations: inspect, start, stop, update, and delete.
  • name (e.g. devuser/Qwen/Qwen3.5-9B-FP8-a32b82a1): The model identifier you pass as the model parameter when calling inference APIs. It includes your username, the base model, and a unique suffix so you can run multiple deployments of the same base model side by side.
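A small sketch of pulling those two fields out of the create response. The JSON shape is the one shown above; splitting name on the last hyphen to recover the suffix is an assumption based on the username/base-model/suffix structure described here:

```python
import json

# The create response shown above, trimmed to the two fields discussed here.
response = json.loads("""
{
  "id": "endpoint-d23901de-ef8f-44bf-b3e7-de9c1ca8f2d7",
  "name": "devuser/Qwen/Qwen3.5-9B-FP8-a32b82a1"
}
""")

endpoint_id = response["id"]     # pass as endpoint_id to stop/start/update/delete
model_name = response["name"]    # pass as `model` when calling inference APIs

# The name decomposes into username / base model / unique suffix:
owner, rest = model_name.split("/", 1)
base_model, suffix = rest.rsplit("-", 1)
print(owner, base_model, suffix)   # devuser Qwen/Qwen3.5-9B-FP8 a32b82a1
```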
The endpoint starts in PENDING and moves to STARTED once provisioning finishes (typically a few minutes). You can call the inference API as soon as the state is STARTED.
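If you don't use --wait, you can script the wait yourself by polling until the state leaves PENDING. A sketch of the polling logic, with the state lookup left as a pluggable callable so it can wrap whatever client you use (the hypothetical fetch_state stands in for a real `together endpoints retrieve` call):

```python
import time
from typing import Callable

def wait_until_started(fetch_state: Callable[[], str],
                       poll_seconds: float = 10.0,
                       timeout_seconds: float = 900.0) -> str:
    """Poll `fetch_state` until the endpoint leaves PENDING or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        state = fetch_state()
        if state != "PENDING":
            return state          # STARTED on success; anything else is worth inspecting
        time.sleep(poll_seconds)
    raise TimeoutError("endpoint still PENDING after timeout")

# Simulated provisioning: two PENDING polls, then STARTED.
states = iter(["PENDING", "PENDING", "STARTED"])
print(wait_until_started(lambda: next(states), poll_seconds=0))   # STARTED
```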

Target an availability zone

If you have latency or geographic constraints, you can target a specific availability zone. Only do this if you need to, since it can limit hardware availability.
together endpoints create \
  --model Qwen/Qwen3.5-9B-FP8 \
  --hardware 1x_nvidia_h100_80gb_sxm \
  --display-name "My endpoint" \
  --availability-zone us-east-1a \
  --wait

# List all availability zones
together endpoints availability-zones

Inspect an endpoint

Get the current state and configuration of an endpoint by ID:
together endpoints retrieve <endpoint_id>
Sample output:
ID:           endpoint-e6c6b82f-90f7-45b7-af39-3ca3b51d08xx
Name:         tester/Qwen/Qwen3.5-9B-FP8-bb04c904
Display Name: My endpoint
Hardware:     1x_nvidia_h100_80gb_sxm
Autoscaling:  Min=1, Max=1
Model:        Qwen/Qwen3.5-9B-FP8
Type:         dedicated
Owner:        tester
State:        READY
Created:      2025-02-18 11:55:50.686000+00:00
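The retrieve output is plain "Key: value" lines, which makes it easy to scrape in scripts. A sketch that parses the sample above into a dict (field names are taken verbatim from that output; a real script would run the CLI and capture stdout instead of hard-coding it):

```python
# Sample `together endpoints retrieve` output, as shown above.
sample = """\
ID:           endpoint-e6c6b82f-90f7-45b7-af39-3ca3b51d08xx
Name:         tester/Qwen/Qwen3.5-9B-FP8-bb04c904
Display Name: My endpoint
Hardware:     1x_nvidia_h100_80gb_sxm
State:        READY
"""

def parse_endpoint(text: str) -> dict:
    """Split each 'Key: value' line on the first colon."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields

info = parse_endpoint(sample)
print(info["State"])     # READY
```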

List your endpoints

# All your endpoints
together endpoints list

# Only on-demand dedicated endpoints
together endpoints list --type dedicated --usage-type on-demand

Start, stop, and restart

Stopping an endpoint pauses billing. Restarting brings it back online without re-provisioning hardware (subject to availability).
# Stop a running endpoint (billing pauses immediately)
together endpoints stop <endpoint_id>

# Start a stopped endpoint
together endpoints start <endpoint_id>

Update endpoint settings

You can change replica counts on a running endpoint without re-creating it. Both min_replicas and max_replicas must be supplied together.
together endpoints update --min-replicas 2 --max-replicas 4 <endpoint_id>
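Since both bounds must be supplied together, a quick pre-flight check avoids a rejected update. A sketch of that validation (the min ≤ max and min ≥ 1 rules are reasonable assumptions, not documented constraints):

```python
def validate_replicas(min_replicas: int, max_replicas: int) -> None:
    """Sanity-check an autoscaling update before sending it."""
    if min_replicas < 1:
        raise ValueError("min_replicas must be at least 1")
    if max_replicas < min_replicas:
        raise ValueError("max_replicas must be >= min_replicas")

validate_replicas(2, 4)    # OK: matches the update command above
try:
    validate_replicas(4, 2)
except ValueError as e:
    print(e)               # max_replicas must be >= min_replicas
```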
For other settings (hardware, decoding optimizations), see Endpoint settings. Some updates require a full redeployment.

Delete an endpoint

Deletion is permanent. Stopped endpoints incur no charges, so prefer stopping unless you want to completely remove the endpoint.
together endpoints delete <endpoint_id>

Troubleshooting

Endpoints don’t always deploy immediately. Here are the most common reasons:
  • Low availability: Hardware may be available but only enough for a partial replica count. The endpoint starts but scales to the available count. If your minimum replica count is higher than current capacity, the endpoint stays queued until capacity recovers. To avoid the wait, lower the minimum replica count.
  • Hardware unavailable error: If you see “Hardware for endpoint not available now, please try again later”, the hardware you selected is fully claimed. Try a comparable model on different hardware (use whichllm.together.ai to find substitutes), or retry later.
  • Model not supported: Not every model is available for dedicated endpoint deployments. For a list of deployable models, see the dedicated endpoint model catalog. A fine-tuned model can only deploy on a dedicated endpoint if its base model is supported.
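When the hardware-unavailable error is transient, retrying with backoff is often enough. A sketch with the create call abstracted as a callable (the hypothetical create_endpoint and HardwareUnavailable names are illustrative; exponential backoff is an assumption, not documented behavior):

```python
import time

class HardwareUnavailable(Exception):
    """Stand-in for the 'Hardware for endpoint not available now' error."""

def create_with_retry(create_endpoint, attempts: int = 5, base_delay: float = 30.0):
    """Retry a create call with exponential backoff while hardware is claimed."""
    for attempt in range(attempts):
        try:
            return create_endpoint()
        except HardwareUnavailable:
            if attempt == attempts - 1:
                raise                          # out of attempts: surface the error
            time.sleep(base_delay * 2 ** attempt)

# Simulated: hardware frees up on the third try.
calls = {"n": 0}
def fake_create():
    calls["n"] += 1
    if calls["n"] < 3:
        raise HardwareUnavailable()
    return "endpoint-created"

print(create_with_retry(fake_create, base_delay=0))   # endpoint-created
```

Lowering the minimum replica count (the first bullet above) remains the better fix when partial capacity is acceptable; retrying only helps when you need the full count.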

Next steps

Quickstart

Deploy and call your first endpoint in 5 minutes.

Available models

Browse the list of available models for instant deployment.

Endpoint settings

Configure endpoint hardware, autoscaling, decoding, prompt caching.

Scaling

Learn how endpoints scale and when to use vertical vs. horizontal scaling.