Dedicated inference
Deploy models as dedicated endpoints using the Together AI API
The Together AI Dedicated Endpoints API allows you to deploy models as dedicated endpoints with custom hardware configurations and autoscaling capabilities.
This guide walks through the key API endpoints for managing dedicated model deployments.
Authentication
All API requests require authentication using your Together API key. Set your API key in the Authorization header:
curl -H "Authorization: Bearer YOUR_API_KEY" <https://api.together.xyz/v1/>...
API Endpoints
1. Models
Before creating a dedicated endpoint, you'll need to select or import a model to deploy.
List Available Models
Lists models that can be deployed as dedicated endpoints. An example model object from the response is shown below.
curl -X GET https://api.together.xyz/v0/models \
-H "Authorization: Bearer YOUR_API_KEY"
List Available Models Response
{
"object": "model",
"id": "model-42168e53-cfc5-474e-adb1-06fcd7aba56b",
"name": "meta-llama/Llama-3-70b-chat-hf",
"display_name": "Llama 3 70B Chat",
"owner": {
"user": "together",
"organization": "together.ai"
},
"type": "chat",
"num_parameters": 70000000000,
"context_length": 4096,
"chat_config": {
"chat_template": "...",
"add_generation_prompt": true,
"bos_token": "<s>",
"eos_token": "</s>",
"stop": ["</s>"]
}
}
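From Python, the same listing might look like the sketch below (using the requests library and the header convention from the Authentication section). Field names follow the example object above; since only a single example object is shown, the exact top-level wrapper is handled defensively.
import os
import requests

HEADERS = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}

resp = requests.get("https://api.together.xyz/v0/models", headers=HEADERS)
resp.raise_for_status()
body = resp.json()
# Handle either a bare JSON array or a {"data": [...]} wrapper, since only one
# example model object is shown in the documentation above.
models = body if isinstance(body, list) else body.get("data", [])
for model in models:
    print(model["name"], "-", model.get("context_length"), "context tokens")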
Create Model Request
Import a supported model into your account (here, from Hugging Face) so it can be deployed as a dedicated endpoint.
curl -X POST <https://api.together.xyz/v0/models> \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json"\
-d '{
"model_name": "mistral-large-2407",
"model_source": "mistralai/Mistral-Large-Instruct-2407",
"hf_token": "hf_YIIfXdff2eca3c3ac99fa262b6976",
"description": "Official Mistral Large Instruct Model"
}'
Create Model Response
{
"message": "Processing model weights",
"data": {
"job_id": "job-3231364a-b1ef-41c1-b5f1-42414e70cc4f",
"model_name": "devuser/mistral-large-2407",
"model_source": "mistralai/Mistral-Large-Instruct-2407"
}
}
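The same import request in Python; the HF_TOKEN environment variable is an assumed place to keep your Hugging Face token. Save the returned job_id so you can track the import with the Jobs API in section 4.
import os
import requests

HEADERS = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}

payload = {
    "model_name": "mistral-large-2407",
    "model_source": "mistralai/Mistral-Large-Instruct-2407",
    "hf_token": os.environ["HF_TOKEN"],  # assumed env var for your Hugging Face token
    "description": "Official Mistral Large Instruct Model",
}
resp = requests.post("https://api.together.xyz/v0/models", headers=HEADERS, json=payload)
resp.raise_for_status()
job_id = resp.json()["data"]["job_id"]  # shape follows the response shown above
print("Model import job:", job_id)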
2. Hardware Configuration
List Available Hardware
Get information about available hardware configurations for your endpoints.
curl -X GET https://api.together.xyz/v1/hardware \
-H "Authorization: Bearer YOUR_API_KEY"
List Available Hardware Response
{
"object": "list",
"data": [
{
"object": "hardware",
"name": "2x_nvidia_a100_80gb_sxm",
"pricing": {
"input": 0,
"output": 0,
"cents_per_minute": 5.42
},
"specs": {
"gpu_type": "a100-80gb",
"gpu_link": "sxm",
"gpu_memory": 80,
"gpu_count": 2
},
"updated_at": "2024-01-01T00:00:00Z"
}
]
}
Get Model-Specific Hardware Options
Retrieve hardware configurations available for a specific model.
curl -X GET "https://api.together.xyz/v1/hardware?model=meta-llama/Llama-3-70b-chat-hf" \
-H "Authorization: Bearer YOUR_API_KEY"
Get Model Hardware Response
{
"object": "list",
"data": [
{
"object": "hardware",
"name": "2x_nvidia_a100_80gb_sxm",
"pricing": {
"input": 0,
"output": 0,
"cents_per_minute": 5.42
},
"specs": {
"gpu_type": "a100-80gb",
"gpu_link": "sxm",
"gpu_memory": 80,
"gpu_count": 2
},
"availability": {
"status": "available"
}
}
]
}
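A Python sketch that fetches the hardware options for a model and prints the per-minute price and availability of each configuration, following the field names in the responses above:
import os
import requests

HEADERS = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}

resp = requests.get(
    "https://api.together.xyz/v1/hardware",
    headers=HEADERS,
    params={"model": "meta-llama/Llama-3-70b-chat-hf"},
)
resp.raise_for_status()
for hw in resp.json()["data"]:
    price = hw["pricing"]["cents_per_minute"]
    status = hw.get("availability", {}).get("status", "unknown")
    print(f"{hw['name']}: {price} cents/min, {status}")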
3. Endpoints Management
Create Endpoint Request
Create a new dedicated endpoint for your model. The endpoint starts automatically once it is created.
curl -X POST https://api.together.xyz/v1/endpoints \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"display_name": "My Llama3 70b endpoint",
"model": "meta-llama/Llama-3-8b-chat-hf",
"hardware": "1x_nvidia_a100_80gb_sxm",
"autoscaling": {
"min_replicas": 1,
"max_replicas": 3
}
}'
Create Endpoint Response
{
"object": "endpoint",
"id": "endpoint-d23901de-ef8f-44bf-b3e7-de9c1ca8f2d7",
"name": "devuser/meta-llama/Llama-3-8b-chat-hf-a32b82a1",
"display_name": "My Llama3 70b endpoint",
"model": "meta-llama/Llama-3-8b-chat-hf",
"hardware": "1x_nvidia_a100_80gb_sxm",
"type": "dedicated",
"owner": "devuser",
"state": "PENDING",
"autoscaling": {
"min_replicas": 1,
"max_replicas": 3
},
"created_at": "2025-02-04T10:43:55.405Z"
}
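The same create call from Python. The id in the response is the handle used by every management call that follows:
import os
import requests

HEADERS = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}

payload = {
    "display_name": "My Llama3 70b endpoint",
    "model": "meta-llama/Llama-3-8b-chat-hf",
    "hardware": "1x_nvidia_a100_80gb_sxm",
    "autoscaling": {"min_replicas": 1, "max_replicas": 3},
}
resp = requests.post("https://api.together.xyz/v1/endpoints", headers=HEADERS, json=payload)
resp.raise_for_status()
endpoint = resp.json()
print(endpoint["id"], endpoint["state"])  # e.g. "endpoint-..." "PENDING"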
Get Endpoint by ID Request
Retrieve an endpoint by ID to see its configuration details and current state.
curl -X GET https://api.together.xyz/v1/endpoints/{endpointId} \
-H "Authorization: Bearer YOUR_API_KEY"
Get Endpoint by ID Response
{
"object": "endpoint",
"id": "endpoint-d23901de-ef8f-44bf-b3e7-de9c1ca8f2d7",
"name": "devuser/meta-llama/Llama-3-8b-chat-hf-a32b82a1",
"display_name": "My Llama3 70b endpoint",
"model": "meta-llama/Llama-3-8b-chat-hf",
"hardware": "1x_nvidia_a100_80gb_sxm",
"type": "dedicated",
"owner": "devuser",
"state": "STARTED",
"autoscaling": {
"min_replicas": 1,
"max_replicas": 3
},
"created_at": "2025-02-04T10:43:55.405Z"
}
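Since a new endpoint begins in the PENDING state, a common pattern is to poll this call until the state becomes STARTED. A small polling sketch (the 15-second interval is an arbitrary choice):
import os
import time
import requests

HEADERS = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}

def wait_until_started(endpoint_id, poll_seconds=15):
    """Poll the endpoint until it reports the STARTED state, then return its details."""
    while True:
        resp = requests.get(
            f"https://api.together.xyz/v1/endpoints/{endpoint_id}", headers=HEADERS
        )
        resp.raise_for_status()
        endpoint = resp.json()
        print("current state:", endpoint["state"])
        if endpoint["state"] == "STARTED":
            return endpoint
        time.sleep(poll_seconds)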
Update Endpoint Request
Update an endpoint's configuration, such as its display name or autoscaling settings.
curl -X PATCH https://api.together.xyz/v1/endpoints/${endpointId} \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"display_name": "My Llama3 70b endpoint old",
"autoscaling": {
"min_replicas": 1,
"max_replicas": 5
}
}'
Update Endpoint Response
{
"object": "endpoint",
"id": "endpoint-d23901de-ef8f-44bf-b3e7-de9c1ca8f2d7",
"name": "devuser/meta-llama/Llama-3-8b-chat-hf-a32b82a1",
"display_name": "My Llama3 70b endpoint",
"model": "meta-llama/Llama-3-8b-chat-hf",
"hardware": "1x_nvidia_a100_80gb_sxm",
"type": "dedicated",
"owner": "devuser",
"state": "STOPPING",
"autoscaling": {
"min_replicas": 1,
"max_replicas": 3
},
"created_at": "2025-02-04T10:43:55.405Z"
}
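In Python, the same partial update looks like the sketch below; as the stop and start examples later in this section suggest, only the fields included in the PATCH body are changed.
import os
import requests

HEADERS = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}

endpoint_id = "endpoint-d23901de-ef8f-44bf-b3e7-de9c1ca8f2d7"  # from the create response
resp = requests.patch(
    f"https://api.together.xyz/v1/endpoints/{endpoint_id}",
    headers=HEADERS,
    json={"autoscaling": {"min_replicas": 1, "max_replicas": 5}},
)
resp.raise_for_status()
print(resp.json()["autoscaling"])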
Stop Endpoint Request
Stop a running endpoint deployment.
curl -X PATCH https://api.together.xyz/v1/endpoints/${endpointId} \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"state": "STOPPED"
}'
Stop Endpoint Response
{
"object": "endpoint",
"id": "endpoint-d23901de-ef8f-44bf-b3e7-de9c1ca8f2d7",
"name": "devuser/meta-llama/Llama-3-8b-chat-hf-a32b82a1",
"display_name": "My Llama3 70b endpoint",
"model": "meta-llama/Llama-3-8b-chat-hf",
"hardware": "1x_nvidia_a100_80gb_sxm",
"type": "dedicated",
"owner": "devuser",
"state": "STOPPING",
"autoscaling": {
"min_replicas": 1,
"max_replicas": 3
},
"created_at": "2025-02-04T10:43:55.405Z"
}
Start Endpoint Request
Start an endpoint that is not currently running.
curl -X PATCH https://api.together.xyz/v1/endpoints/${endpointId} \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"state": "STARTED"
}'
Start Endpoint Response
{
"object": "endpoint",
"id": "endpoint-d23901de-ef8f-44bf-b3e7-de9c1ca8f2d7",
"name": "devuser/meta-llama/Llama-3-8b-chat-hf-a32b82a1",
"display_name": "My Llama3 70b endpoint",
"model": "meta-llama/Llama-3-8b-chat-hf",
"hardware": "1x_nvidia_a100_80gb_sxm",
"type": "dedicated",
"owner": "devuser",
"state": "PENDING",
"autoscaling": {
"min_replicas": 1,
"max_replicas": 3
},
"created_at": "2025-02-04T10:43:55.405Z"
}
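Stop and start are both PATCH calls that only set the state field, so in Python they can share one small helper (a sketch, not an official SDK function):
import os
import requests

HEADERS = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}

def set_endpoint_state(endpoint_id, state):
    """PATCH the endpoint state to "STOPPED" or "STARTED" and return the response body."""
    resp = requests.patch(
        f"https://api.together.xyz/v1/endpoints/{endpoint_id}",
        headers=HEADERS,
        json={"state": state},
    )
    resp.raise_for_status()
    return resp.json()

# Example: stop an endpoint, then bring it back up later.
set_endpoint_state("endpoint-d23901de-ef8f-44bf-b3e7-de9c1ca8f2d7", "STOPPED")
set_endpoint_state("endpoint-d23901de-ef8f-44bf-b3e7-de9c1ca8f2d7", "STARTED")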
Delete Endpoint
Remove an endpoint deployment.
curl -X DELETE https://api.together.xyz/v1/endpoints/{endpoint_id} \
-H "Authorization: Bearer YOUR_API_KEY"
Delete Endpoint Response
204 No Content
List Endpoints
Retrieve a list of all your endpoints. Pass type=dedicated to return only dedicated endpoints.
curl -X GET "https://api.together.xyz/v1/endpoints?type=dedicated" \
-H "Authorization: Bearer YOUR_API_KEY"
List Endpoints Response
{
"object": "list",
"data": [
{
"object": "endpoint",
"id": "endpoint-d23901de-ef8f-44bf-b3e7-de9c1ca8f2d7",
"name": "devuser/meta-llama/Llama-3-8b-chat-hf-a32b82a1",
"display_name": "My Llama3 70b endpoint",
"model": "meta-llama/Llama-3-8b-chat-hf",
"hardware": "1x_nvidia_a100_80gb_sxm",
"type": "dedicated",
"owner": "devuser",
"state": "PENDING",
"autoscaling": {
"min_replicas": 1,
"max_replicas": 3
},
"created_at": "2025-02-04T10:43:55.405Z"
}
]
}
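A Python sketch that lists your dedicated endpoints and, as an illustration of combining this with the delete call above, removes any that are fully stopped. Deletion is permanent, so treat the clean-up part as an example only.
import os
import requests

HEADERS = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}

resp = requests.get(
    "https://api.together.xyz/v1/endpoints", headers=HEADERS, params={"type": "dedicated"}
)
resp.raise_for_status()
for ep in resp.json()["data"]:
    print(ep["id"], ep["state"], ep["display_name"])
    if ep["state"] == "STOPPED":
        # Example clean-up: permanently delete endpoints that are stopped.
        requests.delete(
            f"https://api.together.xyz/v1/endpoints/{ep['id']}", headers=HEADERS
        ).raise_for_status()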
4. Job Management
List Jobs
View all jobs related to your endpoints and model imports. An example job object is shown below.
curl -X GET https://api.together.xyz/v1/jobs \
-H "Authorization: Bearer YOUR_API_KEY"
List Jobs Response
{
"type": "model_import",
"job_id": "job-3231364a-b1ef-41c1-b5f1-42414e70cc4f",
"status": "processing",
"status_updates": [
{
"status": "started",
"message": "Starting model import",
"timestamp": "2024-01-01T00:00:00Z"
}
],
"args": {
"model_name": "mistral-large-2407"
},
"created_at": "2024-01-01T00:00:00Z",
"updated_at": "2024-01-01T00:00:01Z"
}
Get Job Status
Check the status of a specific job.
curl -X GET https://api.together.xyz/v1/jobs/{job_id} \
-H "Authorization: Bearer YOUR_API_KEY"
Get Job Status Response
{
"object": "list",
"data": [
{
"type": "model_import",
"job_id": "job-3231364a-b1ef-41c1-b5f1-42414e70cc4f",
"status": "completed",
"status_updates": [
{
"status": "completed",
"message": "Model import successful",
"timestamp": "2024-01-01T00:01:00Z"
}
],
"args": {
"model_name": "mistral-large-2407"
},
"created_at": "2024-01-01T00:00:00Z",
"updated_at": "2024-01-01T00:01:00Z"
}
]
}
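Model imports run asynchronously, so a typical workflow polls the job until it reaches a terminal status. A sketch is below; the 30-second interval and the "failed" terminal status are assumptions for illustration.
import os
import time
import requests

HEADERS = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}

def wait_for_job(job_id, poll_seconds=30):
    """Poll a job until it reaches a terminal status and return the final job object."""
    while True:
        resp = requests.get(f"https://api.together.xyz/v1/jobs/{job_id}", headers=HEADERS)
        resp.raise_for_status()
        job = resp.json()["data"][0]  # shape follows the example response above
        print("job status:", job["status"])
        if job["status"] in ("completed", "failed"):  # "failed" is an assumed terminal status
            return job
        time.sleep(poll_seconds)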
Error Handling
The API uses standard HTTP response codes and returns error details in JSON format. For example:
{
"error": {
"message": "Internal server error",
"type": "server_error"
}
}
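From Python, you can surface these error bodies by checking the status code before parsing the response; a minimal sketch:
import os
import requests

HEADERS = {"Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}"}

# A deliberately invalid (hypothetical) endpoint ID, used only to trigger an error response.
resp = requests.get("https://api.together.xyz/v1/endpoints/endpoint-invalid", headers=HEADERS)
if not resp.ok:
    # Error bodies follow the {"error": {"message": ..., "type": ...}} shape shown above.
    err = resp.json().get("error", {})
    print(f"Request failed ({resp.status_code}): {err.get('type')}: {err.get('message')}")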