POST /endpoints
import os

from together import Together

client = Together(
    api_key=os.environ.get("TOGETHER_API_KEY"),
)

# Create a dedicated endpoint; the replica bounds drive autoscaling.
endpoint = client.endpoints.create(
    model="meta-llama/Llama-3-8b-chat-hf",
    hardware="1x_nvidia_a100_80gb_sxm",
    min_replicas=2,
    max_replicas=5,
)

print(endpoint.id)

Example response:
{
  "object": "endpoint",
  "id": "endpoint-d23901de-ef8f-44bf-b3e7-de9c1ca8f2d7",
  "name": "devuser/meta-llama/Llama-3-8b-chat-hf-a32b82a1",
  "display_name": "My Llama3 70b endpoint",
  "model": "meta-llama/Llama-3-8b-chat-hf",
  "hardware": "1x_nvidia_a100_80gb_sxm",
  "type": "dedicated",
  "owner": "devuser",
  "state": "STARTED",
  "autoscaling": {
    "min_replicas": 2,
    "max_replicas": 5
  },
  "created_at": "2025-02-04T10:43:55.405Z"
}

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
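
For raw HTTP clients, the same call is made by placing the token in the Authorization header. Below is a minimal sketch using the requests library; the base URL https://api.together.xyz/v1 is an assumption of this example and is not stated on this page.

import os

import requests

url = "https://api.together.xyz/v1/endpoints"  # assumed base URL

headers = {
    # Bearer authentication header of the form "Bearer <token>".
    "Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}",
    "Content-Type": "application/json",
}

payload = {
    "model": "meta-llama/Llama-3-8b-chat-hf",
    "hardware": "1x_nvidia_a100_80gb_sxm",
    # Unlike the SDK example above, the raw body nests the replica
    # bounds under the required "autoscaling" object.
    "autoscaling": {"min_replicas": 2, "max_replicas": 5},
}

response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()
print(response.json()["id"])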

Body

application/json
model
string
required

The model to deploy on this endpoint

Examples:

"meta-llama/Llama-3-8b-chat-hf"

hardware
string
required

The hardware configuration to use for this endpoint

Examples:

"1x_nvidia_a100_80gb_sxm"

autoscaling
object
required

Configuration for automatic scaling of the endpoint, containing the min_replicas and max_replicas bounds (see the request-body sketch after this parameter list)

display_name
string

A human-readable name for the endpoint

Examples:

"My Llama3 70b endpoint"

disable_prompt_cache
boolean
default:false

Whether to disable the prompt cache for this endpoint

disable_speculative_decoding
boolean
default:false

Whether to disable speculative decoding for this endpoint

state
enum<string>
default:STARTED

The desired state of the endpoint

Available options:
STARTED,
STOPPED
Example:

"STARTED"

inactive_timeout
integer | null

The number of minutes of inactivity after which the endpoint is automatically stopped. Set to null, omit, or set to 0 to disable the automatic timeout.

Example:

60
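
Putting the required and optional parameters together, a full request body might look like the sketch below. The values are illustrative; only the field names, defaults, and enum options come from the schema above.

# Illustrative request body covering the parameters above.
payload = {
    "model": "meta-llama/Llama-3-8b-chat-hf",   # required
    "hardware": "1x_nvidia_a100_80gb_sxm",      # required
    "autoscaling": {                            # required
        "min_replicas": 2,
        "max_replicas": 5,
    },
    "display_name": "My Llama3 70b endpoint",   # optional
    "disable_prompt_cache": False,              # default: false
    "disable_speculative_decoding": False,      # default: false
    "state": "STARTED",                         # or "STOPPED"
    "inactive_timeout": 60,                     # minutes; null or 0 disables
}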

Response

200

Details about a dedicated endpoint deployment

object
enum<string>
required

The type of object

Available options:
endpoint
Example:

"endpoint"

id
string
required

Unique identifier for the endpoint

Example:

"endpoint-d23901de-ef8f-44bf-b3e7-de9c1ca8f2d7"

name
string
required

System name for the endpoint

Example:

"devuser/meta-llama/Llama-3-8b-chat-hf-a32b82a1"

display_name
string
required

Human-readable name for the endpoint

Example:

"My Llama3 70b endpoint"

model
string
required

The model deployed on this endpoint

Example:

"meta-llama/Llama-3-8b-chat-hf"

hardware
string
required

The hardware configuration used for this endpoint

Example:

"1x_nvidia_a100_80gb_sxm"

type
enum<string>
required

The type of endpoint

Available options:
dedicated
Example:

"dedicated"

owner
string
required

The owner of this endpoint

Example:

"devuser"

state
enum<string>
required

Current state of the endpoint (see the polling sketch at the end of this page)

Available options:
PENDING,
STARTING,
STARTED,
STOPPING,
STOPPED,
ERROR
Example:

"STARTED"

autoscaling
object
required

Configuration for automatic scaling of the endpoint

created_at
string<date-time>
required

Timestamp when the endpoint was created

Example:

"2025-02-04T10:43:55.405Z"