A dedicated endpoint serves a single model on hardware reserved only for you, offering predictable latency and no shared-fleet rate limits. Dedicated endpoints are highly configurable: you can upload custom fine-tuned models and tune autoscaling and decoding optimizations to match your workload. They use the same inference APIs as serverless models, so you can prototype with serverless and then switch to a dedicated endpoint without changing your application code.

## Documentation index
Fetch the complete documentation index at: https://docs.together.ai/llms.txt
Use this file to discover all available pages before exploring further.
## Get started

- **Quickstart**: Deploy and call your first endpoint in 5 minutes.
- **Manage endpoints**: Create, start, stop, update, and delete endpoints via the UI or API.
- **Endpoint settings**: Configure endpoint hardware, autoscaling, decoding, and prompt caching.
- **Inference APIs**: Explore the API surface for chat, vision, audio, embeddings, and more.
- **Available models**: Browse Together-hosted models you can deploy on dedicated endpoints.
- **Upload a custom model**: Upload your own model weights.
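Because dedicated endpoints share the inference APIs with serverless models, the same OpenAI-compatible chat completions request works for both; only the `model` field changes. A minimal sketch using only the standard library, assuming a hypothetical dedicated model ID and a `TOGETHER_API_KEY` environment variable:

```python
import json
import os
import urllib.request

# Hypothetical model ID: replace with your deployed dedicated endpoint's model name.
MODEL = "your-account/your-dedicated-model"

# The same chat payload works against serverless or dedicated endpoints.
payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}

# Only send the request when an API key is configured (sketch, not run here).
if os.environ.get("TOGETHER_API_KEY"):
    req = urllib.request.Request(
        "https://api.together.xyz/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['TOGETHER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```

Switching from serverless to dedicated is then a one-line change: point `MODEL` at the dedicated endpoint's model ID and leave the rest of the application code untouched.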
## Pricing

Dedicated endpoints are billed per minute by hardware type while the endpoint is running, regardless of your model or request volume.

| Hardware type | Cost/hour |
|---|---|
| 1x H100 80GB | $3.99 |
| 1x H200 141GB | $5.49 |
| 1x B200 180GB | $9.95 |
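Since billing depends only on hardware type and running time, cost estimation is simple arithmetic. A sketch using the rates from the table above (`monthly_cost` is a hypothetical helper, not part of any API):

```python
# Hourly rates from the pricing table (USD per GPU-hour).
RATES = {
    "1x H100 80GB": 3.99,
    "1x H200 141GB": 5.49,
    "1x B200 180GB": 9.95,
}

def monthly_cost(hardware: str, hours_per_day: float, days: int = 30) -> float:
    """Estimate the bill for an endpoint running a fixed daily schedule.

    Billing is per minute while the endpoint is running, independent of
    the model deployed or the number of requests served.
    """
    return round(RATES[hardware] * hours_per_day * days, 2)

# An H100 endpoint running 8 hours a day for a 30-day month:
print(monthly_cost("1x H100 80GB", hours_per_day=8))  # prints 957.6
```

Stopping the endpoint outside business hours, or letting autoscaling scale replicas down, reduces the running time and therefore the bill proportionally.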