You can upload models from Hugging Face or S3 and run inference on a dedicated endpoint through Together AI.

Getting Started

Requirements

Currently, we support models that meet the following criteria:
  • Source: We support uploads from Hugging Face or S3.
  • Type: We support text generation and embedding models.
  • Scale: We currently only support models that fit in a single node. Multi-node models are not supported.
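As a rough check on scale before uploading, you can compare the total size of your model directory (bf16/fp16 weights take roughly 2 bytes per parameter) against the GPU memory available on a single node; /path/to/your/model below is a placeholder:
du -sh /path/to/your/model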

Model file structure

Your model files must be in standard Hugging Face model repository format, compatible with from_pretrained loading. A valid model directory should contain files like:
config.json
generation_config.json
model-00001-of-00004.safetensors
model-00002-of-00004.safetensors
model-00003-of-00004.safetensors
model-00004-of-00004.safetensors
model.safetensors.index.json
special_tokens_map.json
tokenizer.json
tokenizer_config.json
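Before uploading, you can optionally sanity-check that the directory loads with from_pretrained. A minimal sketch, assuming the transformers library is installed and your files live in /path/to/your/model:
python -c "from transformers import AutoConfig, AutoTokenizer; p='/path/to/your/model'; AutoConfig.from_pretrained(p); AutoTokenizer.from_pretrained(p); print('looks loadable')"
If this runs without errors, the config and tokenizer files are in the expected format.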

Uploading from Hugging Face

When uploading from Hugging Face, simply provide the repository path (e.g., meta-llama/Llama-2-7b-hf). The model will be fetched directly from the Hugging Face Hub. You’ll also need to provide your Hugging Face token.
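For example, using the CLI covered below (my-llama-2-7b is a placeholder name of your choosing):
together models upload \
  --model-name my-llama-2-7b \
  --model-source meta-llama/Llama-2-7b-hf \
  --hf-token $HF_TOKEN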

Uploading from S3

When uploading from S3, you must provide a presigned URL pointing to a single archive file containing the model files. Supported archive formats:
  • .zip
  • .tar
  • .tar.gz
Archive structure requirements: The model files must be at the root of the archive, not nested inside an extra top-level directory.
Correct - files at root:
config.json
model.safetensors
tokenizer.json
...
Incorrect - files nested in a directory:
my-model/
  config.json
  model.safetensors
  tokenizer.json
  ...
If you have a model directory, create the archive from within the directory:
cd /path/to/your/model
tar -czvf ../model.tar.gz .
Presigned URL requirements:
  • The presigned URL must point to the archive file in S3.
  • The presigned URL expiration time must be at least 100 minutes.
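As a sketch, assuming you use the AWS CLI and a placeholder bucket named my-bucket, you can upload the archive and generate a presigned URL whose lifetime comfortably exceeds the 100-minute minimum:
aws s3 cp model.tar.gz s3://my-bucket/model.tar.gz
aws s3 presign s3://my-bucket/model.tar.gz --expires-in 7200
The --expires-in value is in seconds; 7200 seconds is 2 hours.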

Upload the model

Model uploads can be done via the UI or CLI.

UI

To upload via the web, log in and navigate to models > upload a model to reach this page:
Upload model
Then fill in the model source (Hugging Face repo path or S3 presigned URL), the model name, and a description of how you would like the model to appear in your Together account once uploaded.

CLI

Upload a model from Hugging Face or S3:
together models upload \
  --model-name <your_model_name> \
  --model-source <path_to_model_or_repo> \
  --hf-token <your_HF_token>
Options:
  • --model-name (required): The name to give to your uploaded model
  • --model-source (required): Hugging Face repo path or S3 presigned URL
  • --hf-token (required for Hugging Face): Your Hugging Face token; required for most Hugging Face models
  • --model-type (optional): model (default) or adapter
  • --description (optional): A description of your model
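For example, uploading from S3 with a description (the bucket, key, and query string below are placeholders for your own presigned URL):
together models upload \
  --model-name my-custom-model \
  --model-source "https://my-bucket.s3.amazonaws.com/model.tar.gz?X-Amz-Expires=..." \
  --description "Custom model uploaded from S3"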

Checking the status of your upload

When an upload is kicked off, it returns a job ID. You can poll our API with that job ID until the model has finished uploading.
curl -X GET "https://api.together.ai/v1/jobs/{jobId}" \
     -H "Authorization: Bearer $TOGETHER_API_KEY" \
     -H "Content-Type: application/json"
The response contains a “status” field. When the status is “Complete”, your model is ready to be deployed.
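A simple way to wait for completion is a polling loop. A minimal sketch, assuming jq is installed and the returned job ID is stored in $JOB_ID:
while true; do
  STATUS=$(curl -s "https://api.together.ai/v1/jobs/$JOB_ID" \
       -H "Authorization: Bearer $TOGETHER_API_KEY" | jq -r '.status')
  echo "status: $STATUS"
  if [ "$STATUS" = "Complete" ]; then break; fi
  sleep 30
done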

Deploy the model

Uploaded models are treated like any other dedicated endpoint model. You can deploy via the UI or CLI.

UI

All custom and fine-tuned models, as well as any model that has a dedicated endpoint, are listed under My Models. To deploy, select the model to open its model page.
My Models
The model page will display details from your uploaded model with an option to create a dedicated endpoint.
Create Dedicated Endpoint
When you select ‘Create Dedicated Endpoint’, you will see options to configure the deployment.
Create Dedicated Endpoint
Once an endpoint has been deployed, you can interact with it on the playground or via the API.
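For example, you can query the endpoint through the OpenAI-compatible chat completions API. A minimal sketch, assuming your uploaded model is a chat/instruct model and <your_model_name> is the name you gave it at upload:
curl -X POST "https://api.together.xyz/v1/chat/completions" \
     -H "Authorization: Bearer $TOGETHER_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{
       "model": "<your_model_name>",
       "messages": [{"role": "user", "content": "Hello, who are you?"}]
     }'
For base text generation or embedding models, use the corresponding completions or embeddings routes instead.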

CLI

After uploading your model, you can verify its registration and check available hardware options. List your uploaded models:
together models list
View available GPU SKUs for a specific model:
together endpoints hardware --model <model-name>
Once your model is uploaded, create a dedicated inference endpoint:
together endpoints create \
  --display-name <endpoint-name> \
  --model <model-name> \
  --gpu h100 \
  --no-speculative-decoding \
  --gpu-count 2
After deploying, you can view all your endpoints and retrieve connection details such as URL, scaling configuration, and status. List all endpoints:
together endpoints list
Get details for a specific endpoint:
together endpoints get <endpoint-id>