Once a fine-tuning job completes, your model is available for inference in two ways: hosted on a dedicated endpoint at Together AI, or downloaded as a standalone checkpoint.

Prerequisites

  • A completed fine-tuning job. See the quickstart for the full lifecycle.
  • The job’s x_model_output_name (visible once status is completed). It follows the pattern <your_account>/<base_model>:<suffix>:<job_id>.
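If you need the individual parts of the output model name (for logging or for naming an endpoint, say), the pattern above can be split with plain string operations. This is a minimal sketch that assumes the name always has exactly the documented shape; parse_output_model_name and the example value are hypothetical and not part of the Together SDK.

```python
from typing import NamedTuple


class OutputModelName(NamedTuple):
    account: str
    base_model: str
    suffix: str
    job_id: str


def parse_output_model_name(name: str) -> OutputModelName:
    """Split '<your_account>/<base_model>:<suffix>:<job_id>' into its parts."""
    account, sep, rest = name.partition("/")
    if not sep or not account:
        raise ValueError(f"missing '<your_account>/' prefix: {name!r}")
    # Split from the right so only the last two colons act as separators.
    parts = rest.rsplit(":", 2)
    if len(parts) != 3:
        raise ValueError(f"expected '<base_model>:<suffix>:<job_id>': {name!r}")
    return OutputModelName(account, *parts)


# Hypothetical example value, not a real model name.
example = parse_output_model_name("acme/base-model-8b:support-bot:ft-1234")
print(example.job_id)
```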

Deploy on a dedicated endpoint

Dedicated endpoints bill per minute even when idle. Stop or delete the endpoint when you’re done to avoid charges.
import time
from together import Together

# Together() reads the API key from the TOGETHER_API_KEY environment variable.
client = Together()

# 1. Get the output model name from the completed job
status = client.fine_tuning.retrieve(id="<JOB_ID>")
output_model = status.x_model_output_name

# 2. Preflight: confirm the base can host a fine-tune.
# A 404 means the base (often a `-Reference` model) can't be deployed.
client.endpoints.list_hardware(model=status.model)

# 3. Create the endpoint. Use a hardware id returned by list_hardware
# above; the exact options depend on the base model.
endpoint = client.endpoints.create(
    display_name="My fine-tuned endpoint",
    model=output_model,
    hardware="1x_nvidia_h100_80gb_sxm",
    autoscaling={"min_replicas": 1, "max_replicas": 1},
)

# 4. Poll until ready
deadline = time.time() + 20 * 60  # safety cap: 20 minutes
while True:
    ep = client.endpoints.retrieve(endpoint.id)
    if ep.state == "STARTED":
        break
    if ep.state in ("FAILED", "STOPPED"):
        raise RuntimeError(f"Endpoint ended with state: {ep.state}")
    if time.time() > deadline:
        raise TimeoutError(f"Endpoint still {ep.state} after 20 minutes")
    time.sleep(30)

# 5. Send a request. Use endpoint.name (not the raw output model) as the model parameter.
response = client.chat.completions.create(
    model=endpoint.name,
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

# 6. Delete when done to stop billing
client.endpoints.delete(endpoint.id)
If endpoint creation fails immediately with “There was an issue starting your endpoint”, the cause is almost always an incompatible base model. Verify with client.endpoints.list_hardware(model=...); a 404 means the base (often a -Reference model) can’t host a fine-tune. Pick a different base before retrying.

Run locally

To run your model outside Together, download the checkpoint by job ID:
import os

from together import Together

client = Together()

# `checkpoint` is required unless you set `checkpoint_step`.
# Use "merged" for LoRA jobs; full fine-tunes only support "model_output_path".
response = client.fine_tuning.content(ft_id="<JOB_ID>", checkpoint="merged")

# Make sure the target directory exists before writing the archive.
os.makedirs("my-model", exist_ok=True)
response.write_to_file("my-model/model.tar.zst")

Choose a checkpoint type

The checkpoint parameter selects what to download. It’s required for the v2 SDK’s content() method and the GET /v1/finetune/download endpoint, unless you pass checkpoint_step, which downloads a specific intermediate step and overrides checkpoint. Valid values depend on how the job was trained.
| Job type | Valid checkpoint values | What you get |
| --- | --- | --- |
| LoRA fine-tune | merged, adapter, or model_output_path | merged combines the base model and adapter into self-contained weights, the usual choice for running the model locally or uploading it elsewhere. adapter returns only the LoRA adapter weights, so you can load them on top of the base model yourself (for example, with PEFT or vLLM). |
| Full fine-tune | model_output_path only | The full set of trained model weights. merged and adapter return an error for full fine-tunes. |
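The table can be encoded as a small lookup so that an invalid combination fails locally before a download request is sent. This is a sketch only: the job-type labels "lora" and "full" and the check_checkpoint helper are hypothetical names chosen for illustration, and the API remains the source of truth for what it accepts.

```python
# Valid checkpoint values per job type, copied from the table.
# The "lora" and "full" keys are hypothetical labels, not API values.
VALID_CHECKPOINTS = {
    "lora": ("merged", "adapter", "model_output_path"),
    "full": ("model_output_path",),
}


def check_checkpoint(job_type: str, checkpoint: str) -> str:
    """Return checkpoint unchanged if it is valid for job_type, else raise."""
    if job_type not in VALID_CHECKPOINTS:
        raise ValueError(f"unknown job type: {job_type!r}")
    allowed = VALID_CHECKPOINTS[job_type]
    if checkpoint not in allowed:
        raise ValueError(
            f"checkpoint {checkpoint!r} is not valid for {job_type} jobs; "
            f"choose one of {allowed}"
        )
    return checkpoint


print(check_checkpoint("lora", "merged"))
```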
model_output_path returns the raw training output directory before any merging. It works for both job types but is mainly useful for advanced workflows that need the unmodified artifacts: for LoRA jobs, prefer merged or adapter; for full fine-tunes, it’s the only option.

The v1 SDK’s client.fine_tuning.download() selects the checkpoint automatically (merged for LoRA jobs, model_output_path for full fine-tunes), so you don’t pass a checkpoint argument there.

The output is a .tar.zst archive that uses Zstandard compression. On macOS, install zstd with Homebrew and decompress:
brew install zstd
cd my-model
zstd -d model.tar.zst
tar -xvf model.tar
You should see:
tokenizer_config.json
special_tokens_map.json
pytorch_model.bin
generation_config.json
tokenizer.json
config.json
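A quick way to confirm the extraction worked is to compare the directory against the file list above. The sketch below assumes your checkpoint contains exactly these files, which may not hold for every model or checkpoint type; missing_files is a hypothetical helper, and the temporary directory only demonstrates its behavior.

```python
import tempfile
from pathlib import Path

# File names copied from the listing above; adjust if your archive differs.
EXPECTED_FILES = (
    "tokenizer_config.json",
    "special_tokens_map.json",
    "pytorch_model.bin",
    "generation_config.json",
    "tokenizer.json",
    "config.json",
)


def missing_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


# Demonstration on a throwaway directory that contains only config.json.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text("{}")
    demo_missing = missing_files(tmp)

print(demo_missing)
```

For the real checkpoint, call missing_files("my-model"); an empty list means every file in the listing is present.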
Load the model with Hugging Face Transformers:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("./my-model")
model = AutoModelForCausalLM.from_pretrained(
    "./my-model",
    trust_remote_code=True,
).to(device)

input_ids = tokenizer.encode("Space robots are", return_tensors="pt").to(
    device
)
# temperature only takes effect when sampling is enabled with do_sample=True.
output = model.generate(
    input_ids, max_length=128, do_sample=True, temperature=0.7
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
To download a specific checkpoint instead of the final one, pass --checkpoint-step <STEP_NUMBER> to tg fine-tuning download (or checkpoint_step=<STEP_NUMBER> to client.fine_tuning.content()). List checkpoints with tg fine-tuning list-checkpoints <JOB_ID>.

Troubleshooting

  • x_model_output_name is empty: The job hasn’t reached completed. Poll status with client.fine_tuning.retrieve(id=...) until it’s done. See Monitor a fine-tuning job for the polling pattern.
  • Endpoint creation fails immediately: Run client.endpoints.list_hardware(model=<base_model>). A 404 means the base can’t host a fine-tune. -Reference models fall into this bucket.
  • 404 on inference: Use endpoint.name as the model parameter, not the raw output model name. The endpoint name includes a unique suffix that routes traffic to your deployment.
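The polling advice in the first bullet can be captured in a small reusable helper with a timeout. This is a sketch under assumptions, not the pattern from the monitoring guide: wait_for_status is a hypothetical function, and "pending", "running", and "error" are placeholder values (only "completed" comes from this page). The sleep and clock arguments are injectable so the demo runs instantly.

```python
import time
from typing import Callable, Collection


def wait_for_status(
    get_status: Callable[[], str],
    done: Collection[str],
    failed: Collection[str] = (),
    timeout_s: float = 20 * 60,
    interval_s: float = 30,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call get_status until it returns a value in done; raise on failure or timeout."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in done:
            return status
        if status in failed:
            raise RuntimeError(f"ended with status: {status}")
        if clock() > deadline:
            raise TimeoutError(f"still {status} after {timeout_s} seconds")
        sleep(interval_s)


# Demo with a scripted status sequence instead of a real API call.
statuses = iter(["pending", "running", "completed"])
result = wait_for_status(
    lambda: next(statuses),
    done={"completed"},
    failed={"error"},
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(result)  # completed
```

The same helper fits the endpoint loop earlier on this page: pass a get_status that returns client.endpoints.retrieve(endpoint.id).state, with done={"STARTED"} and failed={"FAILED", "STOPPED"}.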

Next steps

Upload a custom model

Upload your own model weights from outside the Together catalog.

Upload a LoRA adapter

Load a LoRA adapter onto a shared base instead of deploying a full model.

Manage endpoints

Inspect, start, stop, update, and delete dedicated endpoints.

Endpoint settings

Tune autoscaling, decoding, and auto-shutdown.