To use your model, you can either:
  1. Host it on Together AI as a dedicated endpoint, for an hourly rate billed by the minute
  2. Download your model and run it locally

Hosting your model on Together AI

Dedicated endpoints bill per minute even when idle. Stop or delete the endpoint when you’re done to avoid charges.
You can deploy a fine-tuned model as a dedicated endpoint through the dashboard or programmatically.
To use the dashboard, open the models dashboard, select your model, and click Create dedicated endpoint to launch a dedicated endpoint for the fine-tuned model.
Whenever you're not using the endpoint, return to the dashboard and stop it to halt billing.
For full endpoint management options, see Dedicated endpoints.
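Because an endpoint accrues charges for every minute it runs, an idle endpoint that is left running adds up quickly. The sketch below estimates that cost. It assumes the charge is simply proportional to running time, and HOURLY_RATE_USD is a hypothetical placeholder rather than a real Together AI price; check the pricing page for actual rates.

```python
# Hypothetical hourly rate in USD, for illustration only.
HOURLY_RATE_USD = 2.50


def endpoint_cost(minutes_running: int, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Estimate the charge for an endpoint that has run for `minutes_running` minutes.

    Assumes the charge is proportional to running time (hourly_rate / 60 per minute),
    whether or not the endpoint served any requests while it was up.
    """
    if minutes_running < 0:
        raise ValueError("minutes_running must be non-negative")
    return round(hourly_rate / 60 * minutes_running, 2)


# An endpoint forgotten overnight (10 hours) is still charged while idle.
print(endpoint_cost(10 * 60))
```

With the placeholder rate, ten idle hours cost as much as ten busy ones, which is why stopping the endpoint matters.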

Running your model locally

To run your model locally, first download it by running the download command with your fine-tuning job ID:
tg fine-tuning download "ft-bb62e747-b8fc-49a3-985c-f32f7cc6bb04"
Your model is downloaded to the location specified in output as a tar.zst file, which is a tar archive compressed with the Zstandard (zstd) algorithm. You'll need to install Zstandard to decompress it. On macOS, you can install it with Homebrew and then decompress and extract the archive:
brew install zstd
cd my-model
zstd -d model.tar.zst
tar -xvf model.tar
cd ..
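If you'd rather script this step, the same two commands can be driven from Python. This is a minimal sketch, assuming the archive was downloaded to my-model/model.tar.zst and that the zstd command-line tool is already installed; decompress_zst and extract_tar are hypothetical helper names. Python's tarfile module handles the extraction, using its "data" filter where available to reject unsafe paths in the archive.

```python
import subprocess
import tarfile
from pathlib import Path
from typing import List


def decompress_zst(archive: Path) -> Path:
    """Run the zstd CLI to turn model.tar.zst into model.tar next to it."""
    tar_path = archive.with_suffix("")  # model.tar.zst -> model.tar
    # -d decompresses, -f overwrites an existing output file, -o names the output.
    subprocess.run(["zstd", "-d", "-f", str(archive), "-o", str(tar_path)], check=True)
    return tar_path


def extract_tar(tar_path: Path, dest: Path) -> List[str]:
    """Extract a plain .tar archive into dest and return the member names."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path, mode="r:") as tar:
        names = tar.getnames()
        if hasattr(tarfile, "data_filter"):
            # Newer Pythons: block absolute paths, ".." traversal, and special files.
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return names


# Usage, assuming the archive is at my-model/model.tar.zst:
# tar_path = decompress_zst(Path("my-model") / "model.tar.zst")
# extract_tar(tar_path, Path("my-model"))
```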
Once the archive is decompressed and extracted, the my-model directory should contain files like the following (the exact set may vary by model):
tokenizer_config.json
special_tokens_map.json
pytorch_model.bin
generation_config.json
tokenizer.json
config.json
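To confirm the extraction worked before loading the model, you can check the directory for the files listed above. This is a minimal sketch; missing_files is a hypothetical helper, and the expected names are taken from the list above, so adjust them if your archive contains different weight files.

```python
from pathlib import Path
from typing import List, Sequence, Union

# File names from the list above; your model may include additional files.
EXPECTED_FILES = (
    "tokenizer_config.json",
    "special_tokens_map.json",
    "pytorch_model.bin",
    "generation_config.json",
    "tokenizer.json",
    "config.json",
)


def missing_files(model_dir: Union[str, Path], expected: Sequence[str] = EXPECTED_FILES) -> List[str]:
    """Return the expected file names that are not present in model_dir."""
    model_dir = Path(model_dir)
    return [name for name in expected if not (model_dir / name).is_file()]


# Usage: an empty list means every expected file is present.
# print(missing_files("my-model"))
```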
You can load these files with a variety of libraries to run your model locally. For example, Hugging Face Transformers is a popular Python library for working with pretrained models, and using it with your new model looks like this:
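from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Use a GPU if one is available; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the tokenizer and weights from the directory the archive was extracted into.
tokenizer = AutoTokenizer.from_pretrained("./my-model")

model = AutoModelForCausalLM.from_pretrained(
    "./my-model",
    trust_remote_code=True,
).to(device)

input_context = "Space Robots are"
# Calling the tokenizer directly returns both input_ids and the attention_mask.
inputs = tokenizer(input_context, return_tensors="pt").to(device)
output = model.generate(
    **inputs,
    max_new_tokens=128,  # number of tokens to generate after the prompt
    do_sample=True,      # temperature only takes effect when sampling is enabled
    temperature=0.7,
).cpu()
output_text = tokenizer.decode(output[0], skip_special_tokens=True)

print(output_text)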
Your exact output will differ, but it should look something like this:
Space Robots are a great way to get your kids interested in science. After all, they are the future!
If you see generated text like this, your new model is working.
You now have a custom fine-tuned model that you can run completely locally, either on your own machine or on networked hardware of your choice.