To halt billing, return to the dashboard and stop the endpoint whenever you’re not using it.
First, retrieve the output model name from your completed fine-tuning job. The x_model_output_name field is empty while the job is pending, queued, or running; it’s populated only once the job reaches completed.

Then create the endpoint, wait until it’s ready, query it, and delete it when you’re done. Use endpoint.name (not the raw output model name) as the model parameter for inference.

    import os
    import time

    from together import Together

    client = Together(api_key=os.environ.get("TOGETHER_API_KEY"))

    # 1. Get the output model name from the completed job
    status = client.fine_tuning.retrieve(id="ft-xxxx-yyyy")
    output_model = status.x_model_output_name

    # 2. Create the endpoint
    endpoint = client.endpoints.create(
        display_name="My fine-tuned endpoint",
        model=output_model,
        hardware="4x_nvidia_h100_80gb_sxm",
        autoscaling={"min_replicas": 1, "max_replicas": 1},
    )

    # 3. Wait until ready, then query
    while True:
        ep = client.endpoints.retrieve(endpoint.id)
        if ep.state == "STARTED":
            break
        if ep.state in ("FAILED", "STOPPED"):
            raise RuntimeError(f"Endpoint ended with state: {ep.state}")
        time.sleep(30)

    response = client.chat.completions.create(
        model=endpoint.name,
        messages=[
            {
                "role": "user",
                "content": "What are some fun things to do in New York?",
            }
        ],
    )
    print(response.choices[0].message.content)

    # 4. Delete when done to stop billing
    client.endpoints.delete(endpoint.id)

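The wait loop above polls indefinitely if the endpoint never reaches a terminal state. One possible refinement is a small polling helper with a deadline. This is a sketch, not part of the Together SDK: the function name wait_for_state, its parameters, and the 30-minute default timeout are all assumptions chosen for illustration. The state strings are the ones used in the example above.

```python
import time
from typing import Callable, Iterable


def wait_for_state(
    get_state: Callable[[], str],
    ready: str = "STARTED",
    failed: Iterable[str] = ("FAILED", "STOPPED"),
    timeout_s: float = 1800.0,   # assumed default; tune for your hardware
    interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_state() until it returns `ready`; raise on failure or timeout."""
    failed = tuple(failed)
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state == ready:
            return state
        if state in failed:
            raise RuntimeError(f"Endpoint ended with state: {state}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {state!r} after {timeout_s} seconds")
        sleep(interval_s)
```

With the client from the example, this could be called as wait_for_state(lambda: client.endpoints.retrieve(endpoint.id).state). Wrapping the query in try/finally, with client.endpoints.delete(endpoint.id) in the finally clause, keeps a failed request from leaving a billed endpoint running.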
Your model will be downloaded to the location specified in output as a tar.zst file, an archive format that uses the ZStandard compression algorithm. You’ll need to install ZStandard to decompress your model.

On Macs, you can use Homebrew:

    brew install zstd

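Once ZStandard is installed, the archive can also be unpacked from Python by shelling out to the zstd and tar command-line tools. The following is a minimal sketch under stated assumptions: the helper names extraction_commands and extract_model are hypothetical, as are any archive and directory names you pass in (for example model.tar.zst and my-model).

```python
import shutil
import subprocess
from pathlib import Path


def extraction_commands(archive: Path, dest: Path) -> list[list[str]]:
    """Build the zstd and tar invocations that unpack a .tar.zst archive into dest."""
    # Drop the trailing ".zst" to name the intermediate tar file,
    # e.g. model.tar.zst -> model.tar
    tar_path = dest / archive.with_suffix("").name
    return [
        ["zstd", "-d", str(archive), "-o", str(tar_path)],  # decompress
        ["tar", "-xf", str(tar_path), "-C", str(dest)],     # unpack into dest
    ]


def extract_model(archive: Path, dest: Path) -> None:
    """Decompress and unpack the archive; needs the zstd and tar CLIs on PATH."""
    for tool in ("zstd", "tar"):
        if shutil.which(tool) is None:
            raise FileNotFoundError(f"{tool} not found on PATH")
    dest.mkdir(parents=True, exist_ok=True)
    for cmd in extraction_commands(archive, dest):
        subprocess.run(cmd, check=True)
```

Keeping command construction separate from execution makes it easy to print the commands for inspection before running them.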
The decompressed model files can be used with various libraries and languages to run your model locally. Transformers is a popular Python library for working with pretrained models, and using it with your new model looks like this:
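A minimal sketch of local inference with Transformers, assuming the extracted directory holds a full causal language model checkpoint in Hugging Face format (including a config.json) and that torch and transformers are installed. The function name generate_text, the directory path, and the 64-token default are hypothetical choices for illustration.

```python
from pathlib import Path


def generate_text(model_dir: str, prompt: str, max_new_tokens: int = 64) -> str:
    """Load a local Hugging Face checkpoint and return a completion for prompt."""
    # Fail early with a clear message if the archive was not extracted here.
    if not (Path(model_dir) / "config.json").is_file():
        raise FileNotFoundError(
            f"No config.json in {model_dir}; was the archive extracted?"
        )

    # Imported lazily so the heavy dependencies load only when generating.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(model_dir).to(device)

    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

It could then be called as print(generate_text("./my-model", "What are some fun things to do in New York?")), where ./my-model is whichever directory you extracted the archive into.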