# Deploying a fine-tuned model

> Once your fine-tuning job completes, your new model appears in [your models dashboard](https://api.together.ai/models).

<Frame>
  <img src="https://mintcdn.com/togetherai-52386018/msWWavplJrEZR36N/images/docs/6171b3cfa4cc84a7ab09099af13064563184722f04b35a788abd122347864d28-image.png?fit=max&auto=format&n=msWWavplJrEZR36N&q=85&s=6532d94969ffda2b3f1f19bff5cf573b" alt="" width="1432" height="275" data-path="images/docs/6171b3cfa4cc84a7ab09099af13064563184722f04b35a788abd122347864d28-image.png" />
</Frame>

To use your model, you can either:

1. Host it on Together AI as a [dedicated endpoint](/docs/dedicated-endpoints/overview), priced at an hourly rate and billed per minute
2. Download your model and run it locally

## Hosting your model on Together AI

<Warning>Dedicated endpoints bill per minute even when idle. Stop or delete the endpoint when you're done to avoid charges.</Warning>

You can deploy a fine-tuned model as a dedicated endpoint through the dashboard or programmatically.

<Tabs>
  <Tab title="Dashboard">
    Open your model in [the models dashboard](https://api.together.ai/models) and click **Create dedicated endpoint** to launch a [dedicated endpoint](/docs/dedicated-endpoints-ui) for the fine-tuned model.

    <Frame>
      <img src="https://mintcdn.com/togetherai-52386018/msWWavplJrEZR36N/images/docs/b17fd6bd03dcfb26b91389b864cf0ce3a275a2f22db2b56a975b1ffdba3c7789-image.png?fit=max&auto=format&n=msWWavplJrEZR36N&q=85&s=25985393a46001b7f6caa2411fa8e4a6" alt="" width="1441" height="610" data-path="images/docs/b17fd6bd03dcfb26b91389b864cf0ce3a275a2f22db2b56a975b1ffdba3c7789-image.png" />
    </Frame>

    To halt billing, return to the dashboard and stop the endpoint whenever you're not using it.
  </Tab>

  <Tab title="CLI/SDK">
    First, retrieve the output model name from your completed fine-tuning job. The `x_model_output_name` field is empty while the job is `pending`, `queued`, or `running`; it's populated only once the job reaches `completed`.

    Then create the endpoint, wait until it's ready, query it, and delete it when you're done. Use `endpoint.name` (not the raw output model name) as the `model` parameter for inference.

    <CodeGroup>
      ```python Python theme={null}
      import os
      import time
      from together import Together

      client = Together(api_key=os.environ.get("TOGETHER_API_KEY"))

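      # 1. Get the output model name from the completed job
      status = client.fine_tuning.retrieve(id="ft-xxxx-yyyy")
      output_model = status.x_model_output_name
      if not output_model:
          # The field stays empty until the job reaches `completed`
          raise RuntimeError("Fine-tuning job has not completed yet")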

      # 2. Create the endpoint
      endpoint = client.endpoints.create(
          display_name="My fine-tuned endpoint",
          model=output_model,
          hardware="4x_nvidia_h100_80gb_sxm",
          autoscaling={"min_replicas": 1, "max_replicas": 1},
      )

      # 3. Wait until ready, then query
      while True:
          ep = client.endpoints.retrieve(endpoint.id)
          if ep.state == "STARTED":
              break
          if ep.state in ("FAILED", "STOPPED"):
              raise RuntimeError(f"Endpoint ended with state: {ep.state}")
          time.sleep(30)

      response = client.chat.completions.create(
          model=endpoint.name,
          messages=[
              {
                  "role": "user",
                  "content": "What are some fun things to do in New York?",
              }
          ],
      )
      print(response.choices[0].message.content)

      # 4. Delete when done to stop billing
      client.endpoints.delete(endpoint.id)
      ```

      ```typescript TypeScript theme={null}
      import Together from 'together-ai';

      const client = new Together({ apiKey: process.env['TOGETHER_API_KEY'] });

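      // 1. Get the output model name from the completed job
      const status = await client.fineTuning.retrieve('ft-xxxx-yyyy');
      const outputModel = status.x_model_output_name;
      if (!outputModel) {
        // The field stays empty until the job reaches `completed`
        throw new Error('Fine-tuning job has not completed yet');
      }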

      // 2. Create the endpoint
      const endpoint = await client.endpoints.create({
        display_name: 'My fine-tuned endpoint',
        model: outputModel,
        hardware: '4x_nvidia_h100_80gb_sxm',
        autoscaling: { min_replicas: 1, max_replicas: 1 },
      });

      // 3. Wait until ready, then query
      while (true) {
        const ep = await client.endpoints.retrieve(endpoint.id);
        if (ep.state === 'STARTED') break;
        if (ep.state === 'FAILED' || ep.state === 'STOPPED') {
          throw new Error(`Endpoint ended with state: ${ep.state}`);
        }
        await new Promise((r) => setTimeout(r, 30000));
      }

      const response = await client.chat.completions.create({
        model: endpoint.name,
        messages: [{ role: 'user', content: 'What are some fun things to do in New York?' }],
      });
      console.log(response.choices[0].message.content);

      // 4. Delete when done to stop billing
      await client.endpoints.delete(endpoint.id);
      ```

      ```shell CLI theme={null}
      ## List recent jobs to find the output model name
      tg fine-tuning list

      ## Create the endpoint (--wait blocks until it's ready)
      tg endpoints create \
        --model <your-model-output-name> \
        --hardware 4x_nvidia_h100_80gb_sxm \
        --display-name "My fine-tuned endpoint" \
        --wait

      ## Query the endpoint
      together chat.completions \
        --model <endpoint-name> \
        --message "user" "What are some fun things to do in New York?"

      ## Delete when done
      tg endpoints delete <endpoint-id>
      ```
    </CodeGroup>
  </Tab>
</Tabs>
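
The polling loops in the SDK examples above run until the endpoint reaches a final state, with no upper bound on how long they wait. As a minimal sketch, the same pattern can be wrapped in a helper that also gives up after a timeout. The name `wait_for_state` and its parameters are hypothetical (they are not part of the Together SDK), and the default timeout is an arbitrary placeholder.

```python
import time
from typing import Callable, Iterable


def wait_for_state(
    fetch_state: Callable[[], str],
    ready: str,
    failed: Iterable[str] = (),
    timeout_s: float = 1800.0,  # arbitrary default, tune for your workload
    poll_interval_s: float = 30.0,
) -> str:
    """Poll fetch_state() until it returns `ready`.

    Raises RuntimeError if a state listed in `failed` is seen, and
    TimeoutError if `ready` is not reached within `timeout_s` seconds.
    """
    failed_states = set(failed)
    deadline = time.monotonic() + timeout_s
    while True:
        state = fetch_state()
        if state == ready:
            return state
        if state in failed_states:
            raise RuntimeError(f"Ended with state: {state}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {state!r} after {timeout_s} seconds")
        time.sleep(poll_interval_s)
```

With the client from the Python example, this could be called as `wait_for_state(lambda: client.endpoints.retrieve(endpoint.id).state, ready="STARTED", failed=("FAILED", "STOPPED"))`. If it raises, remember that the endpoint may still exist and keep billing until you delete it.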

For full endpoint management options, see [Dedicated endpoints](/docs/dedicated-endpoints/overview).

## Running your model locally

To run your model locally, first download it using your fine-tuning job ID:

<CodeGroup>
  ```shell CLI theme={null}
  tg fine-tuning download "ft-bb62e747-b8fc-49a3-985c-f32f7cc6bb04"
  ```

  ```python Python theme={null}
  import os
  from together import Together

  client = Together(api_key=os.environ.get("TOGETHER_API_KEY"))

  client.fine_tuning.download(
      id="ft-bb62e747-b8fc-49a3-985c-f32f7cc6bb04",
      output="my-model/model.tar.zst",
  )
  ```

  ```python Python(v2) theme={null}
  import os
  from together import Together

  client = Together(api_key=os.environ.get("TOGETHER_API_KEY"))

  # Using `with_streaming_response` gives you control to do what you want with the response.
  stream = client.fine_tuning.with_streaming_response.content(
      ft_id="ft-bb62e747-b8fc-49a3-985c-f32f7cc6bb04"
  )

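  # Create the target directory first; open() fails if it doesn't exist.
  os.makedirs("my-model", exist_ok=True)

  with stream as response:
      with open("my-model/model.tar.zst", "wb") as f:
          for chunk in response.iter_bytes():
              f.write(chunk)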
  ```

  ```typescript TypeScript theme={null}
  import Together from 'together-ai';

  const client = new Together({
    apiKey: process.env['TOGETHER_API_KEY'],
  });

  const modelData = await client.fineTuning.content({
    ft_id: 'ft-bb62e747-b8fc-49a3-985c-f32f7cc6bb04',
  });
  ```
</CodeGroup>
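
When you stream the download yourself, an interrupted transfer can leave a truncated archive at the destination path. The sketch below is one way to guard against that: it writes to a temporary `.part` file and renames it only after every chunk has been written. `save_chunks` is a hypothetical helper, not part of the Together SDK; it accepts any iterable of `bytes`, such as `response.iter_bytes()` from the streaming example above.

```python
import os
from pathlib import Path
from typing import Iterable


def save_chunks(chunks: Iterable[bytes], destination: str) -> int:
    """Write chunks to `destination` via a temporary file.

    Returns the number of bytes written. The destination only appears
    once the whole stream has been consumed successfully.
    """
    dest = Path(destination)
    # Create parent directories such as "my-model/" if they are missing.
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")
    written = 0
    try:
        with open(tmp, "wb") as f:
            for chunk in chunks:
                f.write(chunk)
                written += len(chunk)
        # Replace the destination in one step once the file is complete.
        os.replace(tmp, dest)
    except BaseException:
        # Remove the partial file on any failure, then re-raise.
        tmp.unlink(missing_ok=True)
        raise
    return written
```

Inside the `with stream as response:` block, the inner loop would then become `save_chunks(response.iter_bytes(), "my-model/model.tar.zst")`.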

Your model is saved as a `tar.zst` file, either at the path given in `output` or wherever you write the streamed bytes. This is a tar archive compressed with [Zstandard](https://github.com/facebook/zstd), so you'll need to install Zstandard to decompress it.

On macOS, you can install it with Homebrew, then decompress and unpack the archive:

<CodeGroup>
  ```shell Shell theme={null}
  brew install zstd
  cd my-model
  zstd -d model.tar.zst
  tar -xvf model.tar
  cd ..
  ```
</CodeGroup>

Once your archive is decompressed, you should see the following set of files:

```
tokenizer_config.json
special_tokens_map.json
pytorch_model.bin
generation_config.json
tokenizer.json
config.json
```
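
Before loading the model, it can be useful to confirm that extraction produced every file. The following is a minimal sketch using only the standard library; `missing_model_files` is a hypothetical helper, and its default list simply mirrors the files shown above. The exact set may differ for your model, so treat the default as an assumption and pass your own list if needed.

```python
from pathlib import Path
from typing import Iterable, List

# File names copied from the listing above; adjust if your archive differs.
EXPECTED_FILES = (
    "tokenizer_config.json",
    "special_tokens_map.json",
    "pytorch_model.bin",
    "generation_config.json",
    "tokenizer.json",
    "config.json",
)


def missing_model_files(
    model_dir: str, expected: Iterable[str] = EXPECTED_FILES
) -> List[str]:
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files("./my-model")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected files are present.")
```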

These files can be used with various libraries and languages to run your model locally. [Transformers](https://pypi.org/project/transformers/) is a popular Python library for working with pretrained models, and using it with your new model looks like this:

<CodeGroup>
  ```python Python theme={null}
  from transformers import AutoTokenizer, AutoModelForCausalLM
  import torch

  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

  tokenizer = AutoTokenizer.from_pretrained("./my-model")

  model = AutoModelForCausalLM.from_pretrained(
      "./my-model",
      trust_remote_code=True,
  ).to(device)

  input_context = "Space Robots are"
  input_ids = tokenizer.encode(input_context, return_tensors="pt")
  output = model.generate(
      input_ids.to(device),
      max_length=128,
      do_sample=True,  # temperature only takes effect when sampling is enabled
      temperature=0.7,
  ).cpu()
  output_text = tokenizer.decode(output[0], skip_special_tokens=True)

  print(output_text)
  ```
</CodeGroup>

```
Space Robots are a great way to get your kids interested in science. After all, they are the future!
```

If you see generated text like this, your new model is working. The exact wording will vary.

<Check>
  You now have a custom fine-tuned model that you can run completely locally, either on your own machine or on networked hardware of your choice.
</Check>
