Introduction
This guide explains how to perform evaluations using the Together AI UI. For a comprehensive guide with detailed parameter descriptions, see AI Evaluations.

Step 1: Upload Your Dataset
Navigate to https://api.together.ai/evaluations and click “Create Evaluation”.
Preview your dataset content in the “Dataset Preview” section.

Step 2: Customize Your Evaluation Job
We support three evaluation types:
- Classify – Categorizes input into one of the provided categories
- Score – Evaluates input and produces a score within a specified range
- Compare – Compares responses from two models to determine which performs better according to given criteria
Judge Configuration
The judge object contains two required fields:
- judge model – (string) The model used for evaluation
- system template – (Jinja template) Provides guidance for the judge to assess the data
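As a rough illustration, the judge settings can be pictured as a small mapping that holds the two required fields. The key names `model` and `system_template`, the placeholder model name, and the helper `missing_judge_fields` are assumptions made for this sketch; they are not confirmed names from the UI or the API.

```python
# Hypothetical key names for the two required judge fields described above.
REQUIRED_JUDGE_FIELDS = ("model", "system_template")


def missing_judge_fields(judge: dict) -> list:
    """Return the required judge fields that are absent or empty."""
    return [name for name in REQUIRED_JUDGE_FIELDS if not judge.get(name)]


judge = {
    # Placeholder; substitute a model that is available to your account.
    "model": "your-judge-model",
    # Guidance for the judge, written as a Jinja template (plain text also works).
    "system_template": "You are a strict reviewer. Assess whether the response is helpful.",
}

print(missing_judge_fields(judge))
print(missing_judge_fields({"model": "your-judge-model"}))
```

The first call reports no missing fields; the second reports that the system template is absent.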

Model Configuration Parameters
Classify
- labels – (list of strings) Categories for input classification. For each category, you can specify whether it’s considered ‘pass’ or ‘fail’ for statistics computation
- model_to_evaluate – Configuration for the model being evaluated
Score
- min_score – (float) Minimum score the judge can assign
- max_score – (float) Maximum score the judge can assign
- model_to_evaluate – Configuration for the model being evaluated
Compare
- Only requires judge setup and two model configurations for comparison
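The per-type parameters above could be checked before submitting a job with something like the following sketch. The function `validate_eval_params`, the keys `model_a` and `model_b`, the requirement that `min_score` be strictly below `max_score`, and treating `model_to_evaluate` as mandatory are all assumptions for illustration, not documented behavior.

```python
def validate_eval_params(eval_type: str, params: dict) -> list:
    """Collect human-readable problems with the type-specific parameters."""
    problems = []
    if eval_type == "classify":
        labels = params.get("labels")
        if not isinstance(labels, list) or not labels:
            problems.append("labels must be a non-empty list")
        elif not all(isinstance(label, str) for label in labels):
            problems.append("every label must be a string")
    elif eval_type == "score":
        low, high = params.get("min_score"), params.get("max_score")
        if not isinstance(low, (int, float)) or not isinstance(high, (int, float)):
            problems.append("min_score and max_score must be numbers")
        elif low >= high:  # assumption: the score range must be non-empty
            problems.append("min_score must be below max_score")
    elif eval_type == "compare":
        # Hypothetical key names for the two model configurations.
        for key in ("model_a", "model_b"):
            if key not in params:
                problems.append(f"{key} is required")
    else:
        problems.append(f"unknown evaluation type: {eval_type}")
    # Assumption: Classify and Score always need the model under evaluation.
    if eval_type in ("classify", "score") and "model_to_evaluate" not in params:
        problems.append("model_to_evaluate is required")
    return problems


print(validate_eval_params("classify", {"labels": ["good", "bad"], "model_to_evaluate": "response"}))
print(validate_eval_params("score", {"min_score": 10, "max_score": 1, "model_to_evaluate": "response"}))
print(validate_eval_params("compare", {"model_a": "response_a"}))
```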
Model Evaluation Configuration
Choose whether to evaluate existing data or generate new responses:
- “Configure” – Generate data using the model for evaluation
- “Field name” – Data required for evaluation is already present in your dataset
Select “Configure” when generating new responses for evaluation. The configuration object has the following fields:
- model_name – (string) One of our supported models
- model_source – (string) One of: “serverless”, “dedicated”, or “external”
- external_api_token – Optional; required when model_source = "external". If you select the external model source, use this to provide the API bearer authentication token (e.g., an OpenAI token)
- external_base_url – Optional; when using an external model source, you can specify your own base URL (e.g., "https://api.openai.com"). The API must be OpenAI chat/completions-compatible
- system_template – (Jinja2 template) An instruction for generation, e.g., “You are a helpful assistant.” (see Understanding Templates)
- input_template – (Jinja2 template) Input format, e.g., "{{prompt}}" (see Understanding Templates)
- max_tokens – (integer) Maximum tokens for generation
- temperature – (float) Temperature setting for generation
Select “Field name” when evaluating pre-existing data from your dataset. Simply specify the column name containing the data to evaluate.
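One way to picture the two options is as two shapes of the same setting: a mapping when responses are generated, and a plain column name when the data already exists. The sketch below assumes that representation. The helper `evaluation_mode`, the placeholder model name, the column name `response`, and the numeric values are illustrative only.

```python
def evaluation_mode(model_to_evaluate) -> str:
    """Return "existing" for a column name and "generate" for a configuration mapping."""
    if isinstance(model_to_evaluate, str):
        return "existing"
    if isinstance(model_to_evaluate, dict):
        return "generate"
    raise TypeError("expected a column name (str) or a configuration (dict)")


# "Configure": the model generates responses that the judge then evaluates.
generated = {
    "model_name": "your-model-name",  # placeholder, not a real model ID
    "model_source": "serverless",
    "system_template": "You are a helpful assistant.",
    "input_template": "{{prompt}}",  # filled from the dataset's prompt column
    "max_tokens": 512,   # arbitrary illustrative value
    "temperature": 0.7,  # arbitrary illustrative value
}

# "Field name": responses are already stored in a dataset column.
existing = "response"  # hypothetical column name

print(evaluation_mode(generated))
print(evaluation_mode(existing))
```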

Using external models
When you set model_source = "external" (for either the judge or the model being evaluated):
- Enter a supported shortcut in the model field (e.g., openai/gpt-5). See Supported External Models.
- Provide external_api_token – use your API bearer token for the external provider (e.g., OpenAI token).
- Optionally set external_base_url if using a custom endpoint (e.g., https://api.openai.com). The API must be OpenAI chat/completions-compatible.
For a dedicated endpoint, set model_source = "dedicated" and paste your endpoint ID into the model field. See Dedicated Inference.
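The source-specific rules above can be summarized in a small check. This is a minimal sketch: the function `check_model_source` is hypothetical, and flagging `external_base_url` on non-external sources is an assumption of this example rather than documented behavior.

```python
ALLOWED_SOURCES = {"serverless", "dedicated", "external"}


def check_model_source(config: dict) -> list:
    """Return problems with the model_source-related fields of a model configuration."""
    source = config.get("model_source")
    if source not in ALLOWED_SOURCES:
        return [f"model_source must be one of {sorted(ALLOWED_SOURCES)}"]
    problems = []
    if source == "external" and not config.get("external_api_token"):
        problems.append("external_api_token is required for external models")
    # Assumption: a base URL is only meaningful for external providers.
    if source != "external" and config.get("external_base_url"):
        problems.append("external_base_url only applies to external models")
    return problems


external = {
    "model_name": "openai/gpt-5",
    "model_source": "external",
    # Placeholder; in practice load the token from a secret store, not source code.
    "external_api_token": "<your-provider-token>",
    "external_base_url": "https://api.openai.com",
}
dedicated = {
    "model_name": "<your-endpoint-id>",  # placeholder endpoint ID
    "model_source": "dedicated",
}

print(check_model_source(external))
print(check_model_source(dedicated))
print(check_model_source({"model_name": "openai/gpt-5", "model_source": "external"}))
```

The first two configurations pass; the third is flagged because it selects an external source without a token.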
Step 3: Monitor Job Progress
Wait for your evaluation job to complete.
Step 4: Review Results
Once complete, you can:
- Preview statistics and responses in the Dataset Preview
- Download the result file using the “Download” button
