Introduction

This guide explains how to perform evaluations using the Together AI UI. For a comprehensive guide with detailed parameter descriptions, see AI Evaluations.

Step 1: Upload Your Dataset

Navigate to https://api.together.ai/evaluations and click “Create Evaluation”.
Upload your dataset or select one from your library.
Preview your dataset content in the “Dataset Preview” section.
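
If you build the dataset programmatically before uploading it, a minimal sketch might look like the following. It assumes the upload accepts JSON Lines (one JSON object per line); this guide does not list the accepted formats, so confirm that in the upload dialog. The column names `prompt` and `response` and the file name are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(rows, path):
    """Write one JSON object per line, requiring every row to share the same columns."""
    expected = set(rows[0])
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            if set(row) != expected:
                raise ValueError(f"inconsistent columns: {sorted(row)}")
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return path


# Hypothetical rows; the column names are placeholders for your own data.
rows = [
    {"prompt": "What is 2 + 2?", "response": "4"},
    {"prompt": "Name a primary color.", "response": "Red"},
]
dataset_path = write_jsonl(rows, Path(tempfile.gettempdir()) / "eval_dataset.jsonl")
```

Keeping the columns identical across rows makes it easier to reference a column by name later in the evaluation setup.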

Step 2: Customize Your Evaluation Job

We support three evaluation types:
  • Classify – Categorizes input into one of the provided categories
  • Score – Evaluates input and produces a score within a specified range
  • Compare – Compares responses from two models to determine which performs better according to given criteria

Judge Configuration

The judge object contains two required fields:
  • judge model – (string) The model used for evaluation
  • system template – (Jinja2 template) Instructions that tell the judge how to assess the data
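
As an illustration, the two required judge fields can be checked locally before you fill in the form. This is a sketch only: the key names `judge_model` and `system_template` mirror the field labels above and are assumptions, and the model ID is a placeholder rather than a real model.

```python
# Hypothetical key names that mirror the two required judge fields.
REQUIRED_JUDGE_FIELDS = ("judge_model", "system_template")


def missing_judge_fields(judge):
    """Return the required fields that are absent or blank."""
    missing = []
    for name in REQUIRED_JUDGE_FIELDS:
        value = judge.get(name)
        if not isinstance(value, str) or not value.strip():
            missing.append(name)
    return missing


judge = {
    "judge_model": "example-org/example-judge-model",  # placeholder, not a real model ID
    "system_template": "You are a strict grader. Assess the answer for factual accuracy.",
}
problems = missing_judge_fields(judge)  # empty list when both fields are set
```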

Model Configuration Parameters

Classify

  • labels – (list of strings) Categories for input classification. For each category, you can specify whether it counts as a ‘pass’ or a ‘fail’ when statistics are computed
  • model_to_evaluate – Configuration for the model being evaluated
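
To make the pass/fail mapping concrete, the sketch below computes a pass rate from a list of judge-assigned labels. The label names are hypothetical, and a pass rate is only one statistic that such a mapping could feed; this guide does not specify which statistics the platform reports.

```python
from collections import Counter

# Hypothetical labels, each marked as pass (True) or fail (False).
label_is_pass = {"helpful": True, "unhelpful": False, "harmful": False}


def pass_rate(judge_labels, label_is_pass):
    """Fraction of judged rows whose label is marked as a pass."""
    if not judge_labels:
        raise ValueError("no labels to summarize")
    counts = Counter(judge_labels)
    unknown = set(counts) - set(label_is_pass)
    if unknown:
        raise ValueError(f"labels not in the configured list: {sorted(unknown)}")
    passed = sum(n for label, n in counts.items() if label_is_pass[label])
    return passed / len(judge_labels)


rate = pass_rate(["helpful", "helpful", "unhelpful", "harmful"], label_is_pass)
```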

Score

  • min_score – (float) Minimum score the judge can assign
  • max_score – (float) Maximum score the judge can assign
  • model_to_evaluate – Configuration for the model being evaluated
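
The score bounds can be sanity-checked in the same spirit. The helpers below are hypothetical and not part of any Together AI library: one rejects a score outside the configured range, the other rescales a score to the interval from 0 to 1 so that results from jobs with different ranges can be compared.

```python
def validate_score(score, min_score, max_score):
    """Return the score as a float, or raise if it falls outside [min_score, max_score]."""
    if min_score >= max_score:
        raise ValueError("min_score must be smaller than max_score")
    if not min_score <= score <= max_score:
        raise ValueError(f"score {score} is outside [{min_score}, {max_score}]")
    return float(score)


def normalize_score(score, min_score, max_score):
    """Rescale a valid score linearly so min_score maps to 0.0 and max_score to 1.0."""
    valid = validate_score(score, min_score, max_score)
    return (valid - min_score) / (max_score - min_score)


midpoint = normalize_score(5.5, min_score=1.0, max_score=10.0)
```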

Compare

  • Requires only the judge configuration and two model configurations to compare

Model Evaluation Configuration

Choose whether to evaluate existing data or generate new responses:
  • “No” – Data required for evaluation is already present in your dataset
  • “Yes” – Generate data using the model for evaluation

Option 1: Column Reference (String)

Use this when the data to evaluate already exists in your dataset (the “No” choice above). Specify the name of the column that contains the data to evaluate.

Option 2: Model Object

Use this when generating new responses for evaluation (the “Yes” choice above). The object requires:
  • model_name – (string) One of our supported models
  • system_template – (Jinja2 template) An instruction for generation, e.g., “You are a helpful assistant.” (see Understanding Templates)
  • input_template – (Jinja2 template) Input format, e.g., "{{prompt}}" (see Understanding Templates)
  • max_tokens – (integer) Maximum tokens for generation
  • temperature – (float) Temperature setting for generation
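
The two options can be pictured as one setting that is either a string or an object. The sketch below uses the field names listed above; the model ID, token limit, and temperature are placeholder values. The `render_simple` helper is a stand-in that only substitutes `{{name}}` variables and is not a Jinja2 implementation, so it ignores filters, loops, and conditionals.

```python
import re

REQUIRED_MODEL_FIELDS = (
    "model_name", "system_template", "input_template", "max_tokens", "temperature",
)


def describe_source(model_to_evaluate):
    """Report whether responses come from a dataset column or from a model."""
    if isinstance(model_to_evaluate, str):
        return f"read responses from column '{model_to_evaluate}'"
    missing = [f for f in REQUIRED_MODEL_FIELDS if f not in model_to_evaluate]
    if missing:
        raise ValueError(f"model object is missing: {missing}")
    return f"generate responses with {model_to_evaluate['model_name']}"


def render_simple(template, row):
    """Replace {{ name }} placeholders with values from a dataset row (variables only)."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: str(row[m.group(1)]), template)


model_object = {
    "model_name": "example-org/example-model",  # placeholder, not a real model ID
    "system_template": "You are a helpful assistant.",
    "input_template": "{{prompt}}",
    "max_tokens": 512,    # placeholder value
    "temperature": 0.7,   # placeholder value
}
row = {"prompt": "What is 2 + 2?"}  # hypothetical dataset row
user_message = render_simple(model_object["input_template"], row)
```

Rendering one row by hand like this is a quick way to confirm that the placeholders in `input_template` match the column names in your dataset.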

Step 3: Monitor Job Progress

Wait for your evaluation job to complete.

Step 4: Review Results

Once complete, you can:
  • Preview statistics and responses in the Dataset Preview
  • Download the result file using the “Download” button