Overview
Large language models can serve as judges to evaluate other language models or assess different types of content. You can simply describe in detail how you want the LLM-as-a-Judge to assess your inputs, and it will perform this evaluation for you. For example, they can identify and flag content containing harmful material, personal information, or other policy-violating elements. Another common use case is comparing the quality of two LLMs, or configurations of the same model (for example prompts) to determine which performs better on your specific task. Our Evaluations service allows you to easily submit tasks for assessment by a judge language model. With Evaluations, you can:- Compare models and configurations: Understand which setup works best for your task
- Measure performance: Use a variety of metrics to score your modelβs responses
- Filter datasets: Apply LLM-as-a-Judge to filter and curate your datasets
- Gain insights: Understand where your model excels and where it needs improvement
- Build with confidence: Ensure your models meet quality standards before deploying them to production
Quickstart
To launch evaluations using the UI, please refer to: AI Evaluations UI For the full API specification, please refer to docs Get started with the Evaluations API in just a few steps. This example shows you how to run a simple evaluation.1. Prepare Your Dataset
First, youβll need a dataset to evaluate your model on. The dataset should be in JSONL or CSV format. Each line must contain the same fields. Example JSONL dataset:- CSV: math_dataset.csv
- JSONL: math_dataset.jsonl
2. Upload Your Dataset
You can use our UI, API, or CLI. Make sure to specify--purpose eval
to ensure the data is processed correctly.
3. Run the Evaluation
We support three evaluation types, each designed for specific assessment needs:classify
β Classifies the input into one of the provided categories. Returns one of the predefined classes.score
β Takes an input and produces a score within a specified range. Returns a numerical score.compare
β Takes responses from two models and determines which one is better according to a given criterion.
Evaluation Type: Classify
Purpose: Categorizes input into predefined classes (e.g., βToxicβ vs βNon-toxicβ) Parameters:- judge (required): Configuration for the judge model
model
β The model to use for evaluationmodel_source
β One of: βserverlessβ, βdedicatedβ, or βexternalβsystem_template
β Jinja2 template providing guidance for the judge (see Understanding Templates)- Python client: pass these as
judge_model_name
andjudge_model_source
- labels (required): List of strings defining the classification categories
- pass_labels (optional): List of labels considered as βpassingβ for statistics
- model_to_evaluate (required): Configuration for the model being evaluated
- Can be either:
- A string referencing a column in your dataset (e.g.,
"prompt"
) - A model configuration object (see below)
- A string referencing a column in your dataset (e.g.,
- Can be either:
- input_data_file_path (required): File ID of your uploaded dataset
model
β Choose from serverless models or LoRA serverless; formodel_source = "dedicated"
, use your dedicated endpointmodel_source
β Literal: βserverlessβ | βdedicatedβ | βexternalβ (required)external_api_token
β Optional; required whenmodel_source
= βexternalβsystem_template
β Jinja2 template for generation instructions (see Understanding Templates)input_template
β Jinja2 template for formatting input (see Understanding Templates)max_tokens
β Maximum tokens for generationtemperature
β Temperature setting for generation
model_source = "serverless"
supports any Together serverless model, including LoRA serverless variants and LoRA fine-tuned models. See LoRA serverless for supported models.model_source = "dedicated"
targets a user-launched dedicated endpoint (must be created before running evaluations).model_source = "external"
is under development and will be available soon;
Evaluation Type: Score
Purpose: Rates input on a numerical scale (e.g., quality score from 1-10) Parameters:- judge (required): Configuration for the judge model
model
β The model to use for evaluationmodel_source
β One of: βserverlessβ, βdedicatedβ, or βexternalβsystem_template
β Jinja2 template providing guidance for the judge (see Understanding Templates)
- min_score (required): Minimum score the judge can assign (float)
- max_score (required): Maximum score the judge can assign (float)
- pass_threshold (optional): Score at or above which is considered βpassingβ
- model_to_evaluate (required): Configuration for the model being evaluated
- Can be either:
- A string referencing a column in your dataset
- A model configuration object (same structure as in Classify)
- Can be either:
- input_data_file_path (required): File ID of your uploaded dataset
Evaluation Type: Compare
Purpose: Determines which of two models performs better on the same task Parameters:- judge (required): Configuration for the judge model
model
β The model to use for evaluationmodel_source
β One of: βserverlessβ, βdedicatedβ, or βexternalβsystem_template
β Jinja2 template providing guidance for comparison (see Understanding Templates)- Python client: pass these as
judge_model_name
,judge_model_source
, and optionaljudge_external_api_token
- model_a (required): Configuration for the first model
- Can be either:
- A string referencing a column in your dataset
- A model configuration object
- Can be either:
- model_b (required): Configuration for the second model
- Can be either:
- A string referencing a column in your dataset
- A model configuration object
- Can be either:
- input_data_file_path (required): File ID of your uploaded dataset
Text
"file-95c8f0a3-e8cf-43ea-889a-e79b1f1ea1b9"
5. View Results
We provide comprehensive results without omitting lines from the original file unless errors occur (up to 30% may be omitted in error cases).Result Formats by Evaluation Type
Classify Results (ClassifyEvaluationResult
):
Field | Type | Description |
---|---|---|
error | string | Present only when job fails |
label_counts | object<string, int> | Count of each label assigned (e.g., {"positive": 45, "negative": 30} ) |
pass_percentage | float | Percentage of samples with labels in pass_labels |
generation_fail_count | int | Failed generations when using model configuration |
judge_fail_count | int | Samples the judge couldnβt evaluate |
invalid_label_count | int | Judge responses that couldnβt be parsed into valid labels |
result_file_id | string | File ID for detailed row-level results |
ScoreEvaluationResult
):
Field | Type | Description |
---|---|---|
error | string | Present only on failure |
aggregated_scores.mean_score | float | Mean of all numeric scores |
aggregated_scores.std_score | float | Standard deviation of scores |
aggregated_scores.pass_percentage | float | Percentage of scores meeting pass threshold |
failed_samples | int | Total samples that failed processing |
invalid_score_count | int | Scores outside allowed range or unparseable |
generation_fail_count | int | Failed generations when using model configuration |
judge_fail_count | int | Samples the judge couldnβt evaluate |
result_file_id | string | File ID for per-sample scores and feedback |
CompareEvaluationResult
):
Field | Type | Description |
---|---|---|
error | string | Present only on failure |
A_wins | int | Count where Model A was preferred |
B_wins | int | Count where Model B was preferred |
Ties | int | Count where judge found no clear winner |
generation_fail_count | int | Failed generations from either model |
judge_fail_count | int | Samples the judge couldnβt evaluate |
result_file_id | string | File ID for detailed pairwise decisions |
Downloading Result Files
π Using result_file_id
Pass any result_file_id
to the Files API to download a complete report for auditing or deeper analysis.
Each line in the result_file_id
has a 'evaluation_status'
field that can contain 'True'
or 'False'
that indicates if the line was processed without any issues.
You can download the result file using the UI, API, or CLI
- Original input data
- Generated responses (if applicable)
- Judgeβs decision and feedback
evaluation_status
field indicating if processing succeeded (True
) or failed (False
)
Understanding Templates
Templates are used throughout the Evaluations API to dynamically inject data from your dataset into prompts. Bothsystem_template
and input_template
parameters support Jinja2 templating syntax.
Jinja2 templates allow you to inject columns from the dataset into the system_template
or input_template
for either the judge or the generation model.
Examples
- You can specify a reference answer for the judge:
"Please use the reference answer: {{reference_answer_column_name}}"
- You can provide a separate instruction for generation for each example:
"Please use the following guidelines: {{guidelines_column_name}}"
- You can specify any column(s) as input for the model being evaluated:
"Continue: {{prompt_column_name}}"
- You can also reference nested fields from your JSON input:
"{{column_name.field_name}}"
- And many more options are supported.
Basic Example
If your dataset contains:Nested Data Example
For complex structures:Best Practices
- Provide clear judge instructions: Write detailed, structured system prompts with examples and explicit rules for the judge to follow
- Choose appropriate judge models: Use larger, more capable models as judges than the models being evaluated
- Test your templates: Verify that your Jinja2 templates correctly format your data before running large evaluations
Python
Example: Classification System Prompt
Hereβs an example of a well-structured system prompt for a classify evaluation that determines whether model responses are harmful:- Clear role definition: Explicitly states the evaluatorβs single purpose
- Structured procedure: Step-by-step evaluation process
- Specific criteria: Well-defined categories with examples
- Decision rules: Clear instructions for edge cases