Learn how to run LLM-as-a-Judge evaluations
When uploading your evaluation dataset, set `--purpose eval` to ensure the data is processed correctly.
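As an illustration, here is a minimal upload sketch using the Python SDK. It assumes the SDK's `files.upload` accepts a `purpose="eval"` argument mirroring the CLI's `--purpose eval` flag, and `eval_data.jsonl` is a placeholder filename:

```python
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

# Assumption: purpose="eval" mirrors the CLI's --purpose eval flag.
uploaded = client.files.upload(file="eval_data.jsonl", purpose="eval")
print(uploaded.id)  # e.g. "file-..."; reference this ID when creating the evaluation
```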
Three evaluation types are available:

- `classify`: Classifies the input into one of the provided categories. Returns one of the predefined classes.
- `score`: Takes an input and produces a score within a specified range. Returns a numerical score.
- `compare`: Takes responses from two models and determines which one is better according to a given criterion.
Every evaluation needs a judge. For `classify` evaluations, the judge configuration takes:

- `model_name`: The model to use for evaluation
- `system_template`: Jinja2 template providing guidance for the judge (see Understanding Templates)
Instead of evaluating pre-generated outputs, you can supply a model configuration so responses are generated as part of the evaluation. The model configuration takes:

- `model_name`: One of our supported models
- `system_template`: Jinja2 template for generation instructions (see Understanding Templates)
- `input_template`: Jinja2 template for formatting input (see Understanding Templates)
- `max_tokens`: Maximum tokens for generation
- `temperature`: Temperature setting for generation
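To make the shape concrete, here is a hypothetical classify configuration combining the judge and model-configuration parameters above. The surrounding key names (`type`, `judge`, `model_to_evaluate`) and all values are illustrative assumptions; only the inner parameter names come from the lists above.

```python
# Hypothetical classify evaluation configuration (shape and values are assumptions).
classify_config = {
    "type": "classify",
    "judge": {
        "model_name": "meta-llama/Llama-3.3-70B-Instruct-Turbo",   # judge model (example)
        "system_template": "Classify the response using: {{guidelines_column_name}}",
    },
    "model_to_evaluate": {                                          # model configuration
        "model_name": "meta-llama/Llama-3.1-8B-Instruct-Turbo",    # one of the supported models (example)
        "system_template": "You are a helpful assistant.",
        "input_template": "Continue: {{prompt_column_name}}",
        "max_tokens": 512,
        "temperature": 0.7,
    },
}
```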
For `score` evaluations, the judge configuration takes:

- `model_name`: The model to use for evaluation
- `system_template`: Jinja2 template providing guidance for the judge (see Understanding Templates)
For `compare` evaluations, the judge configuration takes:

- `model_name`: The model to use for evaluation
- `system_template`: Jinja2 template providing guidance for comparison (see Understanding Templates)
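A compare judge differs mainly in that its `system_template` spells out the comparison criterion. A hypothetical sketch (key names and values are assumptions):

```python
# Hypothetical compare evaluation configuration (shape and values are assumptions).
compare_config = {
    "type": "compare",
    "judge": {
        "model_name": "meta-llama/Llama-3.3-70B-Instruct-Turbo",  # judge model (example)
        "system_template": (
            "Compare the two responses and pick the one that better follows "
            "these guidelines: {{guidelines_column_name}}"
        ),
    },
}
```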
When a `classify` evaluation finishes, it returns a result with the following fields (`ClassifyEvaluationResult`):
| Field | Type | Description |
|---|---|---|
| error | string | Present only when job fails |
| label_counts | object<string, int> | Count of each label assigned (e.g., `{"positive": 45, "negative": 30}`) |
| pass_percentage | float | Percentage of samples with labels in `pass_labels` |
| generation_fail_count | int | Failed generations when using model configuration |
| judge_fail_count | int | Samples the judge couldn't evaluate |
| invalid_label_count | int | Judge responses that couldn't be parsed into valid labels |
| result_file_id | string | File ID for detailed row-level results |
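For orientation, here is an illustrative `ClassifyEvaluationResult` payload echoing the fields above (all values are made up):

```python
# Illustrative ClassifyEvaluationResult (values are made up).
classify_result = {
    "label_counts": {"positive": 45, "negative": 30},  # example counts from the table above
    "pass_percentage": 60.0,                            # 45 of 75 samples fell in pass_labels
    "generation_fail_count": 0,
    "judge_fail_count": 2,
    "invalid_label_count": 1,
    "result_file_id": "file-...",                       # use with the Files API (see below)
}
```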
A `score` evaluation returns a result with the following fields (`ScoreEvaluationResult`):
| Field | Type | Description |
|---|---|---|
| error | string | Present only on failure |
| aggregated_scores.mean_score | float | Mean of all numeric scores |
| aggregated_scores.std_score | float | Standard deviation of scores |
| aggregated_scores.pass_percentage | float | Percentage of scores meeting pass threshold |
| failed_samples | int | Total samples that failed processing |
| invalid_score_count | int | Scores outside allowed range or unparseable |
| generation_fail_count | int | Failed generations when using model configuration |
| judge_fail_count | int | Samples the judge couldn't evaluate |
| result_file_id | string | File ID for per-sample scores and feedback |
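An illustrative `ScoreEvaluationResult`, showing the nesting under `aggregated_scores` (values are made up):

```python
# Illustrative ScoreEvaluationResult (values are made up).
score_result = {
    "aggregated_scores": {
        "mean_score": 7.4,
        "std_score": 1.8,
        "pass_percentage": 72.0,
    },
    "failed_samples": 3,          # invalid + generation + judge failures below
    "invalid_score_count": 1,
    "generation_fail_count": 0,
    "judge_fail_count": 2,
    "result_file_id": "file-...",
}
```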
A `compare` evaluation returns a result with the following fields (`CompareEvaluationResult`):
| Field | Type | Description |
|---|---|---|
| error | string | Present only on failure |
| A_wins | int | Count where Model A was preferred |
| B_wins | int | Count where Model B was preferred |
| Ties | int | Count where judge found no clear winner |
| generation_fail_count | int | Failed generations from either model |
| judge_fail_count | int | Samples the judge couldn't evaluate |
| result_file_id | string | File ID for detailed pairwise decisions |
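One common way to summarize a compare run is a simple preference rate over the judged samples; here is a sketch with made-up counts:

```python
# Illustrative CompareEvaluationResult and a simple summary (values are made up).
compare_result = {"A_wins": 52, "B_wins": 31, "Ties": 17,
                  "generation_fail_count": 0, "judge_fail_count": 0,
                  "result_file_id": "file-..."}

judged = compare_result["A_wins"] + compare_result["B_wins"] + compare_result["Ties"]
a_rate = compare_result["A_wins"] / judged
print(f"Model A preferred in {a_rate:.0%} of {judged} judged samples")  # 52% of 100
```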
Pass the `result_file_id` to the Files API to download a complete report for auditing or deeper analysis. Each line in the result file has an `evaluation_status` field that can contain `True` or `False`, indicating whether the line was processed without any issues.
You can download the result file using the UI, API, or CLI; in each case, the per-line `evaluation_status` field tells you whether processing succeeded (`True`) or failed (`False`).
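Once the report is downloaded (for example as `results.jsonl`, a placeholder name), you can split rows by `evaluation_status`; a minimal sketch:

```python
import json

passed, failed = [], []
with open("results.jsonl") as f:          # downloaded via the Files API, UI, or CLI
    for line in f:
        row = json.loads(line)
        # evaluation_status is 'True' when the line was processed without issues.
        if str(row.get("evaluation_status")) == "True":
            passed.append(row)
        else:
            failed.append(row)

print(f"{len(passed)} rows evaluated cleanly, {len(failed)} need attention")
```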
Understanding Templates

The `system_template` and `input_template` parameters support Jinja2 templating syntax. Jinja2 templates allow you to inject columns from the dataset into the `system_template` or `input_template` for either the judge or the generation model. For example:

- `"Please use the reference answer: {{reference_answer_column_name}}"`
- `"Please use the following guidelines: {{guidelines_column_name}}"`
- `"Continue: {{prompt_column_name}}"`

Nested fields can be referenced with dot notation, e.g. `"{{column_name.field_name}}"`.