LLM-as-a-Judge

LLM-as-a-Judge evaluations automatically score your traces using an LLM as the grader. Set up an evaluator once and it runs continuously over incoming traces, writing a score (e.g. accuracy, helpfulness, toxicity) you can filter, chart, and alert on.

Open it from LLM-as-a-Judge in the left nav. You'll need a connected judge model — see Provider API Keys.

How it works

new trace ──► evaluator (template + judge model) ──► score written back to the trace

An evaluator config ties together a template, a judge model, what to evaluate, and how often. As traces arrive (subject to your filters and sampling), the judge model grades them and the result is stored as a score.
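The flow above can be sketched in a few lines of Python. This is only an illustration of the idea, not the product's implementation: the `Trace` class, the `evaluate` function, the `TEMPLATE` string, and the rule-based `stub_judge` (which stands in for a real judge-model call) are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Trace:
    """Hypothetical, simplified trace record."""
    id: str
    input: str
    output: str
    scores: dict = field(default_factory=dict)


def evaluate(trace: Trace, template: str, judge: Callable[[str], float],
             score_name: str) -> Trace:
    """Render the judge prompt, grade it, and write the score back to the trace."""
    prompt = template.format(input=trace.input, output=trace.output)
    trace.scores[score_name] = judge(prompt)
    return trace


def stub_judge(prompt: str) -> float:
    """Stand-in for a real judge-model call; applies a fixed rule instead."""
    return 1.0 if "Paris" in prompt else 0.0


TEMPLATE = "Question: {input}\nAnswer: {output}\nIs the answer correct?"
trace = evaluate(
    Trace(id="t1", input="Capital of France?", output="Paris"),
    TEMPLATE, stub_judge, score_name="accuracy",
)
print(trace.scores)  # {'accuracy': 1.0}
```

In a real setup the judge would be an LLM call through a connected provider, and the score would be stored by the platform rather than on an in-memory object.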

Key concepts

Concept	Meaning
Evaluator config	A running evaluation: which template, which model, what to target, and the output score name.
Template	The judge prompt + expected output (numeric or categorical). Use a built-in template or create your own.
Judge model	The LLM that grades — any connected provider (OpenAI, Anthropic, Bedrock, Vertex, Google).
Target	What gets evaluated — the trace, a specific observation type, or a generation.
Variable mapping	Maps template variables (e.g. `{{input}}`, `{{output}}`) to fields on the trace/observation.
Sampling	The percentage of matching traces to evaluate — control cost and volume.
Filters	Limit which traces are evaluated (by environment, tags, user, etc.).
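To make the relationship between these concepts concrete, here is a minimal sketch of an evaluator config as a data structure. The class name, field names, allowed target values, and the `example-judge-model` identifier are assumptions made for illustration; the actual config schema may differ.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EvaluatorConfig:
    """Hypothetical shape of an evaluator config; field names are illustrative."""
    template: str                    # the judge prompt with {{variables}}
    judge_model: str                 # identifier of a connected model
    target: str                      # "trace", "observation", or "generation"
    score_name: str                  # name of the score written back
    variable_mapping: dict = field(default_factory=dict)
    sampling: float = 1.0            # fraction of matching traces, 0.0 to 1.0
    filters: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject values that could never describe a valid evaluator.
        if self.target not in {"trace", "observation", "generation"}:
            raise ValueError(f"unknown target: {self.target!r}")
        if not 0.0 <= self.sampling <= 1.0:
            raise ValueError("sampling must be between 0.0 and 1.0")


config = EvaluatorConfig(
    template="Rate the helpfulness of {{output}} for {{input}}.",
    judge_model="example-judge-model",  # placeholder, not a real model name
    target="generation",
    score_name="helpfulness",
    variable_mapping={"input": "trace.input", "output": "generation.output"},
    sampling=0.1,
    filters={"environment": "production"},
)
```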

Setting up an evaluator

Choose a template

Pick a built-in template (for example task accuracy, or a RAG metric) or create a custom one with your judge prompt and the score it should output (numeric or categorical).

Pick the judge model

Select a connected model to act as the grader, and set its parameters.

Map variables

Map the template's variables to the trace/observation fields they should read, for example `{{input}}` → the trace input and `{{output}}` → the generation output.
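Variable mapping amounts to substituting each `{{variable}}` with a value looked up on the trace or observation. The sketch below shows one way this could work, assuming dotted field paths such as `trace.input`; the path syntax and the `resolve` and `render` helpers are hypothetical and not taken from the product.

```python
import re


def resolve(path: str, data: dict):
    """Follow a dotted path such as 'trace.input' through nested dicts."""
    current = data
    for key in path.split("."):
        current = current[key]
    return current


def render(template: str, mapping: dict, data: dict) -> str:
    """Replace each {{name}} with the field that `mapping` points it at."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        return str(resolve(mapping[name], data))

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


data = {
    "trace": {"input": "What is 2 + 2?"},
    "generation": {"output": "4"},
}
mapping = {"input": "trace.input", "output": "generation.output"}
prompt = render("Q: {{input}}\nA: {{output}}", mapping, data)
print(prompt)
```

An unmapped variable raises a `KeyError` here, which surfaces a misconfigured mapping early instead of sending a half-filled prompt to the judge.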

Scope and sample

Add filters (environment, tags, …) to target the right traces, and set a sampling rate. Optionally add a delay so evaluation waits until a trace is complete.
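As a rough illustration of how filtering and sampling combine, the sketch below first checks a trace against the filters and then applies a deterministic, hash-based sample. Both helpers and the filter keys are hypothetical; the product may select traces differently (for example with random sampling).

```python
import hashlib


def matches(trace: dict, filters: dict) -> bool:
    """True if the trace satisfies every configured filter."""
    env = filters.get("environment")
    env_ok = env is None or trace.get("environment") == env
    # Every required tag must be present on the trace.
    tags_ok = set(filters.get("tags", [])) <= set(trace.get("tags", []))
    return env_ok and tags_ok


def sampled(trace_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` of all trace IDs."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # in [0, 1)
    return bucket < rate


filters = {"environment": "production", "tags": ["checkout"]}
trace = {"id": "t-123", "environment": "production", "tags": ["checkout", "beta"]}

should_evaluate = matches(trace, filters) and sampled(trace["id"], rate=0.2)
```

Hashing the trace ID means the same trace always gets the same decision, which keeps re-runs reproducible.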

Run

Save the evaluator. New matching traces are scored automatically; results appear as scores on each trace and in score charts.

Sampling keeps cost in check. Evaluating 100% of traffic with a strong judge model can be expensive — start with a sample (e.g. 10–20%), confirm the scores are useful, then scale up.
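A back-of-the-envelope estimate shows why the sampling rate matters. All numbers below (traffic, tokens per evaluation, and price) are made-up assumptions for illustration; substitute your own volumes and your provider's actual pricing.

```python
def monthly_judge_cost(traces_per_day: int, sampling_rate: float,
                       tokens_per_eval: int, usd_per_1k_tokens: float,
                       days: int = 30) -> float:
    """Estimate judge-model spend: evaluations x tokens x price."""
    evaluations = traces_per_day * sampling_rate * days
    return evaluations * tokens_per_eval / 1000 * usd_per_1k_tokens


# Hypothetical workload: 50,000 traces/day, 1,500 tokens per evaluation,
# and a judge model priced at $0.01 per 1,000 tokens.
full = monthly_judge_cost(50_000, 1.0, 1_500, 0.01)
sample = monthly_judge_cost(50_000, 0.1, 1_500, 0.01)
print(f"100% of traffic: ${full:,.0f}/month")
print(f"10% sample:      ${sample:,.0f}/month")
```

Under these assumed numbers, cost scales linearly with the sampling rate, so a 10% sample costs a tenth of evaluating everything.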

Evaluating offline (experiments)

You can also use evaluators to score experiment runs against a dataset, so you compare prompt/model versions on consistent inputs before shipping.
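Conceptually, an offline experiment runs each candidate over the same dataset and grades every output with the same evaluator. The sketch below illustrates that comparison with stubbed components: `DATASET`, `exact_match_judge`, and the two `version_*` functions are hypothetical stand-ins for a real dataset, judge model, and prompt/model versions.

```python
from statistics import mean

# Tiny stand-in dataset with fixed inputs and expected answers.
DATASET = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "3 * 3", "expected": "9"},
]


def exact_match_judge(output: str, expected: str) -> float:
    """Stand-in for a judge model: 1.0 on exact match, else 0.0."""
    return 1.0 if output.strip() == expected else 0.0


def run_experiment(generate, dataset, judge) -> float:
    """Average judge score for one candidate over the whole dataset."""
    return mean(judge(generate(item["input"]), item["expected"])
                for item in dataset)


def version_a(question: str) -> str:
    """Stub for candidate A: answers both dataset items correctly."""
    return {"2 + 2": "4", "3 * 3": "9"}.get(question, "")


def version_b(question: str) -> str:
    """Stub for candidate B: always answers '4'."""
    return "4"


results = {
    "version_a": run_experiment(version_a, DATASET, exact_match_judge),
    "version_b": run_experiment(version_b, DATASET, exact_match_judge),
}
print(results)  # {'version_a': 1.0, 'version_b': 0.5}
```

Because both candidates see identical inputs and the same judge, any difference in the averages comes from the candidates themselves.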

Next steps

Scores — where evaluation results land.
Experiments — offline evaluation against datasets.
Human Annotation — calibrate judges against human labels.