LLM-as-a-Judge
LLM-as-a-Judge evaluations automatically score your traces using an LLM as the grader. Set up an evaluator once and it runs continuously over incoming traces, writing a score (e.g. accuracy, helpfulness, toxicity) you can filter, chart, and alert on.
Open it from LLM-as-a-Judge in the left nav. You'll need a connected judge model — see Provider API Keys.
How it works
An evaluator config ties together a template, a judge model, what to evaluate, and how often. As traces arrive (subject to your filters and sampling), the judge model grades them and the result is stored as a score.
Key concepts
| Concept | Meaning |
|---|---|
| Evaluator config | A running evaluation: which template, which model, what to target, and the output score name. |
| Template | The judge prompt plus the expected output type (numeric or categorical). Use a built-in template or create your own. |
| Judge model | The LLM that grades — any connected provider (OpenAI, Anthropic, Bedrock, Vertex, Google). |
| Target | What gets evaluated — the trace, a specific observation type, or a generation. |
| Variable mapping | Maps template variables (e.g. {{input}}, {{output}}) to fields on the trace/observation. |
| Sampling | The percentage of matching traces to evaluate. Use it to control cost and volume. |
| Filters | Limit which traces are evaluated (by environment, tags, user, etc.). |
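
To make the relationship between these concepts concrete, here is a minimal Python sketch of an evaluator config as a data structure. It is an illustration only: the class names, field names, the dotted field paths, and the `my-judge-model` identifier are all hypothetical and are not the product's API or storage format.

```python
from dataclasses import dataclass, field
from typing import Literal


@dataclass(frozen=True)
class Template:
    """Hypothetical template: a judge prompt plus the kind of score it yields."""
    name: str
    prompt: str
    output_type: Literal["numeric", "categorical"]


@dataclass(frozen=True)
class EvaluatorConfig:
    """Hypothetical evaluator config tying the concepts in the table together."""
    score_name: str                      # name of the score written to each trace
    template: Template                   # judge prompt and output type
    judge_model: str                     # identifier of a connected model
    target: Literal["trace", "observation", "generation"]
    variable_mapping: dict[str, str]     # template variable -> field path
    sampling_rate: float = 1.0           # fraction of matching traces, 0.0 to 1.0
    filters: dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject rates outside the valid range early.
        if not 0.0 <= self.sampling_rate <= 1.0:
            raise ValueError("sampling_rate must be between 0.0 and 1.0")


config = EvaluatorConfig(
    score_name="helpfulness",
    template=Template(
        name="helpfulness",
        prompt="Rate how helpful the answer is.\nQuestion: {{input}}\nAnswer: {{output}}",
        output_type="numeric",
    ),
    judge_model="my-judge-model",  # placeholder, not a real model name
    target="generation",
    variable_mapping={"input": "trace.input", "output": "generation.output"},
    sampling_rate=0.1,
    filters={"environment": "production"},
)
```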
Setting up an evaluator
Choose a template
Pick a built-in template (for example task accuracy, or a RAG metric) or create a custom one with your judge prompt and the score it should output (numeric or categorical).
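
A custom template has to state the score it expects back, so it helps to think about how a judge reply becomes a numeric or categorical value. The Python sketch below shows one possible way to validate a reply. The function names, the 0 to 1 default range, and the parsing rules are assumptions for illustration; the product may parse judge output differently.

```python
import re


def parse_numeric(reply: str, low: float = 0.0, high: float = 1.0) -> float:
    """Extract the first number in the reply and check it lies in [low, high]."""
    match = re.search(r"-?\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"no number found in judge reply: {reply!r}")
    value = float(match.group())
    if not low <= value <= high:
        raise ValueError(f"score {value} is outside the range [{low}, {high}]")
    return value


def parse_categorical(reply: str, categories: list[str]) -> str:
    """Match the reply against the allowed categories, ignoring case and whitespace."""
    normalized = reply.strip().lower()
    for category in categories:
        if normalized == category.lower():
            return category
    raise ValueError(f"reply {reply!r} is not one of {categories}")


numeric_score = parse_numeric("Score: 0.8")
label = parse_categorical(" Correct\n", ["correct", "incorrect"])
```

Rejecting replies that do not fit the declared output keeps malformed judge responses from being stored as scores.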
Pick the judge model
Select a connected model to act as the grader, and set its parameters.
Map variables
Map the template's variables to the trace/observation fields they should read — for example {{input}} → the trace input, {{output}} → the generation output.
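
As a rough illustration of what variable mapping does, the Python sketch below fills `{{variable}}` placeholders from a nested record. The dotted path syntax (`trace.input`), the record layout, and the function names are assumptions made for this example, not the product's actual mapping format.

```python
import re
from typing import Any


def resolve_path(data: dict[str, Any], path: str) -> Any:
    """Follow a dotted path such as 'trace.input' through nested dicts."""
    current: Any = data
    for key in path.split("."):
        current = current[key]
    return current


def render_prompt(template: str, mapping: dict[str, str], data: dict[str, Any]) -> str:
    """Replace each {{variable}} with the value found at its mapped path."""
    def substitute(match: re.Match) -> str:
        variable = match.group(1)
        # An unmapped variable raises KeyError instead of being left blank.
        return str(resolve_path(data, mapping[variable]))

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


record = {
    "trace": {"input": "What is 2 + 2?"},
    "generation": {"output": "4"},
}
mapping = {"input": "trace.input", "output": "generation.output"}
prompt = render_prompt("Question: {{input}}\nAnswer: {{output}}", mapping, record)
print(prompt)
# Question: What is 2 + 2?
# Answer: 4
```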
Scope and sample
Add filters (environment, tags, …) to target the right traces, and set a sampling rate. Optionally add a delay so evaluation waits until a trace is complete.
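
The scoping step can be pictured as three checks: does the trace match the filters, has the delay elapsed, and does the trace fall in the sample. The Python sketch below is one way to express that logic. Everything in it is hypothetical, including the flat trace dict, the exact-match filters, and the hash-based sampling, which is a design choice made here so that the same trace always gets the same decision. The product's own sampling method is not specified in this page.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def matches_filters(trace: dict, filters: dict) -> bool:
    """True when every filter key has exactly the requested value on the trace."""
    return all(trace.get(key) == value for key, value in filters.items())


def is_sampled(trace_id: str, rate: float) -> bool:
    """Hash the trace ID into [0, 1) and keep it if it falls below the rate."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def is_ready(trace_time: datetime, delay: timedelta, now: datetime) -> bool:
    """True once the trace is at least `delay` old, so late data has arrived."""
    return now - trace_time >= delay


def should_evaluate(trace: dict, filters: dict, rate: float,
                    delay: timedelta, now: datetime) -> bool:
    return (
        matches_filters(trace, filters)
        and is_ready(trace["timestamp"], delay, now)
        and is_sampled(trace["id"], rate)
    )


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
trace = {
    "id": "trace-123",
    "environment": "production",
    "timestamp": now - timedelta(minutes=5),
}
decision = should_evaluate(
    trace,
    filters={"environment": "production"},
    rate=1.0,                      # evaluate every matching trace
    delay=timedelta(seconds=30),
    now=now,
)
```

With `rate=1.0` every matching, sufficiently old trace is selected; lowering the rate keeps a stable subset rather than a different random subset on every pass.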
Run
Save the evaluator. New matching traces are scored automatically; results appear as scores on each trace and in score charts.
Sampling keeps cost in check. Evaluating 100% of traffic with a strong judge model can be expensive — start with a sample (e.g. 10–20%), confirm the scores are useful, then scale up.
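
To see why the sampling rate matters, a back-of-the-envelope estimate is often enough. The Python sketch below multiplies volume, sampling rate, tokens per evaluation, and token price. All numbers are made-up placeholders (50,000 traces per day, 1,500 tokens per evaluation, 5 USD per million tokens), not real pricing for any provider.

```python
def monthly_judge_cost(
    traces_per_day: int,
    sampling_rate: float,
    tokens_per_eval: int,
    usd_per_million_tokens: float,
    days: int = 30,
) -> float:
    """Estimate the monthly judge spend in USD for one evaluator."""
    evaluations = traces_per_day * sampling_rate * days
    total_tokens = evaluations * tokens_per_eval
    return total_tokens * usd_per_million_tokens / 1_000_000


# Hypothetical workload: 50,000 traces/day, 1,500 tokens per evaluation,
# 5 USD per million tokens.
full_cost = monthly_judge_cost(50_000, 1.0, 1_500, 5.0)
sampled_cost = monthly_judge_cost(50_000, 0.1, 1_500, 5.0)

print(f"100% sampling: ${full_cost:,.0f} per month")    # $11,250
print(f"10% sampling:  ${sampled_cost:,.0f} per month")  # $1,125
```

Cost scales linearly with the sampling rate, so the estimate for any other rate follows directly from the 100% figure.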
Evaluating offline (experiments)
You can also use evaluators to score experiment runs against a dataset, so you can compare prompt or model versions on the same inputs before shipping.
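
The idea of an offline comparison can be sketched in a few lines of Python: run each candidate version over the same dataset, score every output with the same judge, and compare the averages. In this sketch the judge is a stub that checks for an exact match and merely stands in for an LLM call. The dataset shape, the function names, and both candidate versions are hypothetical.

```python
from statistics import mean
from typing import Callable


def exact_match_judge(question: str, output: str, expected: str) -> float:
    """Stub judge standing in for an LLM call: 1.0 on exact match, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_experiment(
    dataset: list[dict[str, str]],
    generate: Callable[[str], str],
    judge: Callable[[str, str, str], float],
) -> float:
    """Run one version over every dataset item and return the mean score."""
    scores = []
    for item in dataset:
        output = generate(item["input"])
        scores.append(judge(item["input"], output, item["expected"]))
    return mean(scores)


dataset = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "3 * 3", "expected": "9"},
]


def version_a(question: str) -> str:
    # Deliberately weak candidate: always answers "4".
    return "4"


def version_b(question: str) -> str:
    # Stronger candidate backed by a lookup table, for illustration only.
    answers = {"2 + 2": "4", "3 * 3": "9"}
    return answers.get(question, "")


score_a = run_experiment(dataset, version_a, exact_match_judge)  # 0.5
score_b = run_experiment(dataset, version_b, exact_match_judge)  # 1.0
```

Because both versions see identical inputs and the same judge, the difference in mean score reflects the version change rather than a change in traffic.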
Next steps
- Scores — where evaluation results land.
- Experiments — offline evaluation against datasets.
- Human Annotation — calibrate judges against human labels.