Scores

Scores are how quality is measured on the platform. A score is a value attached to a trace, observation, or session. It can come from one of three sources: your code (via the API), an automated evaluator, or human annotation.

Open the Scores view from the left nav to browse and analyze all scores. Define reusable scoring criteria under Settings → Score Configs.

Score types

A score config defines a reusable, consistent scoring criterion. Each config has one of three types:

Type	Values	Example
Numeric	A number, optionally bounded (e.g. 0–1)	`quality = 0.82`
Categorical	One of a fixed set of options	`sentiment = "positive"`
Boolean	True / false	`passed = true`

Defining a config (name, type, and range or categories) ensures that every source (API, evaluator, or annotator) scores the same thing the same way.
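
To make the idea concrete, here is a minimal sketch of what a score config enforces. It does not use the platform's SDK: the `ScoreConfig` class, its field names, and the type strings are hypothetical and chosen only for illustration. It shows how one definition can accept or reject a value regardless of which source produced it.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Union

ScoreValue = Union[bool, int, float, str]


@dataclass(frozen=True)
class ScoreConfig:
    """Hypothetical stand-in for a score config: name, type, range or categories."""

    name: str
    data_type: str  # assumed labels: "NUMERIC", "CATEGORICAL", "BOOLEAN"
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    categories: Sequence[str] = ()

    def validate(self, value: ScoreValue) -> bool:
        """Return True if the value is allowed by this config."""
        if self.data_type == "BOOLEAN":
            return isinstance(value, bool)
        if self.data_type == "CATEGORICAL":
            return isinstance(value, str) and value in self.categories
        if self.data_type == "NUMERIC":
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                return False
            if self.min_value is not None and value < self.min_value:
                return False
            if self.max_value is not None and value > self.max_value:
                return False
            return True
        raise ValueError(f"unknown score type: {self.data_type}")


# The three examples from the table above, expressed as configs.
quality = ScoreConfig("quality", "NUMERIC", min_value=0.0, max_value=1.0)
sentiment = ScoreConfig(
    "sentiment", "CATEGORICAL", categories=("positive", "neutral", "negative")
)
passed = ScoreConfig("passed", "BOOLEAN")

print(quality.validate(0.82), sentiment.validate("positive"), passed.validate(True))
```

Because the check lives in the config rather than in each caller, an evaluator and an annotator that both write `quality` are held to the same 0–1 range.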

Where scores come from

Source	How
API	Your application submits scores directly via the SDK/API (e.g. user thumbs-up, business outcome).
Evaluation	An LLM-as-a-Judge evaluator grades traces automatically.
Annotation	A reviewer scores traces in an annotation queue.
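
As an illustration of the API source, the sketch below assembles a score record for a user thumbs-up. The payload shape, the field names (`trace_id`, `observation_id`, `session_id`, `comment`), and the `build_score` helper are assumptions made for this example, not the platform's actual SDK or wire format. Check the SDK/API reference for the real call.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Union


@dataclass
class ScorePayload:
    """Hypothetical score record; field names are illustrative only."""

    name: str
    value: Union[bool, float, str]
    source: str  # assumed labels: "API", "EVALUATION", "ANNOTATION"
    trace_id: Optional[str] = None
    observation_id: Optional[str] = None
    session_id: Optional[str] = None
    comment: Optional[str] = None


def build_score(
    name: str,
    value: Union[bool, float, str],
    *,
    trace_id: Optional[str] = None,
    observation_id: Optional[str] = None,
    session_id: Optional[str] = None,
    comment: Optional[str] = None,
) -> dict:
    """Build an API-sourced score, requiring at least one attachment target."""
    if not (trace_id or observation_id or session_id):
        raise ValueError("a score must be attached to a trace, observation, or session")
    payload = ScorePayload(
        name=name,
        value=value,
        source="API",
        trace_id=trace_id,
        observation_id=observation_id,
        session_id=session_id,
        comment=comment,
    )
    # Drop unset optional fields; a value of False is kept because it is not None.
    return {k: v for k, v in asdict(payload).items() if v is not None}


# Example: record a user thumbs-up against a (made-up) trace ID.
thumbs_up = build_score("user_feedback", True, trace_id="trace-123", comment="thumbs-up")
print(json.dumps(thumbs_up, sort_keys=True))
```

In a real integration the resulting record would be handed to the SDK or posted to the scores endpoint. That step is omitted here because it depends on the platform client.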

The Scores view

Browse every score in the project. Filter by score name, source, type, value range, time, and the trace, observation, or session the score is attached to. Numeric scores are charted as time series and histograms; categorical scores are shown as distributions. Scores also appear as columns on the traces, sessions, and users tables, and are aggregated across sessions and experiment runs.
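
The filtering and charting described above can be approximated offline. The following sketch works on a small in-memory list of score records. The record shape, the sample values, and the `filter_scores` and `histogram` helpers are all hypothetical, and the view's real bucketing may differ.

```python
from collections import Counter
from typing import Iterable, List, Optional

# Made-up sample records; a real export would carry more fields.
scores = [
    {"name": "quality", "source": "EVALUATION", "value": 0.82},
    {"name": "quality", "source": "EVALUATION", "value": 0.35},
    {"name": "quality", "source": "API", "value": 0.91},
    {"name": "sentiment", "source": "ANNOTATION", "value": "positive"},
    {"name": "sentiment", "source": "ANNOTATION", "value": "negative"},
    {"name": "sentiment", "source": "API", "value": "positive"},
]


def filter_scores(
    rows: Iterable[dict], *, name: Optional[str] = None, source: Optional[str] = None
) -> List[dict]:
    """Keep rows matching every filter that was supplied."""
    return [
        r
        for r in rows
        if (name is None or r["name"] == name)
        and (source is None or r["source"] == source)
    ]


def histogram(
    values: Iterable[float], *, lo: float = 0.0, hi: float = 1.0, bins: int = 5
) -> List[int]:
    """Count values into equal-width bins over [lo, hi]; hi falls in the last bin."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        if lo <= v <= hi:
            counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts


# Numeric scores become a histogram; categorical scores become a distribution.
quality_hist = histogram(r["value"] for r in filter_scores(scores, name="quality"))
sentiment_dist = Counter(r["value"] for r in filter_scores(scores, name="sentiment"))

print(quality_hist)
print(dict(sentiment_dist))
```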

Using scores

Track quality over time — chart a score to catch regressions.
Filter to failures — find low-scoring traces to debug or add to a dataset.
Compare versions — scores are the basis for experiment comparisons.
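
A rough sketch of the three uses listed above, again on made-up in-memory records: it averages a score per group, flags a drop between two versions, and lists low-scoring traces. The row fields (`trace_id`, `version`, `day`), the 0.1 regression margin, and the 0.5 failure threshold are assumptions for illustration, not platform defaults.

```python
from statistics import mean
from typing import Dict, Iterable, List


def mean_by_group(rows: Iterable[dict], key: str) -> Dict[str, float]:
    """Average the score value for each distinct value of `key`."""
    groups: Dict[str, List[float]] = {}
    for r in rows:
        groups.setdefault(r[key], []).append(r["value"])
    return {g: mean(vs) for g, vs in groups.items()}


def low_scoring(rows: Iterable[dict], threshold: float) -> List[str]:
    """Return trace IDs whose score is below the threshold."""
    return [r["trace_id"] for r in rows if r["value"] < threshold]


# Hypothetical quality scores for two versions over two days.
rows = [
    {"trace_id": "t1", "version": "v1", "day": "2024-01-01", "value": 0.9},
    {"trace_id": "t2", "version": "v1", "day": "2024-01-01", "value": 0.7},
    {"trace_id": "t3", "version": "v2", "day": "2024-01-02", "value": 0.4},
    {"trace_id": "t4", "version": "v2", "day": "2024-01-02", "value": 0.6},
]

by_day = mean_by_group(rows, "day")          # track quality over time
by_version = mean_by_group(rows, "version")  # compare versions
regressed = by_version["v2"] < by_version["v1"] - 0.1  # assumed margin
failures = low_scoring(rows, 0.5)            # filter to failures

print(by_day, by_version, regressed, failures)
```

The traces returned by `low_scoring` are the ones you would open to debug or add to a dataset.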

Next steps

LLM-as-a-Judge — generate scores automatically.
Human Annotation — score traces by hand.
Experiments — compare runs by score.