Scores
Scores are how the platform measures quality. A score is a value attached to a trace, observation, or session, and it comes from one of three sources: your code (via the API), an automated evaluator, or human annotation.
Open the Scores view from the left nav to browse and analyze all scores. Define reusable scoring criteria under Settings → Score Configs.
Score types
A score config defines a reusable, consistent scoring criterion. Each config has one of three types:
| Type | Values | Example |
|---|---|---|
| Numeric | A number, optionally bounded (e.g. 0–1) | quality = 0.82 |
| Categorical | One of a fixed set of options | sentiment = "positive" |
| Boolean | True / false | passed = true |
Defining a config (a name, a type, and, where applicable, a range or a list of categories) ensures that every source, whether API, evaluator, or annotator, scores the same thing the same way.
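To make the idea of a config concrete, here is a minimal sketch that models a score config as a plain Python dataclass and checks values against it. The names `ScoreConfig`, `data_type`, `min_value`, `max_value`, `categories`, and `validate` are hypothetical and are not the platform's SDK; the extra sentiment categories are illustrative. It only mirrors the three types in the table above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Union

ScoreValue = Union[float, str, bool]


@dataclass(frozen=True)
class ScoreConfig:
    """Hypothetical stand-in for a score config (not the platform's API)."""

    name: str
    data_type: str  # "NUMERIC", "CATEGORICAL", or "BOOLEAN"
    min_value: Optional[float] = None  # optional lower bound (numeric only)
    max_value: Optional[float] = None  # optional upper bound (numeric only)
    categories: Sequence[str] = ()     # allowed options (categorical only)

    def validate(self, value: ScoreValue) -> bool:
        """Return True if `value` is acceptable under this config."""
        if self.data_type == "BOOLEAN":
            return isinstance(value, bool)
        if self.data_type == "CATEGORICAL":
            return isinstance(value, str) and value in self.categories
        if self.data_type == "NUMERIC":
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                return False
            if self.min_value is not None and value < self.min_value:
                return False
            if self.max_value is not None and value > self.max_value:
                return False
            return True
        raise ValueError(f"unknown data type: {self.data_type}")


# One config per row of the table above.
quality = ScoreConfig("quality", "NUMERIC", min_value=0.0, max_value=1.0)
sentiment = ScoreConfig(
    "sentiment", "CATEGORICAL", categories=("positive", "neutral", "negative")
)
passed = ScoreConfig("passed", "BOOLEAN")
```

With a shared config like this, `quality.validate(1.5)` is rejected no matter whether the value came from application code, an evaluator, or a reviewer.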
Where scores come from
| Source | How |
|---|---|
| API | Your application submits scores directly via the SDK/API (e.g. user thumbs-up, business outcome). |
| Evaluation | An LLM-as-a-Judge evaluator grades traces automatically. |
| Annotation | A reviewer scores traces in an annotation queue. |
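Whatever the source, a score carries the same basic information: a name, a value, where it came from, and what it is attached to. The sketch below shows one possible shape for such a record. The `Score` class, its field names, and the source labels are assumptions made for illustration; this page does not show the platform's actual SDK call or payload, so consult the SDK/API reference for those.

```python
from dataclasses import asdict, dataclass
from typing import Optional, Union

# Hypothetical labels for the three sources in the table above.
SOURCES = {"API", "EVALUATION", "ANNOTATION"}


@dataclass
class Score:
    """Hypothetical score record (not the platform's wire format)."""

    name: str
    value: Union[float, str, bool]
    source: str
    trace_id: Optional[str] = None
    observation_id: Optional[str] = None
    session_id: Optional[str] = None

    def __post_init__(self) -> None:
        if self.source not in SOURCES:
            raise ValueError(f"unknown source: {self.source}")
        # A score must be attached to a trace, observation, or session.
        if not (self.trace_id or self.observation_id or self.session_id):
            raise ValueError("score must be attached to a target")


# Example: a user thumbs-up recorded by application code.
thumbs_up = Score(
    name="user_feedback", value=True, source="API", trace_id="trace-123"
)

# Drop unset fields before sending the record anywhere.
payload = {k: v for k, v in asdict(thumbs_up).items() if v is not None}
```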
The Scores view
Browse every score in the project. Filter by score name, source, type, value range, time, and the trace/observation/session it's attached to. Numeric scores chart as time series and histograms; categorical scores show as distributions. Scores also appear as columns on the traces, sessions, and users tables, and roll up across sessions and experiment runs.
Using scores
- Track quality over time — chart a score to catch regressions.
- Filter to failures — find low-scoring traces to debug or add to a dataset.
- Compare versions — scores are the basis for experiment comparisons.
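The filtering and comparison workflows above can be sketched offline with plain Python. This assumes scores have been exported as a list of dictionaries; the field names (`trace_id`, `name`, `value`, `version`), the helper functions, and the sample values are all made up for illustration and do not reflect a documented export format.

```python
from collections import defaultdict
from statistics import mean

# Made-up sample data standing in for exported scores.
scores = [
    {"trace_id": "t1", "name": "quality", "value": 0.5, "version": "v1"},
    {"trace_id": "t2", "name": "quality", "value": 0.25, "version": "v1"},
    {"trace_id": "t3", "name": "quality", "value": 0.75, "version": "v1"},
    {"trace_id": "t4", "name": "quality", "value": 0.75, "version": "v2"},
    {"trace_id": "t5", "name": "quality", "value": 1.0, "version": "v2"},
]


def low_scoring(records, name, threshold):
    """Return trace IDs whose score `name` falls below `threshold`."""
    return [
        r["trace_id"]
        for r in records
        if r["name"] == name and r["value"] < threshold
    ]


def mean_by_version(records, name):
    """Average score `name` per version, for a simple comparison."""
    grouped = defaultdict(list)
    for r in records:
        if r["name"] == name:
            grouped[r["version"]].append(r["value"])
    return {version: mean(values) for version, values in grouped.items()}


failures = low_scoring(scores, "quality", threshold=0.5)
averages = mean_by_version(scores, "quality")
```

Here `failures` holds the traces worth debugging or adding to a dataset, and `averages` gives a per-version number to compare, which is a simplified stand-in for what an experiment comparison does with scores.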
Next steps
- LLM-as-a-Judge — generate scores automatically.
- Human Annotation — score traces by hand.
- Experiments — compare runs by score.