Human Annotation
Score and label your AI agent outputs through collaborative annotation queues with your team.
Human Annotation is a manual evaluation method where team members review and score traces, sessions, and observations. Use it to establish quality baselines, catch edge cases automated evaluators miss, and build labeled datasets for training.
Why Use Human Annotation
- Establish quality baselines — create human-scored benchmarks to calibrate your automated evaluators against
- Catch what automation misses — review nuanced outputs where automated scores fall short (tone, factual accuracy, safety)
- Collaborative review — distribute annotation work across your team with managed queues and progress tracking
- Consistent labeling — standardized score configurations ensure every reviewer uses the same criteria
- Feed evaluation loops — annotation scores flow into your experiment comparisons and model governance dashboards
Getting Started
Step 1: Create Score Configurations
Before annotating, define what you're scoring. Navigate to Project Settings > Score Configs and create your scoring criteria.
Score types available:
| Type | Example | Use Case |
|---|---|---|
| Numeric | 1-5 rating | Quality, relevance, helpfulness |
| Categorical | Good / Bad / Neutral | Quick triage, sentiment |
| Boolean | Yes / No | Factual correctness, safety pass/fail |
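The three score types above can be sketched as simple validated configurations. This is an illustrative model only — the class and field names are hypothetical, not the platform's actual schema:

```python
from dataclasses import dataclass

# Hypothetical sketch of the three score-config types.
# Names (NumericConfig, validate, etc.) are illustrative.

@dataclass
class NumericConfig:
    name: str
    min_value: float
    max_value: float

    def validate(self, value) -> bool:
        # Accept any value inside the configured range, e.g. 1-5.
        return self.min_value <= value <= self.max_value

@dataclass
class CategoricalConfig:
    name: str
    categories: list

    def validate(self, value) -> bool:
        # Accept only one of the predefined labels.
        return value in self.categories

@dataclass
class BooleanConfig:
    name: str

    def validate(self, value) -> bool:
        # Accept a strict Yes/No (True/False) judgment.
        return isinstance(value, bool)

quality = NumericConfig("quality", 1, 5)
sentiment = CategoricalConfig("sentiment", ["Good", "Bad", "Neutral"])
safe = BooleanConfig("safety_pass")
```

Validation at entry time is what keeps labels consistent across reviewers: a reviewer cannot submit a "6" on a 1-5 scale or invent a new category.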
Step 2: Create an Annotation Queue
Navigate to Human Annotation in the sidebar and click Create Annotation Queue.
Configure your queue:
- Name — descriptive name (e.g., "Weekly QA Review", "Safety Audit")
- Score configs — select which scores annotators will fill in
- Description — instructions for reviewers on how to score
Step 3: Add Items to the Queue
Populate your annotation queue with traces to review:
- From traces table — select traces and add them to a queue
- Automatically — configure filters to auto-populate queues based on criteria (e.g., low confidence scores, error traces)
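Auto-population amounts to a predicate evaluated against each incoming trace. A minimal sketch, assuming hypothetical trace fields (`error`, `confidence_score`) rather than any real export format:

```python
# Hypothetical auto-population filter: enqueue a trace for human
# review when it errored or its automated confidence is low.
# Field names are assumptions for illustration.

def should_enqueue(trace: dict, max_confidence: float = 0.5) -> bool:
    """Return True if a trace matches the queue's auto-add criteria."""
    if trace.get("error"):
        return True
    confidence = trace.get("confidence_score")
    return confidence is not None and confidence < max_confidence

traces = [
    {"id": "t1", "confidence_score": 0.3},
    {"id": "t2", "confidence_score": 0.9},
    {"id": "t3", "error": "timeout"},
]
queue_items = [t["id"] for t in traces if should_enqueue(t)]
# queue_items == ["t1", "t3"]
```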
Step 4: Annotate
Team members open the annotation queue and work through items one by one:
- Review the full trace context — input, output, intermediate steps
- See any existing automated scores for reference
- Apply scores based on the configured criteria
- Move to the next item
The annotation interface shows all relevant context so reviewers can make informed judgments without switching between views.
Annotation Queues
How They Work
Annotation queues organize the review workload:
- Each queue has a set of score configurations that annotators fill in
- Items are presented one at a time for focused review
- Progress tracking shows how many items are reviewed vs. remaining
- Multiple team members can work on the same queue simultaneously
Queue Management
| Action | Description |
|---|---|
| Create queue | Define name, description, and score configs |
| Add items | Populate from traces, observations, or sessions |
| Assign reviewers | Invite team members to annotate |
| Track progress | Monitor completion rate and annotation quality |
Scoring Traces Directly
You can also score individual traces without using queues:
- Open any trace in the platform
- Click the Annotate button
- Select a score configuration
- Enter your score value
- The score appears in the trace's Scores tab
This is useful for ad-hoc reviews, or when you notice an issue during routine trace exploration.
Using Annotation Data
In Experiments
When comparing experiment runs, annotation scores appear alongside automated scores. This lets you validate whether your automated evaluators agree with human judgment.
As Evaluation Baselines
Use annotation scores to:
- Calibrate LLM-as-Judge evaluators against human preferences
- Identify where automated scores diverge from human assessment
- Build training data for custom evaluation models
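One way to find where automated and human judgment diverge is to pair the two score sets per item and flag large gaps. A minimal sketch, with an assumed data layout (item id mapped to numeric score):

```python
# Illustrative divergence check between human annotation scores and
# automated (e.g. LLM-as-Judge) scores on the same items.
# The dict layout is an assumption, not a platform export format.

def divergent_items(human: dict, judge: dict, threshold: float = 1.0):
    """Return item ids where |human - judge| exceeds the threshold."""
    return sorted(
        item for item in human.keys() & judge.keys()
        if abs(human[item] - judge[item]) > threshold
    )

human_scores = {"t1": 5, "t2": 2, "t3": 4}
judge_scores = {"t1": 4.5, "t2": 4.5, "t3": 4}
# divergent_items(human_scores, judge_scores) == ["t2"]
```

The flagged items are exactly the ones worth re-reading when calibrating an evaluator's prompt or rubric.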
In Dashboards
Annotation scores flow into your project dashboards and can be filtered, aggregated, and tracked over time — giving you a human-quality signal alongside your automated metrics.
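Tracking an annotation score over time reduces to bucketing score values by period and averaging. A small sketch, assuming score records arrive as (week, value) pairs:

```python
from collections import defaultdict
from statistics import mean

# Illustrative time-series aggregation of a numeric annotation score.
# The (iso_week, value) record shape is an assumption.

def weekly_average(scores):
    """scores: iterable of (iso_week, value) pairs -> {week: mean}."""
    buckets = defaultdict(list)
    for week, value in scores:
        buckets[week].append(value)
    return {week: mean(vals) for week, vals in sorted(buckets.items())}

trend = weekly_average([("2024-W01", 4), ("2024-W01", 2), ("2024-W02", 5)])
# trend == {"2024-W01": 3, "2024-W02": 5}
```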
Best Practices
Define Clear Scoring Guidelines
Write explicit instructions for each score configuration:
- What does a score of 5 vs. a score of 1 mean?
- What edge cases should reviewers watch for?
- When should a reviewer skip vs. flag an item?
Start Small
Begin with a focused queue (50-100 items) on a specific quality dimension. Validate that the scoring criteria are clear before scaling up.
Calibrate Across Reviewers
Have multiple reviewers score the same items initially to check inter-annotator agreement. Adjust guidelines where disagreement is high.
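Inter-annotator agreement on categorical labels is commonly quantified with Cohen's kappa, which corrects raw percent agreement for agreement expected by chance. A minimal two-reviewer sketch:

```python
from collections import Counter

# Cohen's kappa for two reviewers over the same items.
# Assumes at least two distinct labels appear (otherwise the
# chance-agreement denominator is zero).

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both reviewers labeled alike.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each reviewer's label rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n)
        for c in counts_a.keys() | counts_b.keys()
    )
    return (observed - expected) / (1 - expected)

kappa = cohens_kappa(
    ["Good", "Good", "Bad", "Bad"],
    ["Good", "Bad", "Bad", "Bad"],
)  # 0.5
```

Values near 1 indicate strong agreement; values near 0 mean reviewers agree no more than chance — a signal that the scoring guidelines need tightening before you scale up.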