Human Annotation
Score and label your AI agent outputs through collaborative annotation queues with your team.
Human Annotation is a manual evaluation method where team members review and score traces, sessions, and observations. Use it to establish quality baselines, catch edge cases automated evaluators miss, and build labeled datasets for training.
Why Use Human Annotation
- Establish quality baselines — create human-scored benchmarks to calibrate your automated evaluators against
- Catch what automation misses — review nuanced outputs where automated scores fall short (tone, factual accuracy, safety)
- Collaborative review — distribute annotation work across your team with managed queues and progress tracking
- Consistent labeling — standardized score configurations ensure every reviewer uses the same criteria
- Feed evaluation loops — annotation scores flow into your experiment comparisons and model governance dashboards
Getting Started
Step 1: Create Score Configurations
Before annotating, define what you're scoring. Navigate to Project Settings > Score Configs and create your scoring criteria.
Score types available:
| Type | Example | Use Case |
|---|---|---|
| Numeric | 1-5 rating | Quality, relevance, helpfulness |
| Categorical | Good / Bad / Neutral | Quick triage, sentiment |
| Boolean | Yes / No | Factual correctness, safety pass/fail |
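The three score types above can be sketched as simple validated configurations. This is an illustrative model only — the class and field names are hypothetical, not the platform's actual schema:

```python
from dataclasses import dataclass

# Hypothetical sketch of the three score-config types.
# Names (NumericConfig, validate, etc.) are illustrative.

@dataclass
class NumericConfig:
    name: str
    min_value: float
    max_value: float

    def validate(self, value) -> bool:
        # Accept any value inside the configured range, e.g. 1-5.
        return self.min_value <= value <= self.max_value

@dataclass
class CategoricalConfig:
    name: str
    categories: list

    def validate(self, value) -> bool:
        # Accept only one of the predefined labels.
        return value in self.categories

@dataclass
class BooleanConfig:
    name: str

    def validate(self, value) -> bool:
        # Accept a strict Yes/No (True/False) judgment.
        return isinstance(value, bool)

quality = NumericConfig("quality", 1, 5)
sentiment = CategoricalConfig("sentiment", ["Good", "Bad", "Neutral"])
safe = BooleanConfig("safety_pass")
```

Validation at entry time is what keeps labels consistent across reviewers: a reviewer cannot submit a "6" on a 1-5 scale or invent a new category.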
Step 2: Create an Annotation Queue
Navigate to Human Annotation in the sidebar and click Create Annotation Queue.
Configure your queue:
- Name — descriptive name (e.g., "Weekly QA Review", "Safety Audit")
- Score configs — select which scores annotators will fill in
- Description — instructions for reviewers on how to score
Step 3: Add Items to the Queue
Populate your annotation queue with traces to review:
- From traces table — select traces and add them to a queue
- Automatically — configure filters to auto-populate queues based on criteria (e.g., low confidence scores, error traces)
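Auto-population amounts to a predicate evaluated against each incoming trace. A minimal sketch, assuming hypothetical trace fields (`error`, `confidence_score`) rather than any real export format:

```python
# Hypothetical auto-population filter: enqueue a trace for human
# review when it errored or its automated confidence is low.
# Field names are assumptions for illustration.

def should_enqueue(trace: dict, max_confidence: float = 0.5) -> bool:
    """Return True if a trace matches the queue's auto-add criteria."""
    if trace.get("error"):
        return True
    confidence = trace.get("confidence_score")
    return confidence is not None and confidence < max_confidence

traces = [
    {"id": "t1", "confidence_score": 0.3},
    {"id": "t2", "confidence_score": 0.9},
    {"id": "t3", "error": "timeout"},
]
queue_items = [t["id"] for t in traces if should_enqueue(t)]
# queue_items == ["t1", "t3"]
```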
Step 4: Annotate
Team members open the annotation queue and work through items one by one:
- Review the full trace context — input, output, intermediate steps
- See any existing automated scores for reference
- Apply scores based on the configured criteria
- Move to the next item
The annotation interface shows all relevant context so reviewers can make informed judgments without switching between views.
Annotation Queues
How They Work
Annotation queues organize the review workload:
- Each queue has a set of score configurations that annotators fill in
- Items are presented one at a time for focused review
- Progress tracking shows how many items are reviewed vs. remaining
- Multiple team members can work on the same queue simultaneously
Queue Management
| Action | Description |
|---|---|
| Create queue | Define name, description, and score configs |
| Add items | Populate from traces, observations, or sessions |
| Assign reviewers | Invite team members to annotate |
| Track progress | Monitor completion rate and annotation quality |
Scoring Traces Directly
You can also score individual traces without using queues:
- Open any trace in the platform
- Click the Annotate button
- Select a score configuration
- Enter your score value
- The score appears in the trace's Scores tab
This is useful for ad-hoc reviews, or when you notice an issue during routine trace exploration.
Using Annotation Data
In Experiments
When comparing experiment runs, annotation scores appear alongside automated scores. This lets you validate whether your automated evaluators agree with human judgment.
As Evaluation Baselines
Use annotation scores to:
- Calibrate LLM-as-Judge evaluators against human preferences
- Identify where automated scores diverge from human assessment
- Build training data for custom evaluation models
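One way to find where automated and human judgment diverge is to pair the two score sets per item and flag large gaps. A minimal sketch, with an assumed data layout (item id mapped to numeric score):

```python
# Illustrative divergence check between human annotation scores and
# automated (e.g. LLM-as-Judge) scores on the same items.
# The dict layout is an assumption, not a platform export format.

def divergent_items(human: dict, judge: dict, threshold: float = 1.0):
    """Return item ids where |human - judge| exceeds the threshold."""
    return sorted(
        item for item in human.keys() & judge.keys()
        if abs(human[item] - judge[item]) > threshold
    )

human_scores = {"t1": 5, "t2": 2, "t3": 4}
judge_scores = {"t1": 4.5, "t2": 4.5, "t3": 4}
# divergent_items(human_scores, judge_scores) == ["t2"]
```

The flagged items are exactly the ones worth re-reading when calibrating an evaluator's prompt or rubric.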
In Dashboards
Annotation scores flow into your project dashboards and can be filtered, aggregated, and tracked over time — giving you a human-quality signal alongside your automated metrics.
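Tracking an annotation score over time reduces to bucketing score values by period and averaging. A small sketch, assuming score records arrive as (week, value) pairs:

```python
from collections import defaultdict
from statistics import mean

# Illustrative time-series aggregation of a numeric annotation score.
# The (iso_week, value) record shape is an assumption.

def weekly_average(scores):
    """scores: iterable of (iso_week, value) pairs -> {week: mean}."""
    buckets = defaultdict(list)
    for week, value in scores:
        buckets[week].append(value)
    return {week: mean(vals) for week, vals in sorted(buckets.items())}

trend = weekly_average([("2024-W01", 4), ("2024-W01", 2), ("2024-W02", 5)])
# trend == {"2024-W01": 3, "2024-W02": 5}
```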
Best Practices
Define Clear Scoring Guidelines
Write explicit instructions for each score configuration:
- What does a score of 5 vs. a score of 1 mean?
- What edge cases should reviewers watch for?
- When should a reviewer skip vs. flag an item?
Start Small
Begin with a focused queue (50-100 items) on a specific quality dimension. Validate that the scoring criteria are clear before scaling up.
Calibrate Across Reviewers
Have multiple reviewers score the same items initially to check inter-annotator agreement. Adjust guidelines where disagreement is high.
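Inter-annotator agreement on categorical labels is commonly quantified with Cohen's kappa, which corrects raw percent agreement for agreement expected by chance. A minimal two-reviewer sketch:

```python
from collections import Counter

# Cohen's kappa for two reviewers over the same items.
# Assumes at least two distinct labels appear (otherwise the
# chance-agreement denominator is zero).

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both reviewers labeled alike.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each reviewer's label rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n)
        for c in counts_a.keys() | counts_b.keys()
    )
    return (observed - expected) / (1 - expected)

kappa = cohens_kappa(
    ["Good", "Good", "Bad", "Bad"],
    ["Good", "Bad", "Bad", "Bad"],
)  # 0.5
```

Values near 1 indicate strong agreement; values near 0 mean reviewers agree no more than chance — a signal that the scoring guidelines need tightening before you scale up.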