Datasets
Create and manage collections of test inputs and expected outputs for benchmarking, evaluating, and improving your AI agents.
Datasets are structured collections of inputs (and optionally expected outputs) that you use to test your AI application against known scenarios — from production edge cases to synthetic benchmarks.
Why Use Datasets
- Benchmark before deploying — test new model versions or prompt changes against a known set of inputs before shipping
- Capture production edge cases — save real traces that caused issues as test cases for regression testing
- Structured evaluation — run experiments across consistent inputs to compare models, prompts, or configurations
- Collaborative curation — your team can build and refine datasets together through the UI or SDKs
- Custom workflows — use datasets via the API for fine-tuning, few-shot prompting, or automated CI testing
Getting Started
Step 1: Create a Dataset
Navigate to Datasets in your project sidebar and click Create Dataset. Provide a name and optional description.
You can also create datasets programmatically:
```python
from ants_platform import AntsPlatform

ants = AntsPlatform()

dataset = ants.create_dataset(
    name="customer-support-qa",
    description="Common customer support questions with expected responses",
)
```

Step 2: Add Items
Each dataset item consists of an input (required) and an optional expected output. Add items through:
- UI — click "Add Item" in the dataset view
- SDK — create items programmatically
- CSV import — bulk upload from spreadsheets
- From traces — save production traces directly as dataset items
```python
ants.create_dataset_item(
    dataset_name="customer-support-qa",
    input={"question": "How do I reset my password?"},
    expected_output={"answer": "Go to Settings > Security > Reset Password"},
)
```

Step 3: Run Experiments
Use your dataset to benchmark different configurations. Each experiment run evaluates your application against every item in the dataset and records the results.
Navigate to the dataset and click New Run, or trigger runs via the SDK for automated testing pipelines.
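The exact run-recording call is documented in the SDK reference; the sketch below only illustrates the shape of an automated pipeline, where my_app is a placeholder for your own application and attribute access on items (item.input, item.expected_output) is assumed to match the fields described under Dataset Items below.

```python
# Sketch of an automated evaluation loop. my_app is a placeholder for your own
# application; see the SDK reference for the exact run-recording API.
def my_app(app_input):
    # Replace with a call into your application.
    return {"answer": "..."}

dataset = ants.get_dataset("customer-support-qa")

results = []
for item in dataset.items:
    output = my_app(item.input)  # run your application on each dataset input
    results.append(
        {
            "input": item.input,
            "output": output,
            "expected_output": item.expected_output,  # may be None for items without one
        }
    )

# Record results as an experiment run so scores, latency, and cost show up in
# the dataset dashboard (the run-creation method name is in the SDK docs).
```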
Step 4: Compare Results
View experiment runs side-by-side in the dataset dashboard. Compare scores, latency, cost, and output quality across different model versions or prompt configurations.
Dataset Items
Structure
Each dataset item contains:
| Field | Required | Description |
|---|---|---|
| input | Yes | The input to your application (JSON) |
| expected_output | No | The expected/ideal output for evaluation |
| metadata | No | Additional context (tags, categories, difficulty level) |
| source_trace_id | No | Link back to the production trace this item was created from |
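As a sketch, the optional fields can be set when creating an item through the SDK. The metadata and source_trace_id parameter names below simply mirror the field names in the table, so check them against the SDK reference; the trace ID shown is a placeholder.

```python
# Dataset item with all fields populated. Parameter names are assumed to match
# the field names in the table above; the trace ID is a placeholder value.
ants.create_dataset_item(
    dataset_name="customer-support-qa",
    input={"question": "Why was my card declined?"},
    expected_output={"answer": "Ask the customer to verify the billing address on file."},
    metadata={"category": "billing", "difficulty": "medium"},
    source_trace_id="trace-abc123",
)
```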
Adding Items from Production
When reviewing traces in the platform, you can save any trace as a dataset item with one click. This is the fastest way to build datasets from real-world usage — especially for edge cases and failures.
CSV Import
For bulk imports, use the CSV upload feature:
- Go to your dataset
- Click Import CSV
- Map CSV columns to input/expected output fields
- Preview and confirm the import
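If you prefer to script a bulk import instead of using the UI, the same column mapping can be done with a short loop over the file. The questions.csv file and its question and answer columns below are just an assumed example layout.

```python
import csv

# Bulk-import items from a local CSV file. questions.csv and its "question"
# and "answer" columns are an assumed example layout.
with open("questions.csv", newline="") as f:
    for row in csv.DictReader(f):
        ants.create_dataset_item(
            dataset_name="customer-support-qa",
            input={"question": row["question"]},
            expected_output={"answer": row["answer"]},
        )
```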
Experiment Runs
Each experiment run evaluates your application against a dataset and records:
- Output for each input
- Scores from evaluators (automated or manual)
- Latency per item
- Cost per item
- Trace links for debugging individual items
Comparing Runs
The runs table shows aggregate metrics across all runs for a dataset. Click into any run to see item-level results, or compare multiple runs side-by-side.
Organizing Datasets
Use descriptive names with forward slashes to create folder structures:
```
evaluation/qa-dataset
evaluation/safety-checks
benchmarks/model-comparison-v2
production-edge-cases/auth-failures
```
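The folder path is just part of the dataset name, so the create_dataset call shown earlier handles it; the name and description below are illustrative.

```python
# Slash-separated names group datasets into folders in the UI.
ants.create_dataset(
    name="evaluation/safety-checks",
    description="Adversarial prompts for safety regression testing",
)
```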
API Access
All dataset operations are available via the Python and JavaScript SDKs:
```python
# List datasets
datasets = ants.get_datasets()

# Get a specific dataset
dataset = ants.get_dataset("customer-support-qa")

# List items
items = dataset.items
```

See the Python SDK and JS/TS SDK docs for the full API reference.