Experiments

An experiment runs a prompt + model configuration against every item in a dataset, records the outputs, and scores them — so you can compare prompt versions or models on consistent inputs before you ship.

Experiments run against a dataset. Create or pick a dataset first, then start a run from it.

Concepts

| Concept | Meaning |
| --- | --- |
| Experiment / run | One execution of a (prompt + model) over a dataset, producing a result per item. |
| Baseline vs. comparison | The model/config you're testing, optionally against alternatives to compare. |
| Run item | The result for a single dataset item: output, latency, cost, and scores. |
| Variables | Dataset items must supply the variables your prompt requires; the platform validates this before running. |
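
To make these concepts concrete, here is a minimal sketch of how they could be modeled as plain data. The class and field names (`DatasetItem`, `RunItem`, `ExperimentRun`, `latency_ms`, `cost_usd`) are assumptions chosen for illustration, not the platform's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DatasetItem:
    """One dataset row: prompt variables plus an optional expected output."""
    item_id: str
    variables: dict[str, str]
    expected_output: Optional[str] = None


@dataclass
class RunItem:
    """The result for a single dataset item (hypothetical field names)."""
    item_id: str
    output: str
    latency_ms: float
    cost_usd: float
    scores: dict[str, float] = field(default_factory=dict)


@dataclass
class ExperimentRun:
    """One execution of a (prompt + model) configuration over a dataset."""
    prompt_version: str
    model: str
    parameters: dict[str, float]
    items: list[RunItem] = field(default_factory=list)
```

A run holds exactly one `RunItem` per `DatasetItem`, which is what makes two runs over the same dataset comparable item by item.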

Running an experiment

Pick a dataset

Open the dataset you want to benchmark against. Each item provides the inputs (and optionally expected outputs).

Choose prompt and model

Select the prompt version and the model + parameters to test. The platform checks that dataset items provide every variable the prompt needs.
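
The variable check described above can be approximated locally before you start a run. This sketch assumes prompts use Python-style `{name}` placeholders, which may differ from the platform's template syntax; `required_variables` and `missing_variables` are hypothetical helper names.

```python
from string import Formatter


def required_variables(template: str) -> set[str]:
    """Collect the placeholder names that appear in a '{name}' style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def missing_variables(template: str, items: list[dict[str, str]]) -> dict[int, set[str]]:
    """Map each item index to the variables it fails to supply.

    Items that supply everything are omitted, so an empty result means
    the dataset is compatible with the prompt.
    """
    required = required_variables(template)
    problems: dict[int, set[str]] = {}
    for index, item in enumerate(items):
        missing = required.difference(item)
        if missing:
            problems[index] = missing
    return problems
```

For example, checking the template `"Translate {text} to {language}."` against an item that only provides `text` reports `language` as missing for that item.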

Run

The platform executes the configuration over every item, capturing output, latency, and cost, and runs your evaluators to score each result.
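
Conceptually, the run step is a loop over dataset items. The sketch below illustrates that loop under stated assumptions: `call_model` is a hypothetical callable returning an `(output, cost_usd)` pair, evaluators are plain functions of `(output, expected)`, and `fake_model` stands in for a real model so the example runs offline. It is not the platform's implementation.

```python
import time
from typing import Callable, Optional

Evaluator = Callable[[str, Optional[str]], float]


def run_experiment(
    template: str,
    items: list[dict],
    call_model: Callable[[str], tuple[str, float]],
    evaluators: dict[str, Evaluator],
) -> list[dict]:
    """Execute one configuration over every item and score each output."""
    results = []
    for item in items:
        prompt = template.format(**item["variables"])
        started = time.perf_counter()
        output, cost_usd = call_model(prompt)
        latency_ms = (time.perf_counter() - started) * 1000
        # Every evaluator scores the same output, keyed by evaluator name.
        scores = {
            name: evaluate(output, item.get("expected"))
            for name, evaluate in evaluators.items()
        }
        results.append(
            {"output": output, "latency_ms": latency_ms,
             "cost_usd": cost_usd, "scores": scores}
        )
    return results


def fake_model(prompt: str) -> tuple[str, float]:
    """Stand-in model: upper-cases the prompt and reports a fixed cost."""
    return prompt.upper(), 0.001


def exact_match(output: str, expected: Optional[str]) -> float:
    """Score 1.0 when the output equals the expected text, else 0.0."""
    return 1.0 if output == expected else 0.0
```

Passing the model call in as a parameter keeps the loop identical across configurations, so only the prompt or model changes between runs.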

Compare

View runs side by side — aggregate scores, cost, and latency — and drill into item-level results (each links back to its trace for debugging).
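
The aggregate view can be thought of as a per-run summary plus a difference between runs. This is a rough sketch of that arithmetic; the dictionary keys (`score`, `cost_usd`, `latency_ms`) and the function names are assumptions, and the platform may aggregate differently (for example, with medians or percentiles).

```python
from statistics import mean


def summarize(run_items: list[dict]) -> dict[str, float]:
    """Aggregate one run: mean score, total cost, mean latency."""
    return {
        "mean_score": mean(item["score"] for item in run_items),
        "total_cost_usd": sum(item["cost_usd"] for item in run_items),
        "mean_latency_ms": mean(item["latency_ms"] for item in run_items),
    }


def compare(baseline: list[dict], candidate: list[dict]) -> dict[str, float]:
    """Return candidate minus baseline for each aggregate metric."""
    base, cand = summarize(baseline), summarize(candidate)
    return {metric: cand[metric] - base[metric] for metric in base}
```

A positive `mean_score` delta with a positive `total_cost_usd` delta means the candidate is better but more expensive, which is the trade-off the side-by-side view is meant to surface.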

Pair experiments with LLM-as-a-Judge evaluators so every run is scored automatically, giving you an objective quality comparison alongside cost and latency.
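
For intuition, an LLM-as-a-Judge evaluator boils down to asking a second model to grade an output and turning its reply into a number. The sketch below assumes a 1 to 5 rubric and a hypothetical `ask_judge` callable that sends a prompt to the judge model and returns its text reply; the platform's built-in evaluators may use a different rubric and scale.

```python
import re
from typing import Callable

# Hypothetical rubric; real judge prompts are usually more detailed.
JUDGE_TEMPLATE = (
    "Rate the answer from 1 (poor) to 5 (excellent).\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reply with a single digit."
)


def judge_score(question: str, answer: str, ask_judge: Callable[[str], str]) -> float:
    """Ask the judge for a 1-5 rating and normalize it to the range 0.0-1.0."""
    reply = ask_judge(JUDGE_TEMPLATE.format(question=question, answer=answer))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"Judge reply contained no rating: {reply!r}")
    return (int(match.group()) - 1) / 4
```

Raising on an unparseable reply, rather than defaulting to a score, keeps a misbehaving judge from silently skewing the run's aggregate.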

© 2026 ANTS Platform, Inc. · Docs v1.0 · Last updated June 2026