Experiments
An experiment runs a prompt + model configuration against every item in a dataset, records the outputs, and scores them — so you can compare prompt versions or models on consistent inputs before you ship.
Experiments run against a dataset. Create or pick a dataset first, then start a run from it.
Concepts
| Concept | Meaning |
|---|---|
| Experiment / run | One execution of a prompt + model configuration over a dataset, producing one result per item. |
| Baseline vs. comparison | The baseline is the model/config you're testing; comparisons are optional alternative configs run on the same dataset to measure it against. |
| Run item | The result for a single dataset item — output, latency, cost, and scores. |
| Variables | The inputs your prompt requires. Every dataset item must supply all of them; the platform validates this before running. |
Running an experiment
Pick a dataset
Open the dataset you want to benchmark against. Each item provides the inputs (and optionally expected outputs).
Choose prompt and model
Select the prompt version and the model + parameters to test. The platform checks that dataset items provide every variable the prompt needs.
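The variable check described above can be sketched in a few lines. This is only an illustration, not the platform's implementation: it assumes placeholders are written as `{{name}}` and that each dataset item is a plain dict, and the helper names `required_variables` and `find_missing_variables` are hypothetical.

```python
import re

# Assumption: placeholders look like {{name}}; the platform's real syntax may differ.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def required_variables(template: str) -> set[str]:
    """Return the placeholder names found in a prompt template."""
    return set(PLACEHOLDER.findall(template))


def find_missing_variables(template: str, items: list[dict]) -> dict[int, list[str]]:
    """Map item index -> sorted list of variables that item fails to supply."""
    needed = required_variables(template)
    problems: dict[int, list[str]] = {}
    for index, item in enumerate(items):
        missing = sorted(needed - set(item))
        if missing:
            problems[index] = missing
    return problems


if __name__ == "__main__":
    template = "Translate {{text}} into {{ language }}."
    items = [{"text": "hello", "language": "fr"}, {"text": "goodbye"}]
    # Only the second item is reported, because it lacks "language".
    print(find_missing_variables(template, items))
```

Running a check like this locally before starting a run can surface incomplete items early, though the platform's own validation remains the source of truth.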
Run
The platform executes the configuration over every item, capturing output, latency, and cost, and runs your evaluators to score each result.
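Conceptually, the run step is a loop over dataset items. The sketch below is a simplified stand-in rather than the platform's code: `RunItem`, `run_experiment`, and `exact_match` are hypothetical names, the model call is passed in as a plain function that returns `(output, cost)`, and items are assumed to be dicts with `id`, `input`, and an optional `expected` key.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RunItem:
    """Result for one dataset item (field names are illustrative)."""
    item_id: str
    output: str
    latency_s: float
    cost_usd: float
    scores: dict[str, float] = field(default_factory=dict)


def run_experiment(
    items: list[dict],
    generate: Callable[[str], tuple[str, float]],
    evaluators: dict[str, Callable[[dict, str], float]],
) -> list[RunItem]:
    """Call the model on every item, timing each call and scoring each output."""
    results = []
    for item in items:
        started = time.perf_counter()
        output, cost_usd = generate(item["input"])  # hypothetical model call
        latency_s = time.perf_counter() - started
        scores = {name: score(item, output) for name, score in evaluators.items()}
        results.append(RunItem(item["id"], output, latency_s, cost_usd, scores))
    return results


def exact_match(item: dict, output: str) -> float:
    """Toy evaluator: 1.0 when the output equals the expected answer, else 0.0."""
    return 1.0 if output == item.get("expected") else 0.0
```

In practice an LLM-as-a-Judge evaluator would take the place of `exact_match`; the loop keeps the same shape either way.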
Compare
View runs side by side — aggregate scores, cost, and latency — and drill into item-level results (each links back to its trace for debugging).
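To make the side-by-side view concrete, here is a minimal sketch of how aggregate numbers for two runs could be computed and compared. It assumes each run item has been exported as a dict with `score`, `cost_usd`, and `latency_s` keys; those key names and the `summarize` and `compare` helpers are hypothetical, not part of the platform.

```python
from statistics import mean


def summarize(run_items: list[dict]) -> dict[str, float]:
    """Aggregate one run: mean score, total cost, and mean latency."""
    return {
        "mean_score": mean(r["score"] for r in run_items),
        "total_cost_usd": sum(r["cost_usd"] for r in run_items),
        "mean_latency_s": mean(r["latency_s"] for r in run_items),
    }


def compare(baseline: list[dict], candidate: list[dict]) -> dict[str, float]:
    """Return candidate minus baseline for each aggregate metric."""
    base, cand = summarize(baseline), summarize(candidate)
    return {key: cand[key] - base[key] for key in base}
```

A positive `mean_score` delta together with negative cost and latency deltas suggests the candidate is better on all three axes; mixed signs are where item-level drill-down becomes useful.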
Pair experiments with LLM-as-a-Judge evaluators so every run is scored automatically, giving you a consistent, repeatable quality comparison alongside cost and latency.
Next steps
- Datasets — build the test set experiments run against.
- LLM-as-a-Judge — score experiment outputs.
- Prompt Management — version the prompts you're comparing.