Experiments

An experiment runs a prompt + model configuration against every item in a dataset, records the outputs, and scores them — so you can compare prompt versions or models on consistent inputs before you ship.

Note: Experiments run against a dataset. Create or pick a dataset first, then start a run from it.

Concepts

Concept	Meaning
Experiment / run	One execution of a (prompt + model) over a dataset, producing a result per item.
Baseline vs. comparison	The baseline is the primary model/config you're testing; comparisons are optional alternative configurations evaluated against it on the same dataset.
Run item	The result for a single dataset item — output, latency, cost, and scores.
Variables	Dataset items must supply the variables your prompt requires; the platform validates this before running.
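
As a rough illustration of how these concepts relate, the sketch below models a run and its run items as plain Python dataclasses. The class and field names (`ExperimentRun`, `RunItem`, `latency_ms`, `cost_usd`) are hypothetical and chosen for this example only; they are not the platform's schema.

```python
from dataclasses import dataclass, field


@dataclass
class RunItem:
    """Result for a single dataset item (hypothetical shape, not the platform's schema)."""
    item_id: str
    output: str
    latency_ms: float
    cost_usd: float
    scores: dict[str, float] = field(default_factory=dict)


@dataclass
class ExperimentRun:
    """One execution of a (prompt + model) configuration over a dataset."""
    prompt_version: str
    model: str
    items: list[RunItem] = field(default_factory=list)


# Example: a run holding one result per dataset item.
run = ExperimentRun(prompt_version="v2", model="example-model")
run.items.append(
    RunItem("item-1", "Paris", latency_ms=120.0, cost_usd=0.0004,
            scores={"correctness": 1.0})
)
```

Keeping scores as a name-to-value mapping lets each evaluator contribute its own entry without changing the run item's shape.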

Running an experiment

Pick a dataset

Open the dataset you want to benchmark against. Each item provides the inputs (and optionally expected outputs).

Choose prompt and model

Select the prompt version and the model + parameters to test. The platform checks that dataset items provide every variable the prompt needs.
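
The variable check described above can be approximated locally before you start a run. This is a minimal sketch that assumes prompts use `{{variable}}` placeholders and that dataset items are plain dictionaries; both are assumptions for illustration, and the helper names are hypothetical rather than part of the platform.

```python
import re

# Assumed placeholder syntax: {{name}}, with optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def required_variables(template: str) -> set[str]:
    """Return the variable names a prompt template references."""
    return set(PLACEHOLDER.findall(template))


def missing_variables(template: str, items: list[dict]) -> dict[int, set[str]]:
    """Map each item index to the variables it fails to supply."""
    needed = required_variables(template)
    problems = {}
    for index, item in enumerate(items):
        missing = needed - set(item)
        if missing:
            problems[index] = missing
    return problems


template = "Translate {{text}} into {{language}}."
items = [
    {"text": "hello", "language": "French"},
    {"text": "goodbye"},  # no "language" value
]
problems = missing_variables(template, items)
```

Here `problems` reports that the second item lacks `language`, which is the kind of mismatch that is cheaper to catch before any model calls are made.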

Run

The platform executes the configuration over every item, capturing output, latency, and cost, and runs your evaluators to score each result.
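
Conceptually, the run step is a loop over dataset items. The sketch below shows that loop under stated assumptions: `generate` stands in for the model call and returns an output plus a cost, and evaluators are plain functions of `(output, expected)`. None of these names come from the platform; they are placeholders for illustration.

```python
import time


def run_experiment(items, generate, evaluators):
    """Execute `generate` on every item and score each output.

    `generate` and `evaluators` are hypothetical stand-ins for the
    model call and the scoring functions.
    """
    results = []
    for item in items:
        start = time.perf_counter()
        output, cost_usd = generate(item["input"])
        latency_ms = (time.perf_counter() - start) * 1000
        scores = {
            name: evaluate(output, item.get("expected"))
            for name, evaluate in evaluators.items()
        }
        results.append({
            "item_id": item["id"],
            "output": output,
            "latency_ms": latency_ms,
            "cost_usd": cost_usd,
            "scores": scores,
        })
    return results


def fake_generate(text):
    """Stub model: upper-cases the input and reports a fixed cost."""
    return text.upper(), 0.001


def exact_match(output, expected):
    """Score 1.0 when the output equals the expected value, else 0.0."""
    return 1.0 if output == expected else 0.0


items = [
    {"id": "a", "input": "hi", "expected": "HI"},
    {"id": "b", "input": "yo", "expected": "no"},
]
results = run_experiment(items, fake_generate, {"exact_match": exact_match})
```

Because every item goes through the same loop with the same evaluators, results from different configurations stay comparable item by item.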

Compare

View runs side by side — aggregate scores, cost, and latency — and drill into item-level results (each links back to its trace for debugging).
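
If you export item-level results, the same side-by-side comparison can be reproduced offline. The following sketch assumes each run item is a dictionary with `scores`, `cost_usd`, and `latency_ms` keys; that layout and the `summarize` helper are hypothetical, not an export format the platform defines.

```python
from statistics import mean


def summarize(run_items):
    """Aggregate one run: mean of each score, total cost, mean latency."""
    score_names = sorted({name for item in run_items for name in item["scores"]})
    return {
        "mean_scores": {
            name: mean(i["scores"][name] for i in run_items if name in i["scores"])
            for name in score_names
        },
        "total_cost_usd": sum(i["cost_usd"] for i in run_items),
        "mean_latency_ms": mean(i["latency_ms"] for i in run_items),
    }


# Made-up numbers for two runs over the same two dataset items.
baseline = [
    {"scores": {"quality": 0.5}, "cost_usd": 0.002, "latency_ms": 100.0},
    {"scores": {"quality": 1.0}, "cost_usd": 0.002, "latency_ms": 300.0},
]
candidate = [
    {"scores": {"quality": 1.0}, "cost_usd": 0.001, "latency_ms": 200.0},
    {"scores": {"quality": 1.0}, "cost_usd": 0.001, "latency_ms": 400.0},
]

base_summary = summarize(baseline)
cand_summary = summarize(candidate)
quality_delta = (
    cand_summary["mean_scores"]["quality"] - base_summary["mean_scores"]["quality"]
)
```

In this made-up data the candidate scores higher and costs less but is slower on average, which is the kind of trade-off the aggregate view is meant to surface.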

Note: Pair experiments with LLM-as-a-Judge evaluators so every run is scored automatically, giving you a consistent quality comparison alongside cost and latency.

Next steps

Datasets — build the test set experiments run against.
LLM-as-a-Judge — score experiment outputs.
Prompt Management — version the prompts you're comparing.