For AI Labs

Long-horizon data for model evaluation

Long-horizon, verifiable workflows from frontier experts. Refined and packaged for model evaluation.

What you receive

Three outputs from one refinery, continuously refreshed.

Dynamic evaluation suites built from authentic expert workflows. Cross-model, continuously refreshed, grounded in real task outcomes.

Outcome-verified, cross-model traces packaged for evaluation. Every trace records the expert’s intent and a verified result.

Ground-truth labels based on task success and expert judgement. Derived from real workflows.

Four properties that define long-horizon data and shape every delivery.

Real expert workflows, captured at the source. They include the failure modes, edge cases, and long-tail problems that drive frontier capability gains.

Code runs or it doesn’t. Experiments replicate or they fail. Every trace carries a ground-truth outcome.

Research is iterative. Weeks of method design, experiments, and rework go into every result. We capture the whole arc.

New traces flow through the pipeline as researchers work. Datasets and benchmarks update with each cycle.

Tell us what you’re evaluating and we’ll take it from there.