For AI Labs
Long-horizon data for model evaluation
Long-horizon, verifiable workflows from frontier experts, refined and packaged for model evaluation.
What you receive
Three outputs from one refinery, continuously refreshed.
Benchmarks
Dynamic suites built from authentic expert workflows. Cross-model, continuously refreshed, grounded in real task outcomes.
Evaluation Datasets
Outcome-verified, cross-model traces packaged for evaluation. Every trace carries intent and a verified result.
Verified Outcomes
Ground-truth labels based on task success and expert judgement. Derived from real workflows.
How it’s built
Four properties that define long-horizon data and shape every delivery.
Authentic by construction
Real expert workflows, captured at the source, including the failure modes, edge cases, and long-tail problems that drive frontier capability gains.
Verifiable outcomes
Code runs or it doesn't. Experiments replicate or they fail. Every trace carries a ground-truth outcome.
Long-horizon by default
Research is iterative. Weeks of method design, experiments, and rework go into every result. We capture the whole arc.
Continuously refreshed
New traces flow through the pipeline as researchers work. Datasets and benchmarks update with each cycle.