For AI Labs

Long-horizon data for model evaluation

Long-horizon, verifiable workflows from frontier experts. Refined and packaged for model evaluation.

What you receive

Three outputs from one refinery, continuously refreshed.

Benchmarks

Dynamic suites built from authentic expert workflows. Cross-model, continuously refreshed, grounded in real task outcomes.

Evaluation Datasets

Outcome-verified, cross-model traces packaged for evaluation. Every trace carries intent and a verified result.

Verified Outcomes

Ground-truth labels based on task success and expert judgement. Derived from real workflows.

How it’s built

Four properties that define long-horizon data and shape every delivery.

Authentic by construction

Real expert workflows, captured at the source. That includes the failure modes, edge cases, and long-tail problems that drive frontier capability gains.

Verifiable outcomes

Code runs or it doesn't. Experiments replicate or they fail. Every trace carries a ground-truth outcome.

Long-horizon by default

Research is iterative. Weeks of method design, experiments, and rework go into every result. We capture the whole arc.

Continuously refreshed

New traces flow through the pipeline as researchers work. Datasets and benchmarks update with each cycle.

Let’s talk

Tell us what you’re evaluating and we’ll take it from there.

contact@openrefinery.ai