

Built for model evaluation
The long-horizon data layer for frontier AI
Real, long-horizon workflows from expert researchers. Refined into evaluations and benchmarks for frontier AI.
What is long-horizon data?
Real-world, long-horizon workflows captured at the source. Refined into something a model can actually learn from.
Captured directly from authentic, expert work. Each trace carries intent, outcome, and the rework that produced it.
A long-horizon research arc
1. Question
2. Literature review
3. Method design
4. Experiments
5. Iteration
6. Result
Research is iterative. Weeks of method design, experiments, and rework before a result lands.
How it compares
Four properties together
Real, signal-rich, verifiable, and reproducible. All at once.
| Category | Real | Signal-Rich | Verifiable | Reproducible |
|---|---|---|---|---|
| Human Annotation | ❌ | ❌ | ✅ | ✅ |
| RL Environments | ❌ | ✅ | ✅ | ✅ |
| Raw Capture Tools | ✅ | ❌ | ❌ | ❌ |
| Internal Capture | ✅ | ✅ | ❌ | ❌ |
How long-horizon data is refined
- Sources: Data Sources. Real workflows · Real outcomes
- Process: Refinery
- Destination: Frontier AI Labs. Benchmarks · Datasets · Outcomes

With consent · PII redacted
Where do you fit?
Two ways to engage. Pick the side of the pipeline that matches your work.
For AI Labs
Long-horizon data for model evaluation
Long-horizon, verifiable workflows from frontier experts. Refined for model evaluation.
For Researchers
Working with researchers and technical experts
If your daily work produces long-horizon, verifiable workflows in a technical domain, we'd like to hear from you.
Let’s talk
Whether you build frontier models or work at the frontier of your field, we’d like to hear from you.
contact@openrefinery.ai