AIModels.fyi

Can AI *really* research like us? This new framework puts it to the test.

DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation

aimodels-fyi
Jan 16, 2026

We’ve built AI systems that can spend hours hunting across the web, synthesizing information, and writing research reports. But we have almost no way to tell if they’re actually good at this task.

The problem runs deeper than it first appears. Traditional benchmarks work fine for closed-form questions with single correct answers. Feed a system a math problem, check if it matches the known solution, move on. But research is different. There are many valid approaches to answering a question about renewable energy policy, and multiple correct answers depending on what sources you integrate and how you weight them. A static answer key doesn’t capture this nuance.

There’s a worse problem hiding underneath: static ground truth becomes obsolete. If your benchmark was created last year and a system is researching current events, comparing it to pre-written answers makes no sense. The world has moved on.

Current benchmarks also impose a heavy cost. Creating reliable research tasks requires human annotation at scale, which is expensive and slow. Existing approaches either demand painstaking effort to construct each task, assume evaluation criteria are universal (they're not: a business analyst needs different things than a historian), or fail completely when systems cite sources that don't exist or skip citations altogether.

DeepResearchEval addresses this by automating both the creation of realistic research challenges and the evaluation of how well systems handle them. The insight that ties everything together: you can’t fairly evaluate research systems without task-specific evaluation criteria, and you can’t verify factual claims without an evaluator that actively hunts for evidence rather than checking a static answer key.
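To make those two ideas concrete, here's a minimal sketch of what task-specific scoring plus evidence-seeking verification could look like. This is not the paper's actual implementation; the names (`Criterion`, `score_report`, `verify_citation`) and the pass-in-a-fetcher design are my assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Criterion:
    """One task-specific evaluation dimension with its own weight."""
    name: str
    weight: float


def score_report(report: str, criteria: list[Criterion],
                 judge: Callable[[str, Criterion], float]) -> float:
    """Weighted score over per-task criteria, not a universal rubric.

    `judge` is any scorer (e.g. an LLM judge) returning a value in [0, 1]
    for one criterion applied to the report.
    """
    total_weight = sum(c.weight for c in criteria)
    return sum(c.weight * judge(report, c) for c in criteria) / total_weight


def verify_citation(claim: str, url: str,
                    fetch: Callable[[str], Optional[str]]) -> bool:
    """Actively hunt for evidence instead of checking a static answer key.

    The evaluator fetches the cited source itself; a dead link or a page
    that doesn't support the claim fails verification.
    """
    page = fetch(url)
    if page is None:  # hallucinated or unreachable source
        return False
    return claim.lower() in page.lower()  # crude support check, stand-in
```

In a real system the `judge` would be a model prompted with the criterion, and the support check would be semantic rather than a substring match; the point is only the shape, that criteria travel with the task and the verifier goes out and looks.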

What makes a real research task

Before designing a solution, it helps to think about how real research actually works. A person doesn't start with a random question. They first think about who they are, what they're trying to accomplish, and why it matters. A journalist investigating corporate fraud needs different information than a grad student studying historical trade patterns. Their research process, their information needs, and what constitutes a good answer all flow from their identity and stakes.
