OpenAI's 'Evals Crisis': Why Your AI Benchmarks Are Broken

Key Takeaways

AI models are now so advanced they've saturated traditional benchmarks like the SAT, creating an "evals crisis" at OpenAI, according to Chief Research Officer Mark Chen.
"Benchmaxing" means optimizing a model for specific evaluation distributions. It hurts true generalization and produces systems that perform well on tests but fail in the real world.
OpenAI tackles this by constantly creating new evaluations, using a representative mix of tests, and partnering with external organizations for objective "gold standards."
Critically, OpenAI separates the teams responsible for creating evaluations from those optimizing models. The aim is to keep incentives aligned with true progress rather than with gaming the current metrics.

The AI 'Evals Crisis' and the Trap of 'Benchmaxing'

For most builders, benchmarks are a guiding light. Beat the target, and you're winning. But what happens when your AI is so good it can ace every human-designed test? In conversation with Alessio Fanelli, OpenAI's Chief Research Officer Mark Chen explained that this isn't a hypothetical problem; it's a real "evals crisis" inside OpenAI's labs. "It gets to a point where it's so good at things that even the top 0.01% of humans can do," Fanelli observed, wondering how to push past that frontier. Chen's answer was stark: "all the really great evals that we all know like growing up like taking the SAT or those those are all fully saturated." Once a model scores at or near the ceiling of a test, that test can no longer measure further progress.

This saturation problem leads directly to a dangerous habit Chen calls "benchmaxing." He warns against it: "I think you can kind of overfit onto certain distributions. Um and it won't be reflective how you how well you generalize, right?" When you relentlessly optimize your AI (or even your team) to hit specific, static metrics, you risk creating a system that's a master of the test but a failure everywhere else. It's like training a runner only on a flat track; they'll shatter records there, but stumble on a hill.
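One way to make the overfitting risk concrete is to compare a model's score on the benchmark it was tuned against with its score on freshly written items it has never seen. The sketch below is only an illustration: the function names and the pass/fail data are invented here, and nothing in it reflects how OpenAI actually measures this.

```python
from statistics import mean


def accuracy(results: list[bool]) -> float:
    """Fraction of items answered correctly."""
    return mean(1.0 if passed else 0.0 for passed in results)


def generalization_gap(tuned_on: list[bool], fresh: list[bool]) -> float:
    """Score on the benchmark being optimized minus score on unseen items.

    A large positive gap suggests the model has fit the benchmark's
    distribution rather than the underlying skill.
    """
    return accuracy(tuned_on) - accuracy(fresh)


# Hypothetical per-item results, made up for illustration only.
public_benchmark = [True] * 19 + [False] * 1   # the test the team optimized for
fresh_items = [True] * 13 + [False] * 7        # newly written, never-seen items

gap = generalization_gap(public_benchmark, fresh_items)
print(f"benchmark accuracy: {accuracy(public_benchmark):.2f}")
print(f"fresh accuracy:     {accuracy(fresh_items):.2f}")
print(f"generalization gap: {gap:.2f}")
```

A gap near zero does not prove the model generalizes, but a large one is a cheap early warning that the headline number is flattering.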

OpenAI's Playbook: Separation, Mixtures, and Fresh Data

So, how does OpenAI navigate this minefield? Their strategy isn't about finding one perfect test, but about building an evaluation system designed for constant change and objectivity. Chen laid out their multi-pronged approach:

First, they rely on "representative mixtures of evals." No single benchmark tells the whole story, so they combine many different kinds of tests to get a broader picture of a model's capabilities. Second, they invest heavily in continually creating new evaluations. This is a core philosophy: "once an eval is out in the world, then it's it's just already not a good eval." The moment a test becomes public or widely known, it starts getting gamed, becoming less useful for measuring true generalization.
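The first two ideas can be sketched in a few lines: score a model on a weighted mixture of evals, leave saturated evals out of the headline number, and flag any eval that is public or saturated as due for replacement. Everything here is an assumption made for illustration, including the EvalResult fields, the 0.95 saturation cutoff, the eval names, and the scores. Chen did not describe an implementation.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    name: str
    score: float   # fraction correct, from 0.0 to 1.0
    weight: float  # relative importance within the mixture
    public: bool   # True once the eval is out in the world


# Assumed cutoff: at or above this score an eval tells us little more.
SATURATION_THRESHOLD = 0.95


def is_saturated(result: EvalResult) -> bool:
    return result.score >= SATURATION_THRESHOLD


def mixture_score(results: list[EvalResult]) -> float:
    """Weighted average over the evals that still have headroom."""
    informative = [r for r in results if not is_saturated(r)]
    total_weight = sum(r.weight for r in informative)
    if total_weight == 0:
        raise ValueError("every eval is saturated; the mixture needs new evals")
    return sum(r.score * r.weight for r in informative) / total_weight


def needs_refresh(results: list[EvalResult]) -> list[str]:
    """Names of evals that are public or saturated and should be replaced."""
    return [r.name for r in results if r.public or is_saturated(r)]


# Hypothetical results, invented for illustration only.
results = [
    EvalResult("sat_style_exam", score=0.99, weight=1.0, public=True),
    EvalResult("private_coding_tasks", score=0.60, weight=2.0, public=False),
    EvalResult("partner_gold_standard", score=0.50, weight=1.0, public=False),
    EvalResult("public_math_set", score=0.80, weight=1.0, public=True),
]

print(f"mixture score: {mixture_score(results):.3f}")
print("refresh queue:", needs_refresh(results))
```

Excluding saturated evals from the average is one design choice among several. Down-weighting them, or tracking them separately as regression checks, would be equally reasonable.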

Third, they actively partner with external organizations. These partners help create "gold standard" evaluations, adding an impartial layer of assessment. Finally, and perhaps most critically, OpenAI rigorously separates responsibilities. "I think there's a kind of interesting philosophy of separate the teams that are creating the evals from the teams that are optimizing the models themselves," Chen explained. This guards against a perverse incentive: a team that optimizes its model against evals it designed itself can drift into self-deception instead of genuine advancement.

What to Do With This

Review your core product or team performance metrics this week. Designate one person (or a small team) solely responsible for defining and refreshing success metrics, explicitly separate from those optimizing the product or process against them. Then, brainstorm 3-5 entirely new, unconventional ways to measure user value or team performance that aren't currently being tracked and can't be easily gamed by existing incentives.