Key Takeaways
- Current AI model benchmarks, usually presented as a "grid" of single-point scores, fail to account for the amount of compute spent during evaluation, masking true model capabilities.
- According to OpenAI's Noam Brown, modern models like GPT-5.5 can achieve significantly higher performance by "thinking" for extended periods (weeks or even months), a crucial factor that standard evaluations ignore.
- Because a model's measured 'capability' now depends on its test-time compute budget, this oversight fuels skepticism about genuine model improvements and undermines responsible scaling efforts.
- Proper AI evaluation requires plotting model performance as a function of test-time compute (e.g., tokens, cost, time) to reveal a more complete and accurate picture of its potential.
The Benchmark Deception
If you're building with AI, you've seen the grids. The clean rows of numbers, confidently declaring one model superior to another. But what if those numbers are lying to you? OpenAI research scientist Noam Brown believes they are. He argues that traditional AI benchmarks misleadingly represent model capabilities by completely ignoring a critical factor: test-time compute. “The current frameworks and responsible scaling policies, they don't really account for the amount of test-time compute,” Brown explains. “They just say, 'Okay, well, what's the capability of the model?' The problem is we're in a world now where the capability of the model is a function of how much money you put into it.”
This isn't about training costs; it's about the "thinking time" a model gets during evaluation. Imagine a human taking a test. If they get an hour, they score X. If they get a week, they score Y, which is likely much higher. Current AI benchmarks are like only reporting X, even when the model could easily hit Y with more effort.
The Longer Game: Latent Capabilities Unveiled
The real tension comes from the new generation of AI models. Brown points out that models like GPT-5.5 do not have a single fixed performance level. Given more time to process, explore, and refine their outputs (what he calls “thinking time”), their performance continues to climb. “What we're seeing today with the modern models is that 5.5 and other models can think for, if you scaffold them reasonably well, can think for weeks even… before having performance plateau on some of these benchmarks,” Brown says. This means a model that looks merely 'good' on a standard benchmark might actually possess significant latent capabilities that only a larger compute budget would reveal.
The compute behind a published score isn't always reported, leading to a distorted view of progress. Sarah Guo, co-host of the podcast, notes the industry is stuck in a “bad equilibrium” where everyone knows the benchmarks are flawed but no one wants to break rank. This cycle of publishing incomplete data hinders a clear understanding of what models can truly do and, by extension, our ability to scale them responsibly.
Demanding a Clearer Picture
Brown advocates for a straightforward fix: plot performance against test-time compute. This isn't just about fairness; it's about accuracy. Knowing that a model achieves a certain score after 100,000 tokens versus 10 million tokens completely changes its perceived value and safety implications. “My claim is the proper way to evaluate the models now is you either have some kind of budget for the benchmark whether it's tokens or cost or time or whatever, or you plot the performance as a function of the amount of test-time compute that's going into the model,” Brown states. This method would force transparency and allow founders and researchers to make informed decisions based on a model's true potential, not just an arbitrary snapshot.
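To make the idea of "performance as a function of test-time compute" concrete, here is a minimal Python sketch. It is not Brown's or OpenAI's method. It assumes a simplified setup in which each benchmark task records how many tokens the model spent before reaching a correct answer, and a task counts as solved under a budget only if that number fits within it. The names `TaskResult`, `score_at_budget`, and `performance_curve` are hypothetical, and the demo numbers are invented.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class TaskResult:
    """One benchmark task (hypothetical record layout, for illustration only)."""
    task_id: str
    # Tokens spent before the model produced a correct answer,
    # or None if it never solved the task in the longest run.
    tokens_to_solve: Optional[int]


def score_at_budget(results: Sequence[TaskResult], budget: int) -> float:
    """Fraction of tasks solved using at most `budget` tokens per task."""
    if not results:
        raise ValueError("need at least one task result")
    solved = sum(
        1
        for r in results
        if r.tokens_to_solve is not None and r.tokens_to_solve <= budget
    )
    return solved / len(results)


def performance_curve(
    results: Sequence[TaskResult], budgets: Sequence[int]
) -> List[Tuple[int, float]]:
    """Return (budget, score) pairs sorted by budget, ready to plot."""
    return [(b, score_at_budget(results, b)) for b in sorted(budgets)]


if __name__ == "__main__":
    # Made-up data, not measurements from any real model.
    demo = [
        TaskResult("t1", 50_000),
        TaskResult("t2", 400_000),
        TaskResult("t3", 8_000_000),
        TaskResult("t4", None),
    ]
    for budget, score in performance_curve(demo, [100_000, 1_000_000, 10_000_000]):
        print(f"{budget:>12,} tokens -> {score:.0%}")
```

Reporting the whole list of (budget, score) pairs, rather than one of them, is the point: the same model yields a different headline number at each budget. The same structure works if the budget is measured in dollars or wall-clock time instead of tokens.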
What to Do With This
Next time you evaluate an AI model, whether a foundation model or one of your own builds, don't just ask for a single performance score. Insist on seeing performance plotted against a test-time compute budget. If that data isn't available, treat the reported benchmark as a lower bound. Assume the model likely has significant untapped capabilities that could be unlocked with more "thinking" time. This shift will help you accurately assess value, optimize costs, and build more robust AI products.
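One way to put this advice into practice is to record the budget next to every score and refuse to rank two models whose budgets differ or are missing. The Python sketch below is only an illustration under that assumption. `ReportedScore`, `interpret`, `comparable`, and the 10% tolerance are all hypothetical choices, not part of any benchmark's tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReportedScore:
    """A published benchmark number (hypothetical record, for illustration)."""
    model: str
    score: float                  # benchmark score in [0, 1]
    budget_tokens: Optional[int]  # None when the report omits the budget


def interpret(r: ReportedScore) -> str:
    """Describe a score, flagging it as a lower bound if no budget is given."""
    if r.budget_tokens is None:
        return (
            f"{r.model}: at least {r.score:.0%} "
            "(budget not reported, treat as a lower bound)"
        )
    return f"{r.model}: {r.score:.0%} at {r.budget_tokens:,} tokens per task"


def comparable(a: ReportedScore, b: ReportedScore, tolerance: float = 0.1) -> bool:
    """True only if both budgets are known and differ by at most `tolerance`
    (relative to the larger budget). The 10% default is an arbitrary choice."""
    if a.budget_tokens is None or b.budget_tokens is None:
        return False
    larger = max(a.budget_tokens, b.budget_tokens)
    return abs(a.budget_tokens - b.budget_tokens) <= tolerance * larger


if __name__ == "__main__":
    # Invented numbers, echoing the 100,000 vs 10 million token example above.
    small = ReportedScore("model-a", 0.60, 100_000)
    large = ReportedScore("model-b", 0.70, 10_000_000)
    unknown = ReportedScore("model-c", 0.65, None)
    print(interpret(unknown))
    print("a vs b comparable:", comparable(small, large))
```

With a check like this in an evaluation pipeline, a 70% score at 10 million tokens is never silently ranked above a 60% score at 100,000 tokens.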