No Priors

Why Traditional Benchmarks Fail Modern AI Models with OpenAI Research Scientist Noam Brown

With Sarah Guo, Noam Brown · Sunday, June 28, 2026

This episode features OpenAI research scientist Noam Brown discussing the shortcomings of current AI model evaluation benchmarks, particularly their failure to account for large-scale test-time compute. He explains how this oversight impacts the assessment of model capabilities and has significant implications for AI safety and responsible scaling policies. Brown also shares insights into the true nature of recursive self-improvement and the potential of latent capabilities in current models.


AI Benchmarks Lie: Compute, Not Model, Drives Results

OpenAI's Noam Brown argues that traditional AI benchmarks mislead by ignoring test-time compute. Learn why this oversight distorts assessments of model capabilities and how to evaluate models more carefully.


Your LLM Is a 100x Optimizer, Not a Novelist

OpenAI's Noam Brown used GPT-5.5 to build poker bots 100x faster, showing LLMs excel at optimizing existing solutions, not inventing novel ones.


Stop Waiting for GPT-X: Your Current AI Has Hidden Power

OpenAI's Noam Brown explains why founders overlook the large latent capabilities of models like GPT-5.5, and why they should unlock them now rather than wait for the next release.


No Overnight Explosion: AI's Recursive Self-Improvement is Bottlenecked by Time

OpenAI's Noam Brown reveals AI's 'fast takeoff' is a myth. Current models are bottlenecked by test-time compute, making progress gradual.
