AI Benchmarks Lie: Compute, Not Model, Drives Results
OpenAI's Noam Brown argues that traditional AI benchmarks mislead by ignoring test-time compute. Learn why this oversight distorts assessments of model capabilities and how to evaluate models more carefully.
This episode features OpenAI research scientist Noam Brown discussing the shortcomings of current AI model evaluation benchmarks, particularly their failure to account for large-scale test-time compute. He explains how this oversight impacts the assessment of model capabilities and has significant implications for AI safety and responsible scaling policies. Brown also shares insights into the true nature of recursive self-improvement and the potential of latent capabilities in current models.
OpenAI's Noam Brown used GPT-5.5 to build poker bots 100x faster, showing LLMs excel at optimizing existing solutions, not inventing novel ones.
OpenAI's Noam Brown reveals why founders overlook massive latent AI capabilities in models like GPT-5.5, and why they should unlock them now rather than wait.
OpenAI's Noam Brown reveals AI's 'fast takeoff' is a myth. Current models are bottlenecked by test-time compute, making progress gradual.