Key Takeaways

  • Dwarkesh Patel questions whether training AIs with RL in verifiable environments, usually called reinforcement learning with verifiable rewards (RLVR), can truly generalize beyond simple tasks to complex, real-world problems like building a business or navigating social situations.
  • Dario Amodei, cited by Patel, notes that models trained at short context lengths can degrade when served at longer ones, limiting how much 'on-the-job' learning they can hold over extended interactions.
  • A critical inefficiency plagues current AI: valuable knowledge gained 'in context' during deployment is ephemeral. If it can't be consolidated back into the model's core weights, it's wasted.
  • An estimated 30% to 50% of a leading AI lab's compute budget goes to inference. This massive expenditure currently doesn't improve the model's underlying capabilities, despite deployment offering the most valuable learning opportunities.
  • To achieve true continual learning, AIs must update their weights based on new experience, as humans do, rather than relying on an unscalable, ever-growing KV cache.

The Billion-Dollar Blind Spot in AI Training

Imagine spending billions, potentially trillions, on an AI that can ace simulations but trips over real-world problems. That's the core critique Dwarkesh Patel raises about the current AI training paradigm, particularly the "big research bet" on scaling reinforcement learning with verifiable rewards (RLVR), which trains models in environments where success can be checked automatically. Patel argues this approach, while effective in controlled settings, might not generalize to the messy, long-horizon tasks of actual business or social interaction. He frames this as an open empirical question for the industry:

“Now, whether RLVR can generalize this well is an empirical question. If the labs went from spending billions of dollars on RL environments to a trillion dollars, would you get the kind of thing that is a fully human-like general intelligence within the context window?”

The problem isn't just about training environments. It's about how much an AI can truly learn and retain over time. Patel references Dario, who pointed out a fundamental limitation: “When he was explaining why model performance tends to degrade at long context, he said: 'There's two things. There's the context length you train at, and there's a context length that you serve at. If you train at a small context length and then try to serve at a long context length, maybe you get these degradations.'” So even if an AI starts to grasp a complex scenario during a lengthy user interaction, that grasp can degrade once the conversation outruns the context lengths the model was trained on, and none of it carries over to the next conversation, let alone to a new challenge. It’s like an employee who learns critical lessons on a project, then forgets them entirely by Monday.

Why Your AI Forgets Everything It Just Learned

The real punch to the gut for ambitious founders is the sheer waste. What if your most valuable AI insights are generated, used once, and then vanish into the digital ether? Patel calls this "ephemeral learning." An AI might, after enough in-context experience, start to act like a strategic genius, but that brilliance is fleeting if it can't be permanently recorded. "And even if, after enough in-context experience, the AIs could become like Henry Ford or Albert Einstein, all that would be ephemeral and wasted if you couldn't get those learnings back into the weights."

This isn't just an academic problem; it's a massive drain on resources. Patel estimates that “Around 30 to 50 percent of a lab's compute goes to inference, and that compute is currently not playing any productive role in helping improve the model. This seems like a huge waste.” Think about that: up to half of a lab's compute is effectively a one-way street, delivering outputs without accumulating internal intelligence. Deployment in the real world offers the richest learning opportunities, but current systems mostly let that data pass by without consolidating it.

For AIs to truly improve and adapt like humans, they need to bake their experiences into their core structure. This requires "continual learning" that updates the model's weights. As Patel explains, "AIs can't just keep building up a bigger and bigger KV cache as they learn from more and more users. That's just not scalable, and that's also not how humans do it." We don't just add to our short-term memory; we integrate new insights into our long-term understanding.
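To make the scaling point concrete, here is a rough back-of-the-envelope sketch of how KV cache memory grows with context length. The model dimensions below (32 layers, 8 key/value heads, head dimension 128, 16-bit values) are illustrative assumptions, not figures from Patel or any specific lab, and the formula assumes plain attention with no cache compression or eviction.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Approximate bytes needed to cache keys and values for one sequence.

    All default dimensions are hypothetical, chosen only for illustration.
    """
    # Factor of 2: each layer stores one key tensor and one value tensor.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    # The cache grows linearly with the number of tokens kept in context.
    return per_token * seq_len


if __name__ == "__main__":
    for tokens in (8_000, 128_000, 1_000_000):
        gib = kv_cache_bytes(tokens) / 2**30
        print(f"{tokens:>9,} tokens -> {gib:8.2f} GiB of KV cache")
```

Under these assumed dimensions the cache costs 128 KiB per token, so a single one-million-token context needs roughly 122 GiB, and that cost is paid per conversation. The weights, by contrast, stay the same size no matter how much experience is folded into them.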

What to Do With This

If you're building an AI product, stop treating inference as a one-off transaction. Design explicit feedback loops and data pipelines that consolidate user interactions and specific task learnings back into your model's training data. This week, review your data strategy: are you capturing the why and how of every successful AI interaction, turning ephemeral wins into permanent model improvements?
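As one possible starting point, the sketch below shows a minimal feedback loop: record each interaction with an outcome flag, then append the successful ones to a JSONL file that a later fine-tuning job could consume. The `Interaction` fields, the `append_training_examples` helper, and the file layout are all hypothetical names chosen for illustration; a real pipeline would also need consent handling, deduplication, and quality review.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class Interaction:
    """One logged exchange (hypothetical schema for illustration)."""
    prompt: str
    response: str
    succeeded: bool   # did the user accept or confirm the result?
    notes: str = ""   # the "why and how" behind the outcome


def append_training_examples(interactions, path):
    """Append successful interactions to a JSONL file; return the count written."""
    written = 0
    with Path(path).open("a", encoding="utf-8") as f:
        for item in interactions:
            if not item.succeeded:
                continue  # keep only interactions marked as wins
            f.write(json.dumps(asdict(item)) + "\n")
            written += 1
    return written


if __name__ == "__main__":
    log = [
        Interaction("Summarize Q3 churn", "Churn rose in one segment.", True, "user approved"),
        Interaction("Draft refund email", "Dear customer, ...", False, "tone rejected"),
    ]
    print(append_training_examples(log, "training_candidates.jsonl"))
```

Filtering on an explicit success signal is a deliberately simple choice here; the failed interactions, paired with corrections, may be just as valuable as training data.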