Key Takeaways

  • Satya Nadella believes that "private eval," your company's unique, proprietary evaluation mechanism for AI models, is emerging as the most significant form of intellectual property, far outweighing the general models themselves.
  • Publicly available evaluation benchmarks are becoming less useful because they can all be maxed out, so a high score no longer reliably indicates true model performance or specialization.
  • Building "clean lineage" AI models is critical. This means starting pre-training with high-quality data and meticulously filtering out irrelevant or low-quality data (what Nadella calls "ablating" it out) to ensure a solid foundation.
  • Founders should focus on creating a "hill climbing scaffold" around generalist models, allowing them to specialize and continuously improve their agents using unique internal data traces and private evaluation.
  • The ultimate test of control over your AI capabilities is "model portability": the ability to switch underlying foundation models (e.g., from model A to model B) while still improving performance on your private eval.

The Real IP: Your Private Evaluation

For ambitious founders, the mental model of AI intellectual property is shifting. Microsoft CEO Satya Nadella argues that true control and competitive advantage won't come from owning the largest foundation models, but from something far more specific: your proprietary evaluation systems. He calls it "private eval," and says it may be "the biggest IP."

Public benchmarks, Nadella points out, are rapidly losing their edge. “We know all the eval out there are good, interesting, but they're not really that critical at this point because they all can be maxed,” he explains. If you're building a product and relying solely on public evals to track progress, you're missing the point. The real game is how you internally measure, refine, and specialize a model for your use case using your own data traces.

Think of it as building a "hill climbing scaffold" around a generalist model. This scaffold isn't just about fine-tuning; it's about continuously collecting interaction data, learning from it, and evaluating improvements with a mechanism that only you possess. This private feedback loop is what allows you to relentlessly improve specialized agents, turning a generic capability into a unique product.
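To make the loop concrete, here is a minimal sketch of one hill-climbing step, assuming a private eval that is simply exact-match accuracy over a set of collected traces. Every name here (Trace, private_eval, hill_climb, the toy agents and sample traces) is hypothetical and invented for illustration; nothing in the talk specifies an implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Trace:
    """One recorded interaction: what the user asked and the answer you wanted."""
    prompt: str
    expected: str


def private_eval(agent: Callable[[str], str], traces: List[Trace]) -> float:
    """Return the fraction of traces the agent answers correctly (case-insensitive)."""
    if not traces:
        return 0.0
    hits = sum(
        agent(t.prompt).strip().lower() == t.expected.strip().lower()
        for t in traces
    )
    return hits / len(traces)


def hill_climb(
    candidates: List[Tuple[str, Callable[[str], str]]],
    traces: List[Trace],
) -> Tuple[str, float, List[Tuple[str, float]]]:
    """Score each candidate in order; adopt one only if it beats the current best."""
    best_name, best_score = "", -1.0
    history = []
    for name, agent in candidates:
        score = private_eval(agent, traces)
        history.append((name, score))
        if score > best_score:  # strictly better, otherwise keep what we have
            best_name, best_score = name, score
    return best_name, best_score, history


# Hypothetical traces and agents, purely for illustration.
TRACES = [
    Trace("refund policy?", "30 days"),
    Trace("shipping time?", "5 days"),
    Trace("support email?", "help@example.com"),
]


def baseline(prompt: str) -> str:
    return "30 days"  # a generic agent that gives the same answer to everything


def tuned(prompt: str) -> str:
    answers = {"refund policy?": "30 days", "shipping time?": "5 days"}
    return answers.get(prompt, "unknown")


best_name, best_score, history = hill_climb(
    [("baseline", baseline), ("tuned", tuned)], TRACES
)
```

In practice the traces would come from real user interactions and the scoring function would be whatever metric reflects your product's goals; the point of the sketch is only that a change is adopted when it moves the private score up.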

Why Clean Lineage is Your Foundation

Before you can effectively climb that hill, you need a solid base. Nadella stresses the absolute necessity of building AI models with a "clean lineage." This isn't just about good data; it's about obsessive quality control from the very start. He describes it as "starting with pre-training with very good data quality, doing all the ablations, making sure, because in some sense it's become even harder to build a clean lineage model just because there's so much stuff out there that you truly need to ablate out to be able to have a fantastic pre-trained model."

In an age flooded with vast, often messy, datasets, the ability to curate, clean, and meticulously remove noise (what he calls "ablation") becomes a superpower. This foundational work allows companies to "pursue finding that cognitive core" of their AI, preventing common pitfalls and ensuring a reliable base for specialization. Without a clean lineage, any private eval will be built on shaky ground, making true, consistent improvement nearly impossible. It means being ruthless about data provenance and quality, even when the volume of available data feels overwhelming.

What to Do With This

Stop chasing the latest general model. Instead, start designing and implementing your own proprietary evaluation datasets and metrics that directly reflect your product's specific goals and user interactions. This week, task your engineering team with documenting how you'd collect unique "traces" (user interactions, feedback loops, edge cases) and build a private evaluation system around one critical agent in your product. Crucially, apply Nadella's test: "You have an eval that's private, you're using [...] model A, can you switch it to model B and you climb up. If you can, then you're in control. If you can't, you're not in control." Begin actively testing the portability of your agent across different foundation models using your private eval to ensure you truly own your AI capabilities, not just rent them.
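The portability test can be sketched as a small report: run the same scaffold and the same private eval against two interchangeable models and check that the scaffold lifts the score on both. This is a minimal illustration under stated assumptions, not a definitive implementation. The models are toy stand-in functions rather than real API calls, and all names (wrap_with_context, portability_report, in_control, KNOWLEDGE, CASES) are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]
EvalCase = Tuple[str, str]  # (question, expected answer)

# Hypothetical private data: the knowledge your scaffold injects, and your eval cases.
KNOWLEDGE = {"refund window?": "30 days", "shipping time?": "5 days"}
CASES: List[EvalCase] = [("refund window?", "30 days"), ("shipping time?", "5 days")]


def wrap_with_context(model: Model) -> Model:
    """Scaffold: prepend privately held context to the question before calling the model."""
    def agent(question: str) -> str:
        context = KNOWLEDGE.get(question, "")
        return model(f"context: {context}\nquestion: {question}")
    return agent


def score(agent: Model, cases: List[EvalCase]) -> float:
    """Private eval: fraction of cases answered correctly (case-insensitive)."""
    hits = sum(agent(q).strip().lower() == a.lower() for q, a in cases)
    return hits / len(cases)


def portability_report(models: Dict[str, Model], cases: List[EvalCase]) -> Dict[str, Dict[str, float]]:
    """For each model, compare the raw score with the score under the same scaffold."""
    report = {}
    for name, model in models.items():
        raw = score(model, cases)
        scaffolded = score(wrap_with_context(model), cases)
        report[name] = {"raw": raw, "scaffolded": scaffolded, "lift": scaffolded - raw}
    return report


def in_control(report: Dict[str, Dict[str, float]]) -> bool:
    """You are 'in control' here if the scaffold improves every model you swap in."""
    return all(row["lift"] > 0 for row in report.values())


def _read_context(prompt: str) -> str:
    """Toy behavior shared by both stand-in models: echo the context line if present."""
    first = prompt.splitlines()[0]
    prefix = "context: "
    if first.startswith(prefix) and len(first) > len(prefix):
        return first[len(prefix):]
    return "unknown"


def model_a(prompt: str) -> str:
    return _read_context(prompt)


def model_b(prompt: str) -> str:
    return _read_context(prompt).upper() + " "  # different output style, same capability


report = portability_report({"model_a": model_a, "model_b": model_b}, CASES)
portable = in_control(report)
```

With real models you would replace model_a and model_b with thin wrappers around each provider's client and keep the scaffold and eval cases unchanged, which is what makes the comparison meaningful.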