Key Takeaways
- Satya Nadella believes that a company's "private evals" – its unique, internal evaluation datasets and methods – are quickly becoming the most important form of intellectual property in the AI world. This goes beyond raw training data; it's how you prove an AI's effectiveness for your specific needs.
- The true test of AI control, Nadella says, is whether you can switch from one frontier model (like GPT-4) to another (like Claude Opus) and still maintain or even improve performance using your private evals. If you can, you genuinely own your AI capability.
- Microsoft's "harness" strategy aims to give enterprises this control, providing a multi-model framework (models, data, tools) so they can develop specialized agents that retain value, even if they use external foundation models.
- Building AI models with a "clean lineage" – ensuring high-quality data and careful ablations from pre-training onwards – is now harder but more critical than ever for creating reliable, controllable AI systems.
- To truly gauge your company's agency in the AI age, apply Satya Nadella's AI Control 'Acid Test'.
Satya Nadella's AI Control 'Acid Test'
What makes AI your IP? In an evolving AI landscape, this is the burning question for ambitious builders. Microsoft CEO Satya Nadella offers a surprisingly sharp answer: it’s not just the models you train or the data you feed them, but your “private evals,” the proprietary methods you use to measure an AI's performance. As Nadella puts it, “Every company having private evals may be the biggest IP. Right? I think about it. Like, what's that private eval that you can then use even a frontier model to hill climb on and not leak the traces may be one of the biggest drivers of IP.”
He introduces a clear framework for true control:
- Condition 1: Private Eval: You have an eval that's private.
- Condition 2: Model Switching: You're using model A. Can you switch to model B and still hill climb on that eval?
- Result: In Control: If you can, then you're in control.
- Result: Not In Control: If you can't, you're not in control.
This framework underpins Microsoft's "harness" strategy, which isn't about locking you into their models. Instead, it’s about providing a multi-model environment that lets your private evals and specialized tools dictate performance. "Having an open harness, letting all models come in, having your evals, your [context], your tools help you hill climb, I think is the skills that an AI native startup needs, a SaaS company needs, or every enterprise needs," Nadella states. He cited a specific demo: "Interestingly enough, if you add a little temporality to it, you can use, let's say, in fact, the Land O'Lakes demo we showed was pretty cool. We used whatever GPT-55, right? Then you collected a bunch of traces, and then you took a 5B reasoning model and achieved higher. So that is another aspect of what it means to operate at the frontier."
When This Works (and When It Doesn't)
This framework applies when you are assessing whether a company can keep its agency and capture value in an AI ecosystem, particularly its IP and its control over specialized AI capabilities. It fits when you're building a proprietary AI product or a core internal intelligence that gives you a competitive edge. If your business depends on highly accurate, specialized AI outputs, owning your evals is your bedrock.
However, this framework is less relevant for generic AI use cases where differentiation isn't key, or for very early-stage experimentation where the primary goal is simply to get any AI functionality working. If your AI feature is a commodity, or if you're just kicking tires, the acid test for control might be overkill. It's about protecting defensible value, not just building.
What to Do With This
Next week, apply Nadella's 'Acid Test' to your most critical AI-powered workflow or product. Imagine you're building an AI agent that drafts personalized legal contracts based on client inputs. Here's how to use the framework:
1. Define Your Private Eval: Create a rigorous, proprietary evaluation set. This could be 50 anonymized past client cases, each with an expertly drafted contract and specific clauses highlighted. Your AI must match these highlights and draft legally sound, customized text. This eval is your secret sauce.
2. Test Model A: Integrate a leading frontier model, like Claude 3 Opus. Run it against your private eval, measure its performance, and quantify its accuracy and completeness in drafting contracts.
3. Switch to Model B: Now, swap it out for another top-tier model, say, Google's Gemini 1.5 Pro. Run this model against the same private eval. Compare its performance to Model A. Can it still hit your benchmarks? Does it improve?
4. Assess Control: If you can switch between Claude and Gemini, and both deliver comparable or improved results on your private eval, then you are in control. Your specific IP lies in how you measure and tune the AI, not just which model you use. If the switch breaks everything, your intelligence is merely rented, and you're locked into a single model's ecosystem.
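The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Nadella's or Microsoft's method: `EvalCase`, `clause_coverage`, `run_eval`, `in_control`, the 0.05 tolerance, and the two stub models are all hypothetical names and values invented for this example. The stubs stand in for real API clients, and the substring check is a deliberately crude proxy for expert legal review.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One private eval case (hypothetical structure)."""
    client_input: str
    required_clauses: tuple[str, ...]  # clauses an expert marked as mandatory


# A "model" is any function from prompt text to drafted text, so swapping
# providers only means passing in a different callable.
Model = Callable[[str], str]


def clause_coverage(draft: str, case: EvalCase) -> float:
    """Fraction of required clauses found in the draft (case-insensitive).

    Assumes every case lists at least one required clause.
    """
    text = draft.lower()
    hits = sum(1 for clause in case.required_clauses if clause.lower() in text)
    return hits / len(case.required_clauses)


def run_eval(model: Model, cases: list[EvalCase]) -> float:
    """Mean clause coverage of one model across the private eval set."""
    total = sum(clause_coverage(model(c.client_input), c) for c in cases)
    return total / len(cases)


def in_control(score_a: float, score_b: float, tolerance: float = 0.05) -> bool:
    """Acid test: model B must match or beat model A, within a tolerance."""
    return score_b >= score_a - tolerance


# Stub models (assumptions): replace with calls to real providers.
def model_a(prompt: str) -> str:
    return (f"Contract for {prompt}. Includes an indemnification clause "
            "and a governing law clause.")


def model_b(prompt: str) -> str:
    return (f"Agreement for {prompt}. Governing law: Delaware. "
            "Indemnification applies. Termination on 30 days notice.")


# Step 1: the private eval set (two toy cases instead of 50 real ones).
CASES = [
    EvalCase("SaaS vendor agreement", ("indemnification", "governing law")),
    EvalCase("consulting engagement", ("indemnification", "termination")),
]

score_a = run_eval(model_a, CASES)  # Step 2: test model A
score_b = run_eval(model_b, CASES)  # Step 3: switch to model B
verdict = in_control(score_a, score_b)  # Step 4: assess control
print(f"model A: {score_a:.2f}, model B: {score_b:.2f}, in control: {verdict}")
```

Because both models sit behind the same `Model` signature, the eval set and scoring logic stay untouched when the provider changes. In a real setup the scoring function would likely be far richer than substring matching, but the shape of the test stays the same.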