Key Takeaways
- Axiom Math's AI achieved a perfect score of 120 on the December 2025 Putnam exam, outperforming both the best human competitor (110 points) and DeepSeek, a leading LLM (103 points).
- This superhuman performance comes from a methodology centered on "Lean data": proofs written in the Lean language and formally verified by its checker. Informal LLMs generate probabilistic answers, whereas a Lean proof either passes the checker or is rejected, so correctness is ensured by design.
- Axiom Math views converting mathematical proofs into formal code or programs as the path to superior and scalable AI performance in math.
- For true Math AGI, the human bottleneck in verifying the outputs of informal LLMs makes that approach impractical. Axiom's formal verification scales brilliance, not just error correction.
The Method: Formal Verification for Superhuman Math
Forget the hype around general-purpose LLMs hallucinating their way through complex problems. Carina Hong, CEO of Axiom Math, laid out a specific, rigorous path to AI that doesn't just guess at answers – it proves them. Their recent perfect score on the prestigious Putnam exam wasn't a fluke; it was the direct result of a calculated methodology.
While other "frontier labs" pour resources into training massive, informal language models, Axiom Math bets on structure. Hong explains, "We heavily rely on a kind of data called Lean data... it's correct, so you know it's correct or not, and that's quite important." This isn't just about finding solutions; it's about building solutions with mathematical certainty.
Their approach contrasts sharply with the stochastic nature of informal LLMs. On the December 2025 Putnam exam, the best human student scored 110 points, and DeepSeek, a leading LLM, scored 103. Axiom Math's system scored 120, the maximum possible (12 problems worth 10 points each). "We generally think that formal math, then, by sort of converting math proofs to programs, to code, gives us much better performance," Hong states. This isn't just about being good; it's about being unequivocally correct.
The core of Axiom Math's method is formal verification, particularly with the Lean language. They train their AI not just on mathematical texts but on formally verified proofs. This "Lean data" provides a ground truth that traditional LLMs lack, since each proof either passes the checker or fails, leading to superior sample efficiency and a deterministic check on every output. Hong points out that relying on informal LLMs means constant human oversight for verification, which simply doesn't scale for something as demanding as Math AGI. "My suspicion about, like, you know, whether we can scale to math AGI just by the informal approach is you're going to keep having... human experts who grade. And it's just, human experts, like, doesn't scale that well."
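To make "formally verified proof" concrete, here is a minimal sketch in Lean 4. It is an illustration only, not a sample of Axiom Math's training data; the theorem names are made up for this example, and it assumes only the lemmas that ship with core Lean (`Nat.add_assoc`), with no extra libraries.

```lean
-- Illustrative only: theorem names are hypothetical, chosen for this sketch.

-- Lean accepts this because both sides compute to the same number.
theorem two_plus_two : 2 + 2 = 4 := rfl

-- Replacing `4` with `5` above would make the file fail to compile:
-- the checker rejects the proof instead of returning a plausible-looking answer.

-- A proof by induction that 0 + n = n for every natural number n.
theorem zero_add_example (n : Nat) : 0 + n = n := by
  induction n with
  | zero => rfl
  | succ k ih =>
    -- Regroup 0 + (k + 1) as (0 + k) + 1, then apply the induction hypothesis.
    rw [← Nat.add_assoc, ih]
```

The point is the binary outcome: a proof either type-checks or it does not, so no human grader is needed to decide whether the argument holds.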
Where This Breaks Down
While Axiom Math's method demonstrates incredible power in highly structured domains like mathematics, it's not a silver bullet for every problem. Formal verification requires domain expertise, specialized languages like Lean, and significant upfront investment in defining the system's rules. This makes it challenging for problems that are ill-defined, constantly changing, or where the 'correct' answer is subjective rather than absolute.
For areas demanding creativity, intuition, or dealing with the messy, unstructured data of the real world (think natural language understanding or abstract reasoning), a purely formal approach might be too rigid or too slow. Hong herself acknowledges this, suggesting, "The thing is, the informal stuff is also available to us, in a way. If you really like, you can have both an informal and a formal system, and that is going to be very strong." This implies that while formal methods provide a backbone of correctness, a hybrid system that combines formal rigor with the exploratory power of informal techniques might ultimately yield the most robust and versatile AI.
What to Do With This
Don't write off formal verification as only for math PhDs. Take the core insight: rigor beats stochasticity when correctness is non-negotiable. Identify one critical process or data pipeline in your startup where an error could be catastrophic – financial transactions, API contract enforcement, or core business logic. Instead of just testing for bugs, ask how you can introduce 'formally verified' principles. Could you use stricter type systems, declarative configuration, or even a lightweight formal specification for your most critical components? Pull your last three incident reports. For each one, consider if a "correct-by-construction" mindset could have prevented it, then implement one small, formal step this week to lock down that risk.
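As one possible "small, formal step", the sketch below applies a correct-by-construction style to a money transfer in Python. It is not formal verification in the Lean sense: it relies on runtime checks in constructors so that invalid values cannot exist once created. The names `Amount`, `Account`, and `transfer` are hypothetical, chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Amount:
    """A non-negative amount in integer cents (hypothetical example type)."""

    cents: int

    def __post_init__(self) -> None:
        # Reject bools and non-integers so floats never sneak into balances.
        if isinstance(self.cents, bool) or not isinstance(self.cents, int):
            raise TypeError("cents must be an int")
        if self.cents < 0:
            raise ValueError("cents must be non-negative")


@dataclass(frozen=True)
class Account:
    """An immutable account snapshot; the balance is always a valid Amount."""

    owner: str
    balance: Amount


def transfer(src: Account, dst: Account, amount: Amount) -> tuple[Account, Account]:
    """Return updated copies of both accounts; the combined balance is unchanged."""
    if amount.cents > src.balance.cents:
        raise ValueError("insufficient funds")
    new_src = Account(src.owner, Amount(src.balance.cents - amount.cents))
    new_dst = Account(dst.owner, Amount(dst.balance.cents + amount.cents))
    return new_src, new_dst


if __name__ == "__main__":
    alice = Account("alice", Amount(500))
    bob = Account("bob", Amount(100))
    alice, bob = transfer(alice, bob, Amount(200))
    print(alice.balance.cents, bob.balance.cents)
```

Because every `Amount` is validated when it is built and both dataclasses are frozen, a negative balance cannot be constructed or introduced by mutation later, which removes a whole class of bugs rather than testing for them one at a time.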