Key Takeaways

  • The common view of AI agent testing "overindexes" on simple "computer use": clicking a button or filling in a field, actions that are relatively straightforward for AI agents to mimic.
  • Real AI testing is a far more complex 'problem-solving' challenge, requiring agents to reason through multi-application orchestration and deep codebase context.
  • Agents must navigate changes that span front-end, back-end, and nested services, coordinating actions across systems, not just within a single UI.
  • A single frontier model often can't handle these end-to-end testing tasks; they call for the careful orchestration of multiple agents.
  • Founders should evaluate AI agents not on their ability to mimic basic human computer interaction, but on their capacity for deep code reasoning and multi-system problem-solving.

The Illusion of Simple AI Testing

If you're building with AI agents, you might think you're ahead of the curve just by getting them to interact with a UI. You’ve seen the demos: agents clicking buttons, filling forms, navigating websites. It feels like magic. But Cole Murray, an expert in AI coding agents, warns that these demos, while impressive, miss the true challenge. “I think they actually overindex on the computer use part of it because computer use in my mind is the literal okay you want you know a button you want to click can you emit the right coordinates to go click that button,” Murray explains.

This isn't to diminish the technical feat of 'computer use' for AIs. It's a foundational capability. But for ambitious founders, relying solely on an agent's ability to mechanically reproduce human UI actions is like training an aspiring architect to draw straight lines and calling them a master builder. The real work, the hard part, lies deeper. Walden Yan echoes this, pointing out that “the computer use is kind of a subset of the larger testing problem.” It's a necessary step, but far from sufficient for robust, real-world agent applications.

The Deep Codebase Challenge

The actual hurdle for AI agents isn't the 'clicking' part; it's the 'thinking' part. Murray clarifies, “I think testing is actually a really interesting problem solving challenge for these AIs because if you wanted to do arbitrary testing like imagine you make a change that spans the front end and the back end.” This isn't about UI automation. It's about an agent understanding how a code change in one part of a system affects another part, potentially in an entirely different service.
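To make the cross-service point concrete, here is a minimal sketch of the kind of impact analysis an agent would need before it tests anything: given the files a change touches, work out which services could be affected. The path prefixes, service names, and dependency map are all hypothetical and invented for illustration; nothing here comes from Murray's system, and a real agent would have to derive this information from the codebase rather than from a hand-written table.

```python
from collections import deque

# Hypothetical repository layout: path prefix -> owning service.
OWNERS = {"web/": "frontend", "api/": "backend", "jobs/": "worker"}

# Hypothetical dependency map: service -> services that depend on it.
DEPENDENTS = {"backend": ["frontend"], "worker": ["backend"], "frontend": []}


def services_to_test(changed_files):
    """Return the services touched by a change plus everything downstream."""
    # Services that directly own at least one changed file.
    start = {
        service
        for path in changed_files
        for prefix, service in OWNERS.items()
        if path.startswith(prefix)
    }
    # Breadth-first walk over dependents to collect indirectly affected services.
    seen = set(start)
    queue = deque(start)
    while queue:
        service = queue.popleft()
        for dependent in DEPENDENTS.get(service, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return sorted(seen)


print(services_to_test(["jobs/email.py"]))  # a worker change reaches all three services
print(services_to_test(["web/form.tsx"]))   # a front-end change stays in the front end
```

Even in this toy version, a change to a background job ends up requiring tests in services whose files were never edited, which is the part a UI-only agent cannot see.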

To manage this, agents need to grasp deep codebase context, figure out how to orchestrate interactions across multiple applications, and trigger specific features in complex ways. “Figuring out how do you do that requires a lot of codebase context requires a lot of orchestration that we've specifically done,” Murray shares. He adds a critical insight: “in some cases we found that you actually no one frontier model can actually do this full end to end task itself.” This means you're not just deploying one powerful model; you're building a system to coordinate several, each contributing to a broader testing strategy. That is considerably more complex than simple UI interaction, and it involves things like video recordings and detailed annotations that help human reviewers understand what the AI did and why.
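One way to picture "coordinating several" is a pipeline in which each agent handles one part of the job and passes shared state to the next. The sketch below is only an assumption about what such a structure could look like: the agent names (planner, backend_tester, ui_tester, reporter) and the TestContext fields are hypothetical, and each agent is a plain stub function where a real system would call a model.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TestContext:
    """Shared state handed from agent to agent (all fields are illustrative)."""
    change_summary: str
    plan: list[str] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)


Agent = Callable[[TestContext], TestContext]


def planner(ctx: TestContext) -> TestContext:
    # A real planner would reason over the codebase; this stub hard-codes the steps.
    ctx.plan = ["call backend API", "exercise front-end flow"]
    return ctx


def backend_tester(ctx: TestContext) -> TestContext:
    if "call backend API" in ctx.plan:
        ctx.evidence.append("backend: API step executed")
    return ctx


def ui_tester(ctx: TestContext) -> TestContext:
    if "exercise front-end flow" in ctx.plan:
        ctx.evidence.append("frontend: UI step executed")
    return ctx


def reporter(ctx: TestContext) -> TestContext:
    # Summarize for a human reviewer; counts are taken before this line is added.
    summary = f"report: {len(ctx.plan)} planned, {len(ctx.evidence)} recorded"
    ctx.evidence.append(summary)
    return ctx


def run_pipeline(ctx: TestContext, agents: list[Agent]) -> TestContext:
    """Run each agent in order, threading the shared context through."""
    for agent in agents:
        ctx = agent(ctx)
    return ctx


result = run_pipeline(
    TestContext("rename a field in the API and the form that submits it"),
    [planner, backend_tester, ui_tester, reporter],
)
for line in result.evidence:
    print(line)
```

The point of the structure is that the UI agent is just one stage: the plan and the final report, which stands in here for the recordings and annotations given to human reviewers, live outside it.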

What to Do With This

Stop evaluating AI agents solely on flashy UI demos. This week, task your AI development team, or your prospective AI agent vendor, with a true cross-service testing challenge. Give them a bug report that requires changes across your front-end, a specific API endpoint, and a background service, then demand a detailed explanation of how their agent orchestrated the fix and verified it end to end. If they can only show you UI clicks, they're not ready for your critical path.