AI Evals Are Modern PRDs: Quantify Design Taste with Anker Goyel's Method

Anker Goyel, a sharp mind in the AI space, dropped a truth bomb for founders in their 20s and 30s: machine learning changes how we program. It's no longer about dictating how a system works, but rather about defining what success looks like. This shift forces a complete rethink of product development, particularly when it comes to the squishy, subjective realm of design and user experience.

“Machine learning specifically shifts the task of programming from being about the how to being about the what,” Goyel shared. For him, this means traditional Product Requirement Documents (PRDs) are dead. Long live AI evals. Goyel argues that “evals are actually the modern version of a PRD,” replacing vague prose with concrete, quantitative examples that spell out success criteria. But what happens when 'success' is less about a number and more about a 'vibe'? That's where his team's ingenious "Quantifying David" method comes in.

Key Takeaways

Machine learning fundamentally changes programming from defining how to build something, to defining what the successful outcome looks like.
Anker Goyel believes AI evals are the modern PRD, allowing teams to specify success criteria quantitatively with examples, moving beyond mere prose.
His team developed “The Quantifying David Method” to translate qualitative design 'vibe checks' from their expert designer, David, into measurable scoring functions for AI evals.
This approach lets them scale David's high-quality aesthetic across far more products and iterations than manual review could, significantly raising the overall quality bar.
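
To make the "eval as PRD" idea concrete, here is a minimal Python sketch of a requirement written as examples plus a pass threshold instead of prose. Everything in it is an assumption made for illustration: the names `EvalCase`, `EvalSpec`, and `meets_spec`, the keyword check, and the sample prompts are hypothetical and do not come from Goyel or his team's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One concrete example: a prompt plus terms a good answer must contain."""
    prompt: str
    must_include: tuple[str, ...]


@dataclass(frozen=True)
class EvalSpec:
    """A requirement stated as examples and a quantitative success bar."""
    name: str
    cases: tuple[EvalCase, ...]
    pass_threshold: float  # minimum fraction of cases that must pass


def case_passes(case: EvalCase, output: str) -> bool:
    # Hypothetical criterion: a case-insensitive keyword check.
    return all(term.lower() in output.lower() for term in case.must_include)


def meets_spec(spec: EvalSpec, outputs: dict[str, str]) -> bool:
    """True when enough cases pass; a missing output counts as a failure."""
    if not spec.cases:
        return False
    passed = sum(case_passes(c, outputs.get(c.prompt, "")) for c in spec.cases)
    return passed / len(spec.cases) >= spec.pass_threshold


spec = EvalSpec(
    name="docs-answers",
    cases=(
        EvalCase("How do I install the CLI?", ("install",)),
        EvalCase("How do I log in?", ("log in", "token")),
    ),
    pass_threshold=1.0,
)
outputs = {
    "How do I install the CLI?": "Install it with your package manager.",
    "How do I log in?": "Log in with an API token.",
}
print(meets_spec(spec, outputs))  # True: both cases pass
```

The point of the sketch is the shape, not the keyword check: the success criteria live in data that anyone can read, run, and extend.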

The Quantifying David Method for Scaling Design Taste with AI Evals

This method empowers teams to formalize and scale subjective expert judgments, like a designer's 'taste,' using AI evaluations.

Step 1: Initial AI Eval Development: Run a ton of AI evals to quantitatively improve product aspects, such as documentation answers, until your own less sophisticated palate finds the results good.
Step 2: Expert 'Vibe Check': Present the AI-generated results to a designated expert (e.g., a designer with high taste, 'David') for a qualitative 'vibe check', typically once every few days.
Step 3: Capture and Quantify Expert Feedback: When the expert criticizes the results, go back and try to capture their qualitative feedback (e.g., 'David actually thinks it's okay to show both languages as long as X, Y, Z') and translate it into improvements for the AI eval scoring functions. Attempt to 'quantify David's taste.'
Step 4: Iterate and Refine Evals: Improve the scorers based on the quantified expert feedback, ensuring that previous mistakes are not repeated in subsequent iterations, while still seeking periodic 'vibe checks' for new insights.
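
Steps 3 and 4 can be sketched in Python as follows. This is an illustration under stated assumptions, not the team's implementation: `EvalSuite`, `add_feedback`, and `regressions` are hypothetical names, and the `no_preamble` critique is invented for the example rather than taken from David's actual feedback.

```python
from dataclasses import dataclass, field
from typing import Callable

# A scorer maps one generated output to a score in [0.0, 1.0].
Scorer = Callable[[str], float]


@dataclass
class EvalSuite:
    """Holds scorers plus the outputs the expert has already rejected."""
    scorers: dict[str, Scorer] = field(default_factory=dict)
    rejected_examples: list[tuple[str, str]] = field(default_factory=list)

    def add_feedback(self, name: str, scorer: Scorer, bad_output: str) -> None:
        """Step 3: turn one critique into a scorer and keep the offending output."""
        self.scorers[name] = scorer
        self.rejected_examples.append((name, bad_output))

    def score(self, output: str) -> dict[str, float]:
        """Score one output with every scorer collected so far."""
        return {name: fn(output) for name, fn in self.scorers.items()}

    def regressions(self) -> list[str]:
        """Step 4: scorers that no longer flag the output they were written for."""
        return [name for name, bad in self.rejected_examples
                if self.scorers[name](bad) >= 1.0]


# Hypothetical critique: "lead with the answer, not a preamble".
def no_preamble(output: str) -> float:
    text = output.strip()
    first_line = text.splitlines()[0].lower() if text else ""
    return 0.0 if first_line.startswith(("great question", "sure,")) else 1.0


suite = EvalSuite()
suite.add_feedback("no_preamble", no_preamble, "Great question! To install, run pip.")
print(suite.score("Run `pip install example`."))  # {'no_preamble': 1.0}
print(suite.regressions())  # []
```

Keeping each rejected output next to its scorer is one way to make sure a previous mistake is not repeated: if a later change to a scorer stops flagging the original bad example, `regressions()` reports it.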

When This Works (and When It Doesn't)

This method works best for scaling subjective quality judgments or 'taste' across a large volume of AI-generated content or features. It makes an expert's insights applicable to more things than manual review would allow, ultimately raising the overall quality bar. It addresses the challenge of moving beyond individual examples to generalizable quality improvements. Goyel points out, “it's not practical for David, who's the ultimate brain trust taste maker, to look at everything manually.”

This method shines when you have a clear, consistent "taste maker" whose judgment is excellent but whose time is severely limited. It is less useful if your quality issues are around objective functional bugs rather than subjective user experience or aesthetic polish. It also demands a willingness to invest in iterative eval refinement, not a one-and-done setup. If your expert's 'taste' is too volatile or context-dependent to be codified into consistent rules, this method will struggle to generalize.

What to Do With This

Want to apply "The Quantifying David Method" this week? Consider your company's brand voice. Say you're a founder using AI to draft blog posts and social media captions, but your Head of Marketing, Sarah, has a specific, hard-to-pin-down brand tone that needs to shine through. Sarah is currently buried in manual reviews. Here’s how you’d use the method:

1. Initial AI Eval Development: Generate a batch of AI-powered content. Get it to a baseline quality where an internal, non-expert review deems it "okay."

2. Expert 'Vibe Check': Present these content drafts to Sarah. She will perform her 'vibe check,' highlighting where the tone is off, the humor falls flat, or the messaging feels generic for your brand.

3. Capture and Quantify Expert Feedback: Instead of just fixing the specific examples, document Sarah's feedback. For instance, she might say, "Puns land better if they're self-deprecating, not boastful." Or, "Always use active voice when discussing product features, but passive voice for industry trends." Translate these qualitative critiques into measurable scoring functions for an AI eval. As Goyel puts it, you're “attempt[ing] to quantify David.”

4. Iterate and Refine Evals: Rerun the AI with the updated evals and present new batches to Sarah periodically. You will move from Sarah correcting individual posts to Sarah improving the system that generates the posts, scaling her unique brand voice across all your content output.
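
As a rough illustration of step 3, here is how Sarah's active-voice rule for product features might become a scoring function in Python. The regular expression is a deliberately crude stand-in for real passive-voice detection, and `active_voice_for_features`, `split_sentences`, and the `feature_terms` parameter are hypothetical names chosen for this sketch. A production scorer would likely need something more robust, such as a grammar-aware parser or a model-based judge.

```python
import re

# Crude heuristic: a form of "to be" followed by a word ending in -ed or -en.
# It misses irregular participles and can misfire on adjectives.
PASSIVE = re.compile(
    r"\b(?:is|are|was|were|be|been|being)\s+\w+(?:ed|en)\b", re.IGNORECASE
)


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def active_voice_for_features(draft: str, feature_terms: set[str]) -> float:
    """Fraction of feature-mentioning sentences with no passive construction."""
    relevant = [
        s for s in split_sentences(draft)
        if any(term in s.lower() for term in feature_terms)
    ]
    if not relevant:
        return 1.0  # the rule does not apply, so nothing to penalize
    active = [s for s in relevant if not PASSIVE.search(s)]
    return len(active) / len(relevant)


draft = (
    "Our dashboard tracks every deploy. "
    "Reports are generated by the dashboard nightly."
)
print(active_voice_for_features(draft, {"dashboard"}))  # 0.5
```

A score like this gives Sarah something to react to: if she disagrees with a 0.5, the disagreement itself is new feedback to fold into the scorer.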