Key Takeaways

  • Compilers with auto-vectorization are "not even close" to matching handwritten assembly on critical code paths; compiler output can run 10x to 60x slower, according to Kieran Kunhya.
  • The dav1d project, an AV1 decoder containing 240,000 lines of handwritten assembly, showcases this extreme optimization, yielding massive speed improvements over the equivalent compiled C code.
  • In the post-Moore's Law era, with hardware speed gains slowing, extracting maximum performance from existing machines means going deeper into the stack, where "every cycle matters."
  • Achieving these gains often means "abusing the machine": using instructions, such as cryptography instructions, in ways their hardware designers never intended, a skill Jean-Baptiste Kempf calls a "lost art."

The Method: Abusing the Machine for Extreme Performance

Forget what you've heard about compilers fixing everything. For ambitious founders building performance-critical systems, the true path to speed isn't waiting for the next chip or trusting clever algorithms alone. It's a "lost art" of handwritten assembly that pushes hardware far beyond its intended limits.

Jean-Baptiste Kempf and Kieran Kunhya, the architects behind FFmpeg and VLC, laid out a startling reality on the Lex Fridman podcast: modern compilers, even with advanced auto-vectorization for SIMD operations, simply cannot compete with human-crafted assembly when "every cycle matters." Kunhya didn't mince words: "It's not like 5%, 10% slower. It's multiple times slower." He's talking 10x to 60x slower in some cases, a gap that directly translates to wasted energy and poorer user experiences at scale.
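To make the gap concrete, here is a minimal sketch (not dav1d code; all names are illustrative) of the kind of inner loop at stake: a per-byte floor average, a staple of video bi-prediction, written once as plain scalar C and once as a SWAR ("SIMD within a register") variant that processes eight bytes per 64-bit operation. Handwritten SIMD assembly takes the same idea much further, to 16, 32, or 64 lanes per instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar per-byte floor average: the kind of inner loop a video decoder
   runs billions of times. */
static void avg_scalar(uint8_t *dst, const uint8_t *a, const uint8_t *b,
                       size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)((a[i] + b[i]) >> 1);
}

/* SWAR variant: eight bytes per 64-bit operation, using the identity
   floor((x + y) / 2) == (x & y) + ((x ^ y) >> 1). The 0x7f mask stops
   shifted bits from leaking across byte lanes. */
static void avg_swar(uint8_t *dst, const uint8_t *a, const uint8_t *b,
                     size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t x, y, r;
        memcpy(&x, a + i, 8);
        memcpy(&y, b + i, 8);
        r = (x & y) + (((x ^ y) >> 1) & 0x7f7f7f7f7f7f7f7fULL);
        memcpy(dst + i, &r, 8);
    }
    for (; i < n; i++)  /* scalar tail for leftover bytes */
        dst[i] = (uint8_t)((a[i] + b[i]) >> 1);
}
```

The scalar loop is what most developers write and hope the compiler vectorizes; the SWAR version shows the sort of lane-level reasoning a human applies directly, which hand-vectorized assembly then multiplies with wider registers and purpose-chosen instructions.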

The dav1d project, an open-source AV1 video decoder, stands as a testament to this philosophy. This beast includes 240,000 lines of meticulously handwritten assembly code. Why such extreme effort? Because dav1d is used in VLC and other AV1 playback stacks across potentially 3 billion devices globally. For Kempf, the motto driving the project is clear: "every cycle matters" when you're powering video on that many screens, non-stop.

This isn't just about marginal micro-optimizations. It’s about a profound, almost intimate understanding of the underlying hardware architecture – so profound, in fact, that you start to "abuse the machine," as Kunhya puts it. He describes using instructions completely unrelated to their original purpose, like repurposing a cryptography instruction for video processing, just to shave crucial cycles. This level of deep, architectural insight allows developers to squeeze efficiency from hardware where general-purpose compilers, built for broader compatibility and ease, would never dare to tread. It is, as Kempf says, "an art," one almost unique to projects like dav1d, defying the conventional wisdom that compilers are the ultimate optimizers.
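The podcast doesn't show dav1d's actual tricks, but a classic, well-documented analogue of this instruction repurposing is using the integer multiplier as an eight-lane adder; the same multiply finishes many 64-bit popcount implementations. The sketch below (an illustration of the idea, not dav1d's code) replaces seven additions with one multiplication.

```c
#include <assert.h>
#include <stdint.h>

/* "Abusing" the integer multiplier: multiplying by 0x0101...01 makes the
   top byte of the 64-bit product hold the sum of all eight input bytes,
   so a single MUL replaces seven additions. This is only valid when the
   byte sums cannot carry between lanes, i.e. the total sum fits in one
   byte (guaranteed here if every input byte is below 32). */
static unsigned sum8_mul(uint64_t packed_bytes) {
    return (unsigned)((packed_bytes * 0x0101010101010101ULL) >> 56);
}
```

The multiplier was never designed as a horizontal adder, which is exactly the point: knowing what an instruction *computes*, rather than what it is *for*, is the architectural intimacy Kunhya is describing.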

Where This Breaks Down

This extreme optimization isn't for every project, or even most. The sheer cost in time, specialized talent, and ongoing maintenance for hundreds of thousands of lines of assembly is astronomical. Finding engineers with the "art" of abusing machines in ways their creators didn't expect is incredibly difficult, often requiring decades of specialized experience. This approach is only viable for truly critical, foundational technologies—like video codecs, operating system kernels, or high-frequency trading engines—that run on billions of devices and where every fraction of a millisecond translates into colossal energy savings, reduced infrastructure costs, or indispensable user experience improvements globally.

Applying this level of optimization to a standard web application, most SaaS products, or any project where development velocity and time-to-market are primary concerns would be a gross misallocation of resources. It would likely trade minor, imperceptible performance gains for massive development debt and slow iteration. This method breaks down decisively when the "every cycle matters" threshold isn't genuinely met, or when the scale of impact doesn't justify the monumental effort.

What to Do With This

Identify your mission-critical "choke points" – the 1% of your code that executes millions or billions of times and directly impacts user experience or operating costs. If you're building foundational infrastructure, a global service with high-frequency transactions, or embedded systems where power efficiency is paramount, then challenge the assumption that high-level languages and compilers are sufficient.

Find a specialized "artist" (or train your best engineers) who deeply understands your target architecture and can identify opportunities to "abuse the machine." Start with profiling tools to pinpoint where even tiny cycle savings have outsized impact, then consider unconventional, low-level optimizations for those specific hot paths.