Protein AI: Alex Rives Scaled Data, Not Just Models

Key Takeaways

The "Bitter Lesson" applies to biology: Alex Rives, a key figure at Biohub, champions the principle that empirical scaling with massive data often outperforms human intuition or handcrafted rules in building powerful AI, even for complex protein biology.
Metagenomics was the breakthrough: While prior models like ESM2 saw diminishing returns from just increasing parameters, ESMC's leap came from adding billions of new, diverse sequences from metagenomic datasets, not just more compute or model size.
Emergence of a "world model": By scaling data from evolutionary biology, ESMC learned a deep, comprehensive understanding of protein sequence, structure, and function, effectively building a predictive "world model" for proteins without explicit structural priors.
Contrast to inductive bias: This approach stands in contrast to methods like AlphaFold, which build significant inductive biases (human assumptions) about protein structure into the model from the outset. ESMC aims to let the data dictate the structure itself.

The Method: Scaling Data, Not Just Beliefs

For most founders, "AI" means throwing more compute at a problem. But Alex Rives and his team at Biohub demonstrate that the real secret weapon for building advanced AI models might be something much harder to find: a truly new class of data. Rives lives by what some in AI call the "bitter lesson," a philosophy that prioritizes empirical scaling laws over our clever human theories.

“I believe in scaling laws,” Rives told the podcast, a concise statement summing up his approach. His work with ESMC, a protein language model, isn't just a story about bigger computers. It's about finding the "metagenomics equivalent" for your domain.

Previous models, like ESM2, hit a wall. Rives explained, “as we increased the number of parameters and compute... you could see that there's kind of diminishing returns in ESM2. ESM2 is trained on [UniRef].” The obvious move was to scale the model further, but the team found that wasn't enough.

The game changed when they expanded the data. Rives clarifies the critical shift: "What changed between ESM2... and ESMC... The data was really the critical thing here, actually. So for ESMC, we added metagenomics. So we added billions more sequences to the training data." This wasn't just more of the same data; it was a dramatically broader and more diverse evolutionary signal.
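To make the contrast between the two kinds of scaling concrete, here is a toy sketch in Python. It is not ESM's actual scaling law: the additive power-law form, the function name toy_loss, and every constant in it are assumptions invented for illustration. It only shows how, once the parameter term has shrunk, another 10x in parameters buys less than a 10x in data.

```python
def toy_loss(n_params, n_sequences, floor=1.0, a=10.0, b=10.0, alpha=0.3, beta=0.3):
    """Hypothetical loss curve: an irreducible floor plus one power-law
    term for model size and one for dataset size. All constants are made up."""
    return floor + a / n_params**alpha + b / n_sequences**beta


# Hold the data fixed and grow the model by 10x three times.
base = toy_loss(1e8, 1e9)
params_10x = toy_loss(1e9, 1e9)
params_100x = toy_loss(1e10, 1e9)
params_1000x = toy_loss(1e11, 1e9)

# Alternative to the third step: keep the model, grow the data by 10x instead.
data_10x = toy_loss(1e10, 1e10)

first_param_gain = base - params_10x
second_param_gain = params_10x - params_100x
third_param_gain = params_100x - params_1000x
data_gain = params_100x - data_10x

print(f"1st 10x of parameters lowers loss by {first_param_gain:.4f}")
print(f"2nd 10x of parameters lowers loss by {second_param_gain:.4f}")
print(f"3rd 10x of parameters lowers loss by {third_param_gain:.4f}")
print(f"10x of data instead lowers loss by  {data_gain:.4f}")
```

In this toy setup each extra 10x of parameters helps less than the one before, while switching to 10x more data at the same point helps more than the next parameter step would. Real models will not follow these numbers; the shape of the tradeoff is the only point.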

This massive influx of novel data enabled ESMC to move past those limitations, letting the model learn an expansive "world model" of protein sequence, structure, and function from the evolutionary record itself. As host Brandon put it, "this is very much in contrast to something like AlphaFold, right, where you have a lot of inductive bias built into the model in order to be able to predict protein structure." Rives confirmed this, stating, "the idea here is, you know, really, can we just learn the right structure, you know, don't give any priors, just allow, you know, allow machine learning to figure out what that structure is."

Where This Breaks Down

The "bitter lesson" approach, while powerful, isn't a silver bullet. Its reliance on massive, diverse datasets means it demands incredible resources, both in terms of data acquisition and the computational power to process it. Not every problem, or every startup, has access to the equivalent of "billions of metagenomic sequences." If your domain has scarce or proprietary data, or if the "signal" within the noise is extremely subtle, simply scaling up might lead to more noise than insight. The method also assumes that the underlying patterns are learnable from data alone, which may not always be true if human-curated knowledge offers a truly unique or highly efficient shortcut.

What to Do With This

Stop trying to squeeze more juice from your existing datasets. Instead, identify the "metagenomics equivalent" for your own business problem. This week, challenge your team to brainstorm three entirely new and unconventional data sources that could be orders of magnitude larger or more diverse than anything you're currently using, even if they seem messy or irrelevant at first glance.