Benioff: Your Multi-Sensory AI Token Costs Are All Wrong

Key Takeaways

Multi-sensory AI, moving beyond text-only large language models, represents the next major wave for AI development, aimed at approaching AGI.
Early multi-sensory applications, such as the one Jason Calacanis describes that samples a user's desktop and webcam every 200 milliseconds, imply a "thousand-fold increase" in token usage.
Marc Benioff challenges the assumption that this volume translates into proportionally higher costs, proposing an architecture of "intermediary layers" that route each task to a smaller, specialized, and more affordable model where one will do.
Chamath Palihapitiya reinforces the efficiency argument by advocating for powerful local desktop models, which also offer significant privacy benefits.
Founders should design their AI systems to optimize for token efficiency from the start, rather than bracing for an inevitable cost explosion.

The Disagreement

Founders are bracing for a future where multi-sensory AI consumes tokens at an unprecedented rate. Jason Calacanis paints a picture of always-on devices, such as Apple Watches or AirPods with cameras, constantly processing environmental data. He cites examples like Mira Murati's "Thinking Machines," which “watches your desktop… listens to all the voices and then it's watching your webcam all at the same time and every 200 milliseconds it's uploading it to two different models.” That relentless monitoring implies a "thousand-fold increase" in token usage, and many founders fear spiraling compute costs and an inevitable tax on every interaction.
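To put rough numbers on that cadence: the 200-millisecond interval and the two models come from the quote, while the tokens-per-upload figure and the eight-hour day are placeholders invented purely for illustration. Treat this as a back-of-envelope sketch, not a measurement.

```python
# Back-of-envelope volume estimate for an always-on multi-sensory app.
SAMPLE_INTERVAL_MS = 200   # from the quote: one sample every 200 ms
MODELS_PER_SAMPLE = 2      # from the quote: each sample goes to two models
TOKENS_PER_UPLOAD = 500    # ASSUMPTION: hypothetical tokens per upload
HOURS_PER_DAY = 8          # ASSUMPTION: hypothetical hours of active use

MS_PER_HOUR = 3_600_000


def uploads_per_hour(interval_ms: int, models: int) -> int:
    """Number of model uploads generated in one hour of monitoring."""
    samples_per_hour = MS_PER_HOUR // interval_ms
    return samples_per_hour * models


def tokens_per_day(interval_ms: int, models: int,
                   tokens_per_upload: int, hours: int) -> int:
    """Total tokens per day if every sample is sent to every model."""
    return uploads_per_hour(interval_ms, models) * tokens_per_upload * hours


if __name__ == "__main__":
    print(uploads_per_hour(SAMPLE_INTERVAL_MS, MODELS_PER_SAMPLE))
    print(tokens_per_day(SAMPLE_INTERVAL_MS, MODELS_PER_SAMPLE,
                         TOKENS_PER_UPLOAD, HOURS_PER_DAY))
```

At five samples per second and two models, that is 36,000 uploads per hour before any filtering. The daily token total depends entirely on the assumed tokens per upload, which is exactly the variable an efficiency-minded design tries to shrink.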

But Marc Benioff, CEO of Salesforce, directly challenges this assumed cost curve. He agrees that multi-sensory models are “the next big wave for AI” and a necessary step beyond LLMs on the way to AGI, but he calls the prevailing view on token expense a "mistake." Benioff argues that indiscriminately sending all data to massive, expensive models (routing everything to Anthropic, for example) is wasteful. He envisions "some intermediary layer that's saying oh oh that one has to go to Anthropic but these ones can handle by smaller models that can route it to the most affordable for the job." In other words, an intelligent routing architecture matches each task to the most cost-effective specialized model instead of defaulting to a one-size-fits-all, high-cost one.

Chamath Palihapitiya reinforces this efficiency-first perspective by stressing the importance of local processing. “I think the future of this is going to be local models running on extraordinary desktop hardware,” he says. This approach not only offers potential cost savings by reducing reliance on cloud-based tokens but also delivers significant privacy advantages, keeping sensitive user data off remote servers.

Who's Right (and When They're Wrong)

Benioff and Palihapitiya articulate a more realistic and actionable vision for multi-sensory AI than the straight-line cost extrapolation implied by Calacanis's description. Calacanis accurately identifies how much data constantly observing systems can generate, but he overlooks the engineering response that such a challenge invites. Assuming that token cost rises in direct proportion to data input, in a field evolving as quickly as AI, is too simple.

Benioff is right: the smart money isn't on absorbing higher costs by scaling up today's LLM pricing. It's on building systems that don't treat every bit of sensory input as equally valuable, or equally deserving of a trip to the most expensive general-purpose model. Think of a well-factored microservice architecture versus a monolithic application. Small, specialized models for specific sensory inputs, such as detecting a particular gesture or filtering background noise, can pre-process, filter, or even handle requests entirely. That leaves only the truly complex, ambiguous tasks for the larger multi-modal models. This hybrid approach will be crucial for managing both latency and cost, turning a potential token tsunami into a manageable stream.
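A minimal sketch of what such an intermediary layer could look like, under loud assumptions: the tier names, the 1-to-10 complexity scale, and the prices are all hypothetical, and a real router would need some way to estimate task complexity, which is simply given as an input here. This is not Benioff's design or any vendor's API, only an illustration of "cheapest model that can handle the job."

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str                  # hypothetical tier label
    max_complexity: int        # highest task complexity it handles (made-up 1-10 scale)
    cost_per_1k_tokens: float  # made-up illustrative price


@dataclass(frozen=True)
class Task:
    description: str
    complexity: int  # ASSUMPTION: supplied by some upstream classifier
    tokens: int


# Hypothetical tiers, from a free local model up to an expensive frontier model.
TIERS = [
    ModelTier("local-small", max_complexity=3, cost_per_1k_tokens=0.0),
    ModelTier("hosted-mid", max_complexity=6, cost_per_1k_tokens=0.5),
    ModelTier("frontier-large", max_complexity=10, cost_per_1k_tokens=10.0),
]


def route(task: Task, tiers: list[ModelTier] = TIERS) -> ModelTier:
    """Return the cheapest tier capable of handling the task."""
    capable = [t for t in tiers if t.max_complexity >= task.complexity]
    if not capable:
        raise ValueError(f"no tier can handle complexity {task.complexity}")
    return min(capable, key=lambda t: t.cost_per_1k_tokens)


def cost(task: Task, tier: ModelTier) -> float:
    """Cost of running the task on the given tier."""
    return task.tokens / 1000 * tier.cost_per_1k_tokens


if __name__ == "__main__":
    for task in (Task("detect a hand gesture", 2, 1000),
                 Task("interpret an ambiguous scene", 9, 2000)):
        tier = route(task)
        print(task.description, "->", tier.name, cost(task, tier))
```

The design choice worth noticing is that the expensive tier is the fallback, not the default: a task reaches it only when no cheaper tier qualifies.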

However, Benioff's vision may be hard to realize in the early stages of a new multi-sensory application. Building these "intermediary layers" and specialized models is a complex engineering problem that demands significant upfront investment in design and development. For founders pushing the frontier, initial costs may well be higher before these efficiencies are realized. But the viable long-term path will almost certainly involve this kind of intelligent routing.

What to Do With This

If you're building in AI, stop assuming future token costs will scale linearly with data input. Instead, start designing your multi-sensory AI architecture today with Benioff's "intermediary layers" in mind. This week, task your technical lead with identifying which sensory inputs can be pre-processed or handled by smaller, specialized models before they ever hit a massive multi-modal API. Explore open-source or commercial options for local inference on powerful edge devices to minimize cloud costs and enhance privacy, particularly for sensitive user data. Your goal isn't to absorb massive token bills; it's to engineer around them.
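As one concrete example of pre-processing before a multi-modal API call, here is a sketch of a change-detection gate that drops frames that barely differ from the last one sent. Frames are modeled as flat lists of pixel intensities, and the function names and the threshold value are hypothetical; a production system would likely use a proper vision library and a tuned threshold.

```python
def mean_abs_diff(a: list[int], b: list[int]) -> float:
    """Average absolute per-pixel difference between two equal-length frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def frames_worth_sending(frames: list[list[int]],
                         threshold: float = 10.0) -> list[int]:
    """Return indices of frames that differ enough from the last frame sent.

    The first frame is always kept. The threshold is an illustrative
    assumption, not a recommended value.
    """
    kept: list[int] = []
    last_sent: list[int] | None = None
    for index, frame in enumerate(frames):
        if last_sent is None or mean_abs_diff(frame, last_sent) > threshold:
            kept.append(index)
            last_sent = frame
    return kept


if __name__ == "__main__":
    frames = [
        [0, 0, 0, 0],          # first frame: always sent
        [1, 0, 0, 1],          # nearly identical: skipped
        [100, 100, 100, 100],  # large change: sent
        [101, 100, 99, 100],   # nearly identical: skipped
        [0, 0, 0, 0],          # large change: sent
    ]
    print(frames_worth_sending(frames))
```

In this toy input only three of the five frames would reach a paid model. On a mostly static desktop or webcam feed, the share of skipped frames could be much higher, though that depends on the workload.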