On the Dwarkesh Podcast, Reiner Pope, CEO of MatX, offered a sharp lesson for anyone building on LLMs: your inference cost per token can be a thousand times worse than necessary if you ignore batch size. This isn't just about saving money; it's about making your product viable. Pope laid out the math behind the trade-off, showing how a smarter approach to batching fundamentally changes your unit economics.
Key Takeaways
- Cost Skyrockets Without Batching: If you don't batch user requests for LLM inference, your per-token cost can be “a thousand times worse” than an optimized system's. The culprit is unamortized weight fetches: every decode step reads the full model weights from memory no matter how few tokens it serves.
- Latency vs. Cost is the Core Tension: Larger batch sizes cut cost per token by amortizing each weight fetch across many users' tokens, but they increase latency for individual requests. Batching moves inference from a memory-bound regime into a more efficient compute-bound one.
- Roofline Analysis Quantifies the Trade-Off: Pope uses a roofline analysis to derive a lower bound on cost per token, showing the shift from highly inefficient small batches to cost-effective large ones (sketched in code after this list).
- Sparsity Changes Everything: For Mixture-of-Experts (MoE) models, the sparsity factor (the ratio of total to active parameters) directly sets your optimal batch size. More sparsity means you need a larger batch size to achieve similar efficiency.
- Calculate Your Optimal Batch Size: Use the Optimal LLM Inference Batch Size Estimation framework to find the sweet spot where memory bandwidth and FLOPs are balanced, setting a floor for your per-token cost.
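To make the “thousand times worse” claim concrete, here is a minimal roofline sketch in Python. The hardware and price constants are illustrative assumptions (roughly H100-class specs and a hypothetical $2/hour rate), not figures from the podcast:

```python
# Roofline sketch: lower-bound decode cost per token as a function of batch size.
# All hardware/price constants are assumptions, not numbers from the podcast.

PEAK_FLOPS = 1.0e15           # ~1 PFLOP/s of bf16 compute (H100-class, assumed)
MEM_BW = 3.35e12              # ~3.35 TB/s of HBM bandwidth (assumed)
COST_PER_SECOND = 2.0 / 3600  # assumed $2/hour accelerator price

def cost_per_token(batch_size, active_params, total_params, bytes_per_param=2):
    """Roofline lower bound on decode cost per generated token."""
    flops_per_step = 2 * batch_size * active_params  # matmul FLOPs for one decode step
    bytes_per_step = total_params * bytes_per_param  # full weights are fetched every step
    compute_time = flops_per_step / PEAK_FLOPS
    memory_time = bytes_per_step / MEM_BW
    step_time = max(compute_time, memory_time)       # roofline: the slower side wins
    return COST_PER_SECOND * step_time / batch_size  # amortized across the batch

# Dense 70B model: batch 1 vs. batch 512.
dense = dict(active_params=70e9, total_params=70e9)
ratio = cost_per_token(1, **dense) / cost_per_token(512, **dense)
print(f"batch 1 is {ratio:.0f}x more expensive per token")  # ~300x on these specs
```

On these assumed specs the gap is about 300x; with slower memory, lower utilization, or a sparser model, the gap between a naive batch-1 deployment and a well-batched one stretches toward the thousand-fold figure Pope cites.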
The Optimal LLM Inference Batch Size Estimation Framework
Reiner Pope shared a framework for understanding and estimating the optimal batch size needed for efficient LLM inference. This helps you balance the memory bandwidth of fetching model weights with the computational FLOPs for processing data.
- Hardware Parameter (FLOPs / Memory Bandwidth): the accelerator's ratio of peak FLOPs to memory bandwidth, roughly 300 FLOPs per byte on modern chips. During decode, your achieved arithmetic intensity is approximately batch size times active parameters divided by total parameters, and it must exceed this hardware ratio for the chip to be compute-bound rather than memory-bound.
- Sparsity Parameter: total parameters divided by active parameters (1 for a dense model, 8 in the DeepSeek example below).
- Optimal Batch Size Formula: batch size needs to be bigger than approximately 300 times the sparsity parameter, as the sketch below derives.
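Here is that inequality written out as code. It is a sketch under assumed H100-class numbers (peak bf16 FLOPs, HBM bandwidth, 2-byte weights), not anything specific to Pope's hardware at MatX:

```python
def min_compute_bound_batch(peak_flops, mem_bw, active_params, total_params,
                            bytes_per_param=2):
    """Smallest batch size at which decode becomes compute-bound.

    Roofline inequality: compute time (2 * B * active / peak_flops) must
    exceed weight-fetch time (total * bytes / mem_bw), which rearranges to
    B >= (peak_flops / mem_bw) * (bytes_per_param / 2) * (total / active).
    """
    hw_ratio = peak_flops / mem_bw           # FLOPs per byte, ~300 on modern chips
    sparsity = total_params / active_params  # 1 for dense models
    return hw_ratio * (bytes_per_param / 2) * sparsity

# Assumed H100-class specs: ~1 PFLOP/s bf16, ~3.35 TB/s HBM.
print(min_compute_bound_batch(1.0e15, 3.35e12, 70e9, 70e9))   # dense: ~299
print(min_compute_bound_batch(1.0e15, 3.35e12, 32e9, 256e9))  # sparsity 8: ~2,388
```

For a dense model the formula collapses to the hardware ratio itself, which is where the ~300 constant comes from.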
When This Works (and When It Doesn't)
This framework helps you estimate a baseline optimal batch size for LLM inference, especially for MoE models where sparsity is a factor. Pope points to models like DeepSeek, where activating 32 out of 256 experts gives a sparsity of 8; plugging that into the formula yields a batch size of roughly 2,400, a remarkably accurate ballpark in practice. Pope suggests going somewhat larger than the formula's direct output, perhaps double or triple it, to account for real-world inefficiencies that a purely theoretical roofline analysis misses.
This estimation works best when you have consistent traffic patterns that allow for batching. It becomes less effective, or even misleading, when your traffic is extremely bursty or individual user latency is paramount. If you have only a few sporadic requests, you'll inherently operate at smaller batch sizes, sacrificing cost efficiency for instant response. The framework reveals the cost of that low-latency choice.
What to Do With This
If you're launching an AI service and designing its pricing tiers or backend infrastructure, this framework is your first sanity check. Let's say you're building a content generation API powered by a DeepSeek-like MoE model. You activate 32 out of 256 experts.
1. Calculate Sparsity: Your sparsity parameter is 256 (total experts) / 32 (active experts) = 8.
2. Estimate the Optimal Batch Size: Using Pope's formula, your batch size “needs to be bigger than approximately 300 times sparsity.” So, 300 * 8 = 2,400.
3. Adjust for Reality: Pope advises doubling or tripling this for real-world efficiency, so aim for a batch size of roughly 4,800 to 7,200. The snippet below runs the numbers.
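The same arithmetic as a three-line sanity check, using the model numbers from the example above:

```python
sparsity = 256 / 32                            # total experts / active experts = 8
base_batch = 300 * sparsity                    # Pope's rule of thumb: 2,400
real_world = (2 * base_batch, 3 * base_batch)  # double to triple for real-world slack
print(base_batch, real_world)                  # 2400.0 (4800.0, 7200.0)
```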
This means that if you're consistently serving far fewer than about 2,400 requests per batch, you are paying significantly more per token than necessary. You might offer a "premium, low-latency" tier at batch size 10 (knowing it's expensive) and a "standard, cost-effective" tier that waits to build batches into the several-thousand range. This gives you concrete numbers to define your backend queuing strategy and pricing, avoiding the "thousand times worse" economics Pope warns about.
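To make the queuing strategy concrete, here is one hypothetical shape it could take: a micro-batcher that flushes when either the batch target is hit or a latency budget expires. The class name and knob values are illustrative, not from the podcast:

```python
import queue
import time

class MicroBatcher:
    """Collects requests and flushes when the batch fills or a deadline passes."""

    def __init__(self, target_batch=2400, max_wait_s=0.2):
        self.target_batch = target_batch  # batch size the tier is engineered for
        self.max_wait_s = max_wait_s      # latency budget for building a batch
        self.requests = queue.Queue()

    def submit(self, request):
        self.requests.put(request)

    def next_batch(self):
        batch = [self.requests.get()]  # block until the first request arrives
        deadline = time.monotonic() + self.max_wait_s
        while len(batch) < self.target_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break                  # budget spent: ship a partial batch
            try:
                batch.append(self.requests.get(timeout=remaining))
            except queue.Empty:
                break
        return batch
```

The premium tier would instantiate this with a small target_batch and a tight max_wait_s, eating the cost floor for speed; the standard tier would use a target in the thousands and a looser deadline, trading wait time for the per-token economics derived above.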