Key Takeaways

  • Gemini 3.1's 50% price jump for context lengths over 200,000 tokens isn't arbitrary; it signals a hard constraint in the underlying hardware: memory bandwidth and capacity, not just compute.
  • The fact that input (prefill) tokens can cost 5 times less than output (decode) tokens indicates that the decode phase of LLM inference is bottlenecked by memory bandwidth, not floating-point throughput.
  • Pricing differences between cache hits and misses suggest large AI providers use a hierarchy of memory tiers (such as HBM, DDR, and Flash), with cost reflecting how fast each tier is and how long data is held in it.
  • LLM API pricing acts as a transparent window into the physical, engineering, and operational constraints of running massive AI models in cluster environments.

The 200,000 Token Cliff: Why Context Gets Costly

Ever wonder why some LLM APIs charge you more per token for longer contexts? Reiner Pope, CEO of MatX, breaks it down. He points to public pricing, like Gemini 3.1's structure, where costs jump significantly above a certain context length. Dwarkesh Patel notes, “With longer context, Gemini 3.1 is 50% more expensive if you go over 200k tokens than if you're below 200k tokens.” This isn't a random decision. Pope explains it's a direct reflection of hardware limits.

“The primary things that limit you to really large contexts are memory bandwidth and memory capacity, which is exactly this effect,” Pope says. When you push past that 200,000 token mark, the model has to reach deeper into slower memory or move data more frequently, and the price reflects that penalty. It’s not just about the raw computational power of the GPUs; it's about how fast data can move to and from them. Pricing tiered by context length is a signal that memory bandwidth, not pure FLOPs, becomes the bottleneck in large-context scenarios.
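To make the cliff concrete, here's a minimal sketch of how such tiered pricing might be computed. The 200,000-token threshold and the 50% surcharge come from the quote above; the base rate, and the assumption that the surcharge applies to the whole request, are illustrative placeholders rather than Gemini's published pricing.

```python
# Hypothetical tiered input pricing, loosely modeled on the structure described above.
LONG_CONTEXT_THRESHOLD = 200_000   # tokens, from the quoted pricing tier
BASE_INPUT_RATE = 1.00             # $ per 1M input tokens (assumed placeholder)
LONG_CONTEXT_MULTIPLIER = 1.5      # the "50% more expensive" jump

def input_cost(prompt_tokens: int) -> float:
    """Estimate prompt cost, applying the long-context surcharge once the
    request crosses the threshold (assumed to apply to the whole prompt)."""
    rate = BASE_INPUT_RATE
    if prompt_tokens > LONG_CONTEXT_THRESHOLD:
        rate *= LONG_CONTEXT_MULTIPLIER
    return prompt_tokens / 1_000_000 * rate

print(input_cost(150_000))  # below the cliff
print(input_cost(250_000))  # above the cliff: every token billed at the higher rate
```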

Prefill vs. Decode: The Memory Bandwidth Divide

Another striking insight from LLM API pricing comes from the difference between input (prefill) and output (decode) token costs. You often pay a fraction for input tokens compared to what you pay for generated output. Pope highlights this disparity: “The fact that they are charging 5x less for prefill than decode does suggest that they are bottlenecked on memory bandwidth to quite a degree.”

Here’s why: Prefill, when the model processes your prompt, can often be done very efficiently by spreading the work across many compute units in parallel. It's largely a compute-bound operation. Decode, however, is sequential. The model generates one token at a time, feeding each new token back in to produce the next. Every step requires streaming the model's weights and the accumulated context (the KV cache) from memory, making it a memory-bandwidth-bound operation. The substantial price difference is a direct pass-through of the underlying hardware cost, where memory access, not raw processing, dictates performance and expense for generating text.
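A back-of-the-envelope sketch makes this asymmetry concrete. The model size, prompt length, and bandwidth figures below are made-up round numbers rather than measurements of any real system, and the sketch ignores cross-request batching and compute limits during prefill; the point is simply that decode throughput is roughly bandwidth divided by bytes read per generated token, while prefill amortizes one pass over the weights across the whole prompt.

```python
# Roofline-style estimate; all numbers are illustrative assumptions.
WEIGHT_BYTES = 140e9      # e.g. a ~70B-parameter model at 2 bytes per weight
HBM_BANDWIDTH = 3.0e12    # bytes/sec of accelerator memory bandwidth (assumed)
PROMPT_TOKENS = 8_000     # prompt length processed together during prefill

# Decode: each generated token re-reads the weights (plus the growing KV cache,
# ignored here), so per-sequence throughput is capped by memory bandwidth.
decode_tokens_per_sec = HBM_BANDWIDTH / WEIGHT_BYTES

# Prefill: one pass over the weights is shared by the whole prompt, so the
# effective per-token memory cost shrinks by the prompt length.
prefill_tokens_per_sec = HBM_BANDWIDTH / (WEIGHT_BYTES / PROMPT_TOKENS)

print(f"decode  ~ {decode_tokens_per_sec:,.0f} tokens/sec (bandwidth-limited)")
print(f"prefill ~ {prefill_tokens_per_sec:,.0f} tokens/sec (bandwidth view only)")
```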

Cache Hits and Multi-Tiered Memory

LLM providers manage vast amounts of data—model weights, user prompts, generated outputs, and various intermediate states. The cost of a cache hit versus a cache miss offers another peek behind the curtain. A cache miss means the necessary data isn't readily available and must be recomputed or fetched from slower storage. “A cache miss means you've deleted it from all your memories, and you have to recompute it from the tokens directly,” Pope clarifies.

This pricing structure hints that providers use different memory tiers, much like your computer has fast RAM and slower SSD storage. “I think this will probably end up being the drain time of the memory tier that you're in,” Pope speculates. This implies a strategic choice: if data needs to be held for a short period (e.g., during active inference), it sits in expensive, fast High Bandwidth Memory (HBM). If it needs to be held longer but accessed less frequently, it might move to slower, cheaper DDR memory, or even Flash storage or spinning disks for very long-term, cold storage. Each tier has an associated cost and a "drain time"—how long data can efficiently stay there before it's too expensive or too slow to retrieve.
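One way to picture the economics is a tiny cost model that compares holding a cached context in a given tier against recomputing it on a miss. The tier names come from the discussion above; every number below (per-gigabyte-second costs, cache size, recompute cost) is a made-up placeholder, not real provider data.

```python
# Hypothetical per-tier holding costs; none of these are real provider figures.
TIER_COST_PER_GB_SEC = {"HBM": 1e-4, "DDR": 1e-5, "Flash": 1e-6}

def cache_hold_cost(gigabytes: float, seconds: float, tier: str) -> float:
    """Cost of keeping a cached context resident in one memory tier."""
    return gigabytes * seconds * TIER_COST_PER_GB_SEC[tier]

def cache_miss_cost(prompt_tokens: int, cost_per_token: float = 1e-6) -> float:
    """Cost of a miss: recomputing the cache by re-running prefill over the tokens."""
    return prompt_tokens * cost_per_token

kv_cache_gb = 2.0  # assumed size of the cached context
print(cache_hold_cost(kv_cache_gb, 30, "HBM"))      # short hold in fast memory
print(cache_hold_cost(kv_cache_gb, 3600, "Flash"))  # long hold in cheap storage
print(cache_miss_cost(100_000))                     # price of recomputing instead
```

The crossover between holding the cache somewhere and simply recomputing it is, in this view, what a provider's cache-retention pricing ultimately encodes.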

What to Do With This

If you're building an application on top of LLM APIs, internalize that every token has a real-world hardware cost rooted in memory and bandwidth. For tasks requiring long contexts, consider strategies like summarization or retrieval-augmented generation to reduce token count and avoid those steep context length price cliffs. When designing prompts, optimize for output token efficiency, perhaps by constraining the model's generation length or using structured outputs to minimize verbose responses, as output (decode) tokens are your most expensive variable.
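As a practical starting point, here is a small estimator you might adapt to your own provider's published rates to decide when summarizing a long context pays off. The rates, the 5x input/output gap, and the surcharge are placeholders drawn from the discussion above; the break-even depends on how many follow-up queries reuse the summary.

```python
# Illustrative rates; substitute your provider's real pricing.
INPUT_RATE = 1.0        # $ per 1M input tokens (assumed)
OUTPUT_RATE = 5.0       # $ per 1M output tokens (assumed ~5x prefill/decode gap)
LONG_CTX_THRESHOLD = 200_000
LONG_CTX_MULTIPLIER = 1.5

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate one request's cost, with the long-context surcharge."""
    mult = LONG_CTX_MULTIPLIER if input_tokens > LONG_CTX_THRESHOLD else 1.0
    return mult * (input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE) / 1e6

def worth_summarizing(full_ctx: int, summary_ctx: int, summary_len: int,
                      answer_len: int, num_queries: int) -> bool:
    """Compare answering every query over the full context against paying
    once to summarize it, then answering queries over the shorter summary."""
    direct = num_queries * request_cost(full_ctx, answer_len)
    summarized = (request_cost(full_ctx, summary_len)
                  + num_queries * request_cost(summary_ctx, answer_len))
    return summarized < direct

# Summarizing a 250k-token context down to 20k pays off once it is reused
# across a few queries and drops each request under the price cliff.
print(worth_summarizing(250_000, 20_000, 2_000, 1_000, num_queries=1))  # False
print(worth_summarizing(250_000, 20_000, 2_000, 1_000, num_queries=3))  # True
```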