Nebius Token Factory: Cut AI Inference Costs by Up to 70% with Open-Source Models

Key Takeaways

Enterprises often hit a wall trying to deploy promising open-source AI models, despite their potential for lower costs and greater flexibility. The "plumbing" of optimization and deployment is complex.
Nebius Token Factory offers managed inference, abstracting away this complexity and making open-source models viable for production workloads.
Through techniques like model distillation, caching, and speculative decoding, Token Factory can reduce AI inference costs by up to 70% compared with frontier models.
Companies like Revolut, which once spent 99% of its inference budget on closed models, are now shifting to open-source models by building internal experimentation engines.
The real prize isn't just cheaper inference, but the ability to quickly tune and evolve models with proprietary data, accelerating product growth for specific use cases.

The Method: Making Open-Source AI Inference Work for Real Businesses

Many founders eye open-source AI models as a path to lower costs and greater control, believing they can simply download weights from Hugging Face and watch the magic happen. Roman Chernin, co-founder of Nebius, pulls back the curtain on this fantasy. As he puts it, you might read that “there are a lot of great open source models that on the benchmarks are close to OpenAI…and you think, oh great, it will be 10 times cheaper, inference is cheaper.” The reality, Chernin says, is that you “take the weights…and then it doesn't work.”

The problem isn't the models themselves; it's the operational headache, the unseen "plumbing," of making them reliable, cost-efficient, and scalable for enterprise use. Businesses demand reliability and ease, but deploying open-source models requires deep expertise in optimization. Nebius Token Factory steps in to bridge this gap, offering a managed inference platform specifically designed for open-source and specialized models. This means enterprises can run existing vanilla open-source models or deploy their own fine-tuned weights, while Nebius handles the backend complexity.

Chernin details the specific optimization techniques that drive the cost savings. These include model distillation, which creates smaller models meant to maintain quality on the target use case; caching strategies; and speculative decoding, in which a small draft model proposes tokens that the larger model then verifies. By applying these methods, Token Factory can deliver up to a 70% reduction in inference costs. This isn't just about saving money; it's about enabling rapid model evolution and supporting a wider range of use cases that were previously uneconomical with expensive, closed-source models.
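To make one of these techniques concrete, here is a minimal toy sketch of greedy speculative ("spec") decoding. It is not Nebius's implementation, which the source does not describe. The function names, the integer "tokens," and the two stand-in models are all assumptions made for illustration. The idea it shows: a cheap draft model guesses several tokens ahead, the expensive target model checks them, and the output is identical to what the target model would have produced alone, only with fewer target passes.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy "model": context -> next token


def greedy_decode(model: NextToken, prompt: List[Token], n: int) -> List[Token]:
    """Baseline: one model call per generated token."""
    ctx = list(prompt)
    for _ in range(n):
        ctx.append(model(ctx))
    return ctx[len(prompt):]


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[Token], n: int, k: int = 4
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, number of verification passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n:
        # 1. The cheap draft model proposes k tokens autoregressively.
        ctx = list(out)
        proposal = []
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target verifies the proposal. A real system scores all k
        #    positions in one batched forward pass; here we count it as one pass.
        passes += 1
        ctx = list(out)
        for tok in proposal:
            expected = target(ctx)
            ctx.append(expected)  # always keep the target's own token
            if expected != tok:
                break  # first mismatch: discard the rest of the proposal
        else:
            ctx.append(target(ctx))  # all k accepted: target adds a bonus token
        out = ctx
    return out[len(prompt):len(prompt) + n], passes


# Hypothetical stand-in models: the target counts upward mod 10;
# the draft agrees except after token 2, where it guesses wrong.
target_model: NextToken = lambda ctx: (ctx[-1] + 1) % 10
draft_model: NextToken = lambda ctx: 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10

tokens, verification_passes = speculative_decode(target_model, draft_model, [0], n=8)
baseline = greedy_decode(target_model, [0], 8)
```

In this toy run the speculative output matches the baseline exactly while using fewer verification passes than the eight target calls plain decoding needs. Real savings depend on how often the draft model agrees with the target.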

The Revolut case study illustrates this shift. Initially, 99% of Revolut's inference budget went to closed models such as OpenAI's. As Chernin explains, the company then started to tackle use cases where these models weren't economically viable. By building an internal experimentation engine to solve the "cold start problem" of open-source deployment, Revolut saw its adoption of open-source models begin to grow exponentially. This allowed the company to tailor AI to its specific product needs, fostering growth at a pace comparable to AI-native companies.

Where This Breaks Down

While Token Factory's approach offers compelling advantages, it's not a silver bullet for every scenario. This method works best when you have clear, defined use cases where open-source models, potentially with fine-tuning, can meet or exceed performance requirements. It might be less immediately applicable for companies exploring extremely novel, frontier AI capabilities that, for now, are exclusive to the largest closed-source models and their proprietary data. Additionally, while Nebius abstracts away much of the deployment complexity, businesses still need to invest in understanding which open-source models best fit their specific data and product needs. The promise of cost savings and tunability comes with the responsibility of identifying the right model-to-problem fit.
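One way to approach the model-to-problem fit question is to score each candidate on a small labeled sample from your own use case and pick the cheapest one that clears your quality bar. The sketch below assumes exactly that. The candidate names, prices, keyword "models," and the 0.9 threshold are all hypothetical placeholders, not real benchmarks or real pricing.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Example = Tuple[str, str]  # (input text, expected label)


@dataclass
class Candidate:
    name: str
    predict: Callable[[str], str]   # stand-in for a call to a deployed model
    usd_per_million_tokens: float   # hypothetical price, not a real quote


def accuracy(predict: Callable[[str], str], examples: List[Example]) -> float:
    """Fraction of examples where the model's output equals the expected label."""
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)


def cheapest_passing(
    candidates: List[Candidate], examples: List[Example], min_accuracy: float
) -> Optional[Candidate]:
    """Return the lowest-priced candidate that meets the quality bar, if any."""
    passing = [c for c in candidates if accuracy(c.predict, examples) >= min_accuracy]
    return min(passing, key=lambda c: c.usd_per_million_tokens, default=None)


# Toy labeled sample and keyword-based stand-ins for real models.
examples = [
    ("great product", "pos"),
    ("terrible support", "neg"),
    ("love it", "pos"),
    ("not good", "neg"),
]
candidates = [
    Candidate("closed-frontier",
              lambda t: "neg" if ("terrible" in t or "not" in t) else "pos", 10.0),
    Candidate("open-small-vanilla",
              lambda t: "neg" if "terrible" in t else "pos", 1.0),
    Candidate("open-small-tuned",
              lambda t: "neg" if ("terrible" in t or "not" in t) else "pos", 2.0),
]

choice = cheapest_passing(candidates, examples, min_accuracy=0.9)
print(choice.name if choice else "no candidate meets the bar")
```

Here the cheapest model misses one example and falls below the bar, so the selection lands on the tuned small model rather than the frontier one. With real models, `predict` would wrap an inference call and the sample would come from production data.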

What to Do With This

Take action this week: Audit your current AI inference budget. Identify at least one specific product feature or internal workflow currently relying on an expensive, closed-source model that could potentially be served by a smaller, specialized, or open-source alternative. Instead of an all-or-nothing switch, prototype this single use case with a managed open-source inference solution like Nebius Token Factory. Focus on proving the economics and performance for that specific use case and quantify the cost savings and improvements in model control. This focused experiment can be the "cold start" for broader adoption, just like it was for Revolut.
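For the audit itself, the arithmetic is simple enough to script. The sketch below is a back-of-the-envelope comparison under stated assumptions: the token volume and both per-million-token prices are made up, chosen only so the result lines up with the "up to 70%" figure quoted above. Substitute your own usage data and actual quotes.

```python
def monthly_cost(tokens_per_month: int, usd_per_million_tokens: float) -> float:
    """Monthly spend for a given token volume and per-million-token price."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def savings_fraction(current_cost: float, candidate_cost: float) -> float:
    """Share of current spend saved by switching (0.7 means 70% cheaper)."""
    if current_cost <= 0:
        raise ValueError("current_cost must be positive")
    return (current_cost - candidate_cost) / current_cost


# Hypothetical inputs for one product feature; not real pricing.
TOKENS_PER_MONTH = 500_000_000
CLOSED_USD_PER_M = 10.0   # assumed closed-model price
OPEN_USD_PER_M = 3.0      # assumed managed open-source price

closed_cost = monthly_cost(TOKENS_PER_MONTH, CLOSED_USD_PER_M)
open_cost = monthly_cost(TOKENS_PER_MONTH, OPEN_USD_PER_M)
savings = savings_fraction(closed_cost, open_cost)

print(f"closed: ${closed_cost:,.0f}/mo, open: ${open_cost:,.0f}/mo, savings: {savings:.0%}")
```

Running the same calculation per feature, rather than across the whole budget, makes it easier to spot the single use case worth prototyping first.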