Malde: Ditch Binary Rewards, Guide AI With Text via SDPO

Ronak Malde, co-founder of Trajectory.ai, faced a hard truth about building AI products: the models we train today are dead on arrival. Static. They don't learn from real-world interactions. In a world demanding constant adaptation, Malde argues that the prevailing training method for frontier models, Reinforcement Learning (RL), simply doesn't cut it for continual learning.

RL, in its simplest form, takes all the rich, complex user interactions and boils them down to a single reward number. “It could be a judge, it could be like something from production,” Malde explains, “and then you basically take that number and then use that to update and say like hey this entire trajectory was really good. This entire trajectory is really bad.” While this works for initial training, it’s fundamentally "broken" for a dynamic system that needs to adapt from nuanced feedback. The problem? "It's still taking all of this kind of useful information from the real world like I mentioned all the corrections and everything and putting it into just one number," Malde says.
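To make that compression concrete, here is a minimal sketch of trajectory-level credit assignment. It assumes a REINFORCE-style update in which one scalar weights every token of the response. The function name, the example reply, and the reward value are all hypothetical and chosen only for illustration.

```python
from typing import List

def scalar_reward_weights(tokens: List[str], reward: float) -> List[float]:
    """Broadcast one trajectory-level reward to every token in the response."""
    return [reward for _ in tokens]

# Hypothetical agent reply in which only the final token (the price) is wrong.
trajectory = ["Your", "flight", "to", "New", "York", "costs", "$120"]

# A judge scores the whole reply as bad, so every token gets the same weight.
weights = scalar_reward_weights(trajectory, reward=-1.0)
```

Every token receives the same weight, so the update cannot tell the correct phrasing apart from the one wrong price. That lost detail is what Malde is pointing at.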

Trajectory.ai's answer is Self-Distillation Policy Optimization (SDPO). It flips the script, moving beyond binary rewards to guide models with explicit textual feedback. Malde and his team have not only proven this in academic settings but scaled it to real-world applications, showing significant gains on benchmarks like Apex agents.

Key Takeaways

Traditional Reinforcement Learning (RL) crumbles in continual learning scenarios because it compresses rich, real-world user interactions into a single, insufficient reward number.
Self-Distillation Policy Optimization (SDPO) overcomes this by using "privileged information" to guide a "teacher" model, which then distills explicit textual guidance to a "student" model.
This method allows AI to learn from descriptive feedback, not just binary signals, leading to faster convergence and more robust, adaptive performance.
Unlike traditional distillation, the SDPO student model can actually be the smartest model; the teacher's role is enhanced by temporary privileged information to provide better guidance.
Trajectory.ai, co-founded by Ronak Malde, has successfully scaled SDPO to real-world use cases, demonstrating its practical advantages for building self-learning AI systems.

The Self-Distillation Policy Optimization (SDPO) Framework

Here's how Malde's team at Trajectory.ai implements SDPO to make AI systems truly self-learning:

Student Rollout: The student model attempts the task. In classic distillation the student is the less capable model, but in self-distillation the student can be the smartest model available. For example, a user asks, "How much is my flight ticket to New York?" and the agent looks up some information and returns the wrong answer.
Teacher Hint (Privileged Information): To make the teacher smarter, privileged information (a hint) is placed in the teacher's context. The hint is derived from hidden production information, such as the correct ticket information.
Match Student Log Probs to Teacher: The student's log probs are then matched to those of the hint-conditioned teacher.
Guiding the Model with Text: The training signal is no longer just a binary reward but actual text that guides the model in a specific direction. Malde calls this "a huge unlock" of SDPO.
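The log-prob matching step can be sketched on a toy example. The source does not spell out the exact loss Trajectory.ai uses, so this sketch assumes a KL divergence from a fixed, hint-conditioned teacher to the student at a single token position. The two-token vocabulary and the hard-coded logits are hypothetical stand-ins for what a real model would produce with and without the hint in context.

```python
import math
from typing import List

VOCAB = ["$120", "$450"]  # hypothetical two-token vocabulary for the price slot

def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one token position."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]

def kl_teacher_to_student(teacher_logits: List[float],
                          student_logits: List[float]) -> float:
    """KL(teacher || student): the assumed distillation loss for this sketch."""
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax(student_logits)
    return sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))

def distill_step(student_logits: List[float],
                 teacher_logits: List[float],
                 lr: float = 0.5) -> List[float]:
    """One gradient step on the KL with respect to the student logits.

    The gradient of KL(teacher || student) with respect to the student
    logits is softmax(student) - softmax(teacher). The teacher is held fixed.
    """
    s_prob = [math.exp(x) for x in log_softmax(student_logits)]
    t_prob = [math.exp(x) for x in log_softmax(teacher_logits)]
    return [z - lr * (s - t) for z, s, t in zip(student_logits, s_prob, t_prob)]

# Student without the hint prefers the wrong price (assumed logits).
student_logits = [2.0, 0.0]
# Same model with the correct ticket information in context (assumed logits).
teacher_logits = [0.0, 3.0]

kl_before = kl_teacher_to_student(teacher_logits, student_logits)
for _ in range(50):
    student_logits = distill_step(student_logits, teacher_logits)
kl_after = kl_teacher_to_student(teacher_logits, student_logits)

student_choice = VOCAB[student_logits.index(max(student_logits))]
```

In this toy, the student shifts toward the price the teacher prefers, and it never sees the hint itself. A real implementation would compute both sets of log probs from a language model over every token of the rollout, typically with an autodiff framework rather than a hand-written gradient.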

When This Works (and When It Doesn't)

SDPO shines precisely where traditional RL falls short: continual learning systems that demand granular, evolving feedback. If your AI product needs to constantly adapt from user corrections, explicit instructions, or dynamically changing data, and those insights are best expressed textually, SDPO is a strong fit. Malde's point is clear: when an entire user journey's complexity is reduced to a single reward number, critical learning opportunities are lost. SDPO thrives on that richness, allowing a model to learn from specific "corrections and everything" rather than a blunt numerical signal. It has been explored academically, and Malde says Trajectory.ai has scaled it to real-world use cases, with gains on benchmarks like Apex agents.

This framework may present challenges when the "privileged information" needed for the teacher is impossible or prohibitively expensive to obtain in real time. If your problem domain is inherently simple, and a basic numerical reward truly captures all necessary feedback (e.g., in some game environments or highly constrained control tasks), SDPO might be overkill. But for complex, human-facing AI that learns through language, SDPO offers a path to rapid, precise adaptation.

What to Do With This

If you're building an AI-powered sales assistant that helps your team craft custom pitches and needs to learn from successful (and failed) interactions, try implementing SDPO this week. Your current student model might be generating decent but generic pitches. For your next iteration, define a source of "privileged information": maybe a human sales expert who manually corrects and refines a pitch, or a backend system that generates a stronger pitch based on CRM data. Use this as your Teacher Hint. Instead of simply telling the model "that pitch got a good score," guide it with text by giving the teacher explicit feedback like, "Your pitch was good, but it missed the client's budget constraint mentioned in the pre-call notes. Focus more on cost-saving features." Then match the student's log probs to the teacher's to align your assistant's future outputs with this rich, specific guidance, so that every interaction yields a detailed learning signal rather than a single score.
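As a starting point, the two contexts for the sales assistant could be built as follows. This is a minimal sketch: the prompt layout, the function names, and the decision to show the teacher the previous draft alongside the feedback are all assumptions for illustration, not a format described by Malde.

```python
def build_student_prompt(request: str) -> str:
    """Context the student sees: the request only, no privileged information."""
    return f"Request: {request}\nPitch:"

def build_teacher_prompt(request: str, draft: str, feedback: str) -> str:
    """Context the teacher sees: the same request plus the hint.

    Here the hint is the expert's textual feedback on the earlier draft.
    """
    return (
        f"Request: {request}\n"
        f"Previous draft: {draft}\n"
        f"Expert feedback: {feedback}\n"
        "Pitch:"
    )

# Hypothetical example data.
request = "Write a pitch for a mid-size logistics client."
draft = "Our platform offers best-in-class analytics."
feedback = (
    "Your pitch was good, but it missed the client's budget constraint "
    "mentioned in the pre-call notes. Focus more on cost-saving features."
)

student_prompt = build_student_prompt(request)
teacher_prompt = build_teacher_prompt(request, draft, feedback)
```

Both prompts end at the same point, so the teacher's and student's log probs can be compared token by token over the same pitch. Only the teacher's context contains the feedback.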