AlphaGo Didn't Wait for Wins: It Learned Per-Move

Key Takeaways

AlphaGo's Monte Carlo Tree Search (MCTS) generates a "strictly better action" for every single move, offering immediate, local feedback, a stark contrast to the sparse rewards common in LLM reinforcement learning.
Large Language Models (LLMs) often rely on policy gradient methods, which Eric Jang, quoting Karpathy, likens to “sucking supervision through a straw” because the reward is delayed and only appears at the end of a long, complex sequence.
This delayed feedback creates a "credit assignment problem" for LLMs: it's incredibly hard to pinpoint which specific action in a long chain was responsible for a good or bad outcome.
Instead of merely mimicking the single best MCTS action, AlphaGo trains its neural network to imitate the distribution of MCTS's improved choices, allowing for better generalization and continuous improvement.

Stop Sucking Supervision Through a Straw

Imagine trying to teach someone to play Go, but only telling them "good game" or "bad game" after 100 moves. That's essentially the problem many large language models face when learning through reinforcement, according to Eric Jang. On the Dwarkesh Podcast, Jang pulled back the curtain on why AlphaGo achieved such incredible sample efficiency, contrasting it sharply with the struggles of LLMs.

LLMs trying to learn complex tasks often use policy gradient methods. The issue? Rewards are sparse. You generate a long text, maybe a few paragraphs, and only at the very end do you get a positive or negative signal. “When Karpathy was on the podcast, he called it like 'sucking supervision through a straw',” Jang noted. This isn't just an inconvenience; it makes learning slow and unstable. How do you figure out which specific word or sentence in a 500-word output led to a good outcome, or a bad one? This is the infamous "credit assignment problem." Sparse rewards also create an exploration problem. Suppose the only rewarded answer is "blue": if your current policy almost never samples "blue," it almost never sees that reward. As Jang put it, "If your policy has no chance of sampling 'blue,' then you will never get a signal. Exactly. That's modeled by the fact that your probability of sampling 'blue' is extremely low."
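The "blue" problem can be made concrete with a minimal sketch. It assumes a one-step softmax policy over a made-up three-word vocabulary and a REINFORCE-style update; the logits, the vocabulary, and the reward rule are hypothetical and chosen only for illustration, not taken from the podcast.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_grad(logits, sampled_index, reward):
    """REINFORCE gradient of reward * log pi(sampled) w.r.t. the logits.

    For a softmax policy this equals reward * (one_hot(sampled) - probs).
    """
    probs = softmax(logits)
    return [
        reward * ((1.0 if i == sampled_index else 0.0) - p)
        for i, p in enumerate(probs)
    ]

def prob_of_any_signal(p_rewarded, num_samples):
    """Chance that at least one of num_samples independent samples is rewarded."""
    return 1.0 - (1.0 - p_rewarded) ** num_samples

# Hypothetical setup: only "blue" earns reward, and the policy rarely picks it.
VOCAB = ["red", "green", "blue"]
logits = [4.0, 4.0, -6.0]
p_blue = softmax(logits)[VOCAB.index("blue")]

# An unrewarded sample contributes nothing; a rewarded one pushes "blue" up.
grad_miss = reinforce_grad(logits, VOCAB.index("red"), reward=0.0)
grad_hit = reinforce_grad(logits, VOCAB.index("blue"), reward=1.0)
chance = prob_of_any_signal(p_blue, num_samples=1000)

print(f"p(blue) = {p_blue:.2e}")
print(f"chance of any reward in 1000 samples = {chance:.3f}")
print("unrewarded sample moves the policy:", any(g != 0 for g in grad_miss))
print("rewarded sample raises the 'blue' logit:", grad_hit[2] > 0)
```

With these made-up numbers the policy picks "blue" only a few times in every hundred thousand samples, so even a thousand samples will most likely contain no reward at all, and every unrewarded sample leaves the policy exactly where it was.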

AlphaGo's Continuous Feedback Loop

AlphaGo's approach, built around Monte Carlo Tree Search (MCTS), sidesteps much of this credit assignment problem. Instead of waiting for a win or loss at the end of an entire game, MCTS evaluates and refines its understanding of the game move by move. For any given board state, MCTS explores future possibilities and then comes back with a better-informed evaluation of the actions available right now. Dwarkesh Patel explained it well: “The reason it's much more preferable to do MCTS is because you can do it per move and make each move better, rather than having to learn per trajectory and hope, as Karpathy said, to learn this… Through a straw.”

This means AlphaGo isn't trying to attribute a win to a specific action from 50 moves ago. It's getting a "strictly better action" for every single move it considers. This provides a continuous, localized, and high-quality supervision signal. It's like having a master Go player whisper the ideal next move into your ear, every time it's your turn, rather than just telling you if you won or lost at the end of the game.
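To show what "a better action per move" looks like mechanically, here is a toy sketch. It is not AlphaGo's search: it uses plain UCT with random rollouts instead of a policy and value network, and it plays a tiny Nim variant (take 1 to 3 stones, whoever takes the last stone wins) instead of Go. The `Node` class, the `mcts` function, and the exploration constant are assumptions made for this example.

```python
import math
import random

class Node:
    def __init__(self, pile, parent=None, move=None):
        self.pile = pile      # stones left for the player about to move
        self.parent = parent
        self.move = move      # stones taken to reach this node
        self.children = []
        self.visits = 0
        self.wins = 0.0       # wins for the player who moved INTO this node

def legal_moves(pile):
    return [m for m in (1, 2, 3) if m <= pile]

def rollout(pile, rng):
    """Random playout; True if the player to move at `pile` takes the last stone."""
    turn = 0  # 0 = the player to move at the starting position
    while pile > 0:
        pile -= rng.choice(legal_moves(pile))
        if pile == 0:
            return turn == 0
        turn ^= 1
    return False  # no stones left: the previous player already won

def mcts(root_pile, simulations, rng, c=1.4):
    """Run UCT from one position and return visit counts per root move."""
    root = Node(root_pile)
    for _ in range(simulations):
        node = root
        # Selection: descend while the node is fully expanded.
        while node.children and len(node.children) == len(legal_moves(node.pile)):
            parent = node
            node = max(
                parent.children,
                key=lambda ch: ch.wins / ch.visits
                + c * math.sqrt(math.log(parent.visits) / ch.visits),
            )
        # Expansion: add one untried move, if any remain.
        tried = {ch.move for ch in node.children}
        untried = [m for m in legal_moves(node.pile) if m not in tried]
        if untried:
            m = rng.choice(untried)
            child = Node(node.pile - m, node, m)
            node.children.append(child)
            node = child
        # Simulation, then backpropagation with alternating perspective.
        result = 0.0 if rollout(node.pile, rng) else 1.0
        while node is not None:
            node.visits += 1
            node.wins += result
            result = 1.0 - result
            node = node.parent
    return {ch.move: ch.visits for ch in root.children}

rng = random.Random(0)
visits = mcts(root_pile=5, simulations=2000, rng=rng)
total = sum(visits.values())
# The normalized visit counts are the per-move training target for this position.
search_policy = {move: n / total for move, n in sorted(visits.items())}
print(search_policy)
```

The point of the sketch is the return value: a single position yields a full set of visit counts, so every move in a game can carry its own supervision target without waiting for the final result. In this Nim variant, leaving the opponent a multiple of four stones is the winning idea, so with enough simulations the search tends to concentrate its visits on taking one stone from a pile of five.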

Training on Distributions, Not Just Actions

Here's where it gets even smarter: AlphaGo doesn't train its policy network to simply imitate the best action MCTS suggests. That would be too narrow. As Jang put it: “In AlphaGo, you don't train the policy network to imitate the MCTS action. You train it to imitate the MCTS distribution.”

What does that mean? MCTS doesn't just spit out one "correct" move; it produces a probability distribution over the candidate moves. In AlphaGo Zero's formulation, that distribution comes from how often the search visited each move, so moves the search found promising get more probability mass. By training the policy network to mimic this distribution, AlphaGo learns a more robust and generalizable strategy. It absorbs how good the second- and third-best options are relative to the best one, rather than becoming rigid and only knowing a single path. That richer target helps the policy cope when its preferred "best" action isn't available, and it helps the policy explore effectively, even in novel situations.
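The difference between imitating the action and imitating the distribution shows up directly in the gradient. The sketch below assumes a softmax policy trained with cross-entropy and uses hypothetical visit counts for four candidate moves; it is an illustration of the idea, not AlphaGo's training code.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, logits):
    """Cross-entropy between a target distribution and the softmax policy."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

def grad_wrt_logits(target, logits):
    """Gradient of the cross-entropy w.r.t. the logits: probs - target."""
    probs = softmax(logits)
    return [p - t for p, t in zip(probs, target)]

# Hypothetical search result: how often each of four moves was visited.
visit_counts = [120, 60, 15, 5]
total = sum(visit_counts)
search_target = [n / total for n in visit_counts]

# "Imitate the action" keeps only the most-visited move.
best = visit_counts.index(max(visit_counts))
one_hot_target = [1.0 if i == best else 0.0 for i in range(len(visit_counts))]

# An untrained policy that treats all four moves as equally likely.
logits = [0.0, 0.0, 0.0, 0.0]
g_dist = grad_wrt_logits(search_target, logits)
g_hot = grad_wrt_logits(one_hot_target, logits)

# Gradient descent moves a logit up when its gradient is negative.
print("distribution target, second-best move gradient:", g_dist[1])
print("one-hot target, second-best move gradient:     ", g_hot[1])
print("loss vs distribution:", cross_entropy(search_target, logits))
print("loss vs one-hot:     ", cross_entropy(one_hot_target, logits))
```

With the distribution target, the second-best move gets a negative gradient, so a descent step raises its probability. With the one-hot target, the same move gets a positive gradient and is pushed down along with the genuinely weak moves, which throws away what the search learned about it.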

What to Do With This

Forget "win or lose" metrics for a moment. Look at your product, your team's process, or your own learning strategy. Where are you "sucking supervision through a straw" with delayed, sparse feedback? Instead of waiting for a quarterly review or a major product launch to see if something worked, build in continuous, local feedback loops. If you're teaching a new hire, provide "per-move" guidance on their tasks, not just an end-of-project debrief. For a new product feature, design micro-interactions that give you immediate signals on user engagement for that specific element, rather than just relying on overall retention numbers. Identify the "moves" in your system and figure out how to assign high-quality, continuous credit to each one.