AlphaGo's MCTS: Turn Any Decision Into a Predictable Game

Key Takeaways

AlphaGo made the combinatorially complex game of Go tractable by using neural networks to guide a Monte Carlo Tree Search (MCTS) algorithm, a core breakthrough in AI decision-making.
The AI's decision process for each move revolves around an iterative, four-step MCTS: Selection, Expansion, Evaluation, and Backup, continuously refining its understanding of optimal actions.
AlphaGo employs a specific action-selection criterion called PUCT (commonly read as "Predictor + Upper Confidence bounds applied to Trees") during the Selection phase, balancing the exploration of new moves with the exploitation of known good ones.
Each MCTS "simulation" involves the neural network making a rapid, intuitive guess about the quality of a board state, effectively estimating future game outcomes. These evaluations then inform the AI's overall strategy through the Backup step.
Founders can adapt this "AlphaGo Monte Carlo Tree Search (MCTS) Four-Step Process" to dissect and optimize their own complex, multi-step decision trees, from sales funnel conversions to strategic product roadmaps.

The AlphaGo Monte Carlo Tree Search (MCTS) Four-Step Process

Eric Jang detailed AlphaGo's method for making incredibly complex search problems manageable. He described the core MCTS algorithm as an iterative loop designed to pick the best move in a game like Go. This isn't a pre-computed database of moves; it's an active, real-time construction of a decision tree, refined with every simulation.

This process is crucial because, as Jang explained, “AlphaGo's core conceptual breakthrough was using neural nets to make this search problem tractable.” The algorithm works by focusing computational effort on the most promising paths, guided by the neural network's "intuition."

Here’s how the four-step process works, verbatim from the episode:

Step 1: Selection: We're basically going to select the best action for this. When this root node is created, we also know that we can evaluate it under our neural network and get the quantities Vθ, as well as our probability over actions... The first step is to do the selection of the tree. Again, this is a very shallow tree. All we have so far is essentially a tree of depth one. Our first move is to select by maximizing, or argmaxing, the PUCT criterion, which is basically Q(s,a) + c_PUCT × Pₐ × (√N / (1 + Nₐ)).
Step 2: Expansion: You get to this node and you realize it's not a leaf node. It's not a terminal game, so you cannot resolve the final resolution. The next step is expansion. You will then run this board state through the policy network.
Step 3: Evaluation: When we evaluate the node here, we're going to evaluate it from the perspective of this player. This node has possible actions that we could take, and we expand the leaf nodes here. For each of these nodes that we could arrive at, we're going to now check how good they are... We're basically using our neural network to make an intuitive guess of how good this board is from the perspective of this player. This is essentially a quick guess as to whether I'm going to win or not if I were to play to the end.
Step 4: Backup: The Q value assigned to the node here for taking this action is just the average across your evaluated values. You take a running mean over all the simulations you've taken, averaging the values of the children nodes. That's the backup step, and once you evaluate this, you can recursively go back up.
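
The four steps quoted above can be sketched in a few dozen lines of Python. This is a minimal illustration under stated assumptions, not AlphaGo's implementation: the game is a toy stone-taking game (take 1 or 2 stones, whoever takes the last stone wins), `toy_network` is a hypothetical stand-in that returns uniform priors and a neutral value where AlphaGo would use a trained network, and the `C_PUCT` value and the names `Node`, `select_child`, `simulate`, and `run_mcts` are made up for this example.

```python
import math

C_PUCT = 1.5  # exploration constant; the value here is an arbitrary assumption


class Node:
    """Statistics for the move leading into this node."""

    def __init__(self, prior):
        self.prior = prior      # P_a: prior probability from the policy
        self.visits = 0         # N_a: simulations that passed through here
        self.value_sum = 0.0    # backed-up values, seen by the player who moved here
        self.children = {}      # action -> Node, filled in on expansion

    def q(self):
        # Running mean of backed-up values (0 for an unvisited node).
        return self.value_sum / self.visits if self.visits else 0.0


def toy_network(stones):
    """Hypothetical stand-in for the network: uniform priors, neutral value."""
    actions = [a for a in (1, 2) if a <= stones]
    return {a: 1.0 / len(actions) for a in actions}, 0.0


def select_child(node):
    """Step 1: pick the child maximizing Q + c_PUCT * P_a * sqrt(N) / (1 + N_a)."""
    def puct(item):
        child = item[1]
        bonus = C_PUCT * child.prior * math.sqrt(node.visits) / (1 + child.visits)
        return child.q() + bonus
    return max(node.children.items(), key=puct)


def simulate(root, stones):
    path, node = [root], root
    # Step 1, Selection: descend while the current node is already expanded.
    while node.children:
        action, node = select_child(node)
        stones -= action
        path.append(node)
    if stones == 0:
        # Terminal: the previous player took the last stone, so the player
        # to move here has lost.
        value = -1.0
    else:
        # Steps 2 and 3, Expansion and Evaluation: one network call yields
        # priors for the new children and a value for the player to move.
        priors, value = toy_network(stones)
        node.children = {a: Node(p) for a, p in priors.items()}
    # Step 4, Backup: walk back to the root, flipping the sign at each level
    # because the two players alternate.
    for visited in reversed(path):
        value = -value
        visited.visits += 1
        visited.value_sum += value


def run_mcts(stones, n_simulations=200):
    root = Node(prior=1.0)
    for _ in range(n_simulations):
        simulate(root, stones)
    visits = {a: child.visits for a, child in root.children.items()}
    return max(visits, key=visits.get), visits


best_move, visit_counts = run_mcts(stones=4)
print(best_move, visit_counts)
```

With four stones, taking one leaves the opponent in a losing position, so the visit counts should concentrate on that move even though the stand-in network knows nothing about the game. The sign flip in the backup loop is a design choice for two-player games: each node stores values from the perspective of the player who moved into it, which is the perspective the parent needs when it selects.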

When This Works (and When It Doesn't)

Eric Jang highlights that this MCTS process is exceptionally effective for games with high combinatorial complexity, like Go. It makes seemingly impossible search problems tractable by growing the decision tree selectively: simulations concentrate on promising branches, and weak branches are left largely unexplored rather than searched exhaustively. The method works best when you can clearly define states, actions, and an objective (like winning or losing) whose outcome can be estimated by a neural network (or, in business, a reliable heuristic). Because the process is iterative, each simulation refines the strategy, and the final visit counts at the root act as a refined policy distribution over the available moves.
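
To make the point about final visit counts concrete, here is a small sketch of turning root visit counts into a policy distribution. The function name `visit_counts_to_policy` is hypothetical, and the `temperature` parameter is an assumption borrowed from the AlphaGo Zero description (policy proportional to visit count raised to 1/temperature); the episode summary above does not mention it.

```python
def visit_counts_to_policy(visit_counts, temperature=1.0):
    """Normalize root visit counts into move probabilities.

    temperature=1.0 gives probabilities proportional to visit counts;
    smaller values sharpen the distribution toward the most-visited move.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    weights = {a: n ** (1.0 / temperature) for a, n in visit_counts.items()}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one action must have been visited")
    return {a: w / total for a, w in weights.items()}


# Example with made-up counts: 30 visits for move "a", 10 for move "b".
policy = visit_counts_to_policy({"a": 30, "b": 10})
sharper = visit_counts_to_policy({"a": 30, "b": 10}, temperature=0.5)
```

In this made-up example the first call assigns "a" three quarters of the probability, and the lower temperature pushes that share higher.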

However, this approach runs into trouble when states or actions are poorly defined, or when the "game" has no clear end state to evaluate. It struggles if the "evaluation" step can't provide a reasonably accurate "quick guess" about future outcomes, for example when predicting the impact of a decision is almost pure guesswork. Also, if the "tree" is so wide that even guided exploration takes too long, or if the environment changes too rapidly between steps, MCTS can become impractical. It's built for problems where a clear, albeit complex, optimal path exists, and you have a mechanism to quickly estimate potential outcomes.

What to Do With This

Apply the AlphaGo MCTS process to optimize a key decision-making funnel in your startup, like converting a high-value lead. You're trying to find the best sequence of interactions to close the deal. First, clearly define your "states" (e.g., lead just opened email, lead asked a question, lead ghosted). Then, walk through the MCTS steps:

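1. Selection: Identify the most promising next action for your lead based on their current state and your best judgment (your "PUCT criterion"). Is it a personalized follow-up email, a cold call, or a targeted demo? Score each option by its estimated value plus a bonus for options you believe in but have rarely tried, and pick the highest combined score. Choosing on estimated value alone would skip the exploration half of PUCT.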

2. Expansion: For the action you selected, brainstorm 2-3 most likely immediate responses from the lead. What happens if you send that email? Do they reply positively, reply with a question, or not reply at all?

3. Evaluation: For each of those potential responses, make a quick, honest estimate of its value. How much closer does that response get you to closing the deal? A positive reply gets a high score, a question gets a medium score, and no reply gets a low or negative score. This estimate plays the role of the neural network's "intuitive guess" of the next state's value.

4. Backup: Average the estimated values of those potential responses. This average becomes the updated "Q value" for your initial action. If a particular action consistently leads to low-value future states, you learn to avoid it. Repeat this entire cycle multiple times for the same lead, simulating different actions and responses, until you have a clear "policy" for the best next move.
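
The four steps above can be walked through with numbers. The sketch below is illustrative only: the action names, the priors (your gut-feel weighting of each action), the response values, and the `C_PUCT` constant are all made-up assumptions, and `ActionStats` and `puct_score` are hypothetical helpers rather than part of any library.

```python
import math

C_PUCT = 1.0  # exploration weight; an arbitrary assumption for this example


class ActionStats:
    """Running statistics for one candidate action on a lead."""

    def __init__(self, prior):
        self.prior = prior  # initial belief that this action is the right one
        self.visits = 0     # how many times the action has been evaluated
        self.q = 0.0        # running mean of the values backed up so far

    def backup(self, value):
        # Step 4: incremental running mean over evaluated outcomes.
        self.visits += 1
        self.q += (value - self.q) / self.visits


def puct_score(stats, parent_visits):
    # Step 1: estimated value plus a bonus for rarely tried actions.
    bonus = C_PUCT * stats.prior * math.sqrt(parent_visits) / (1 + stats.visits)
    return stats.q + bonus


actions = {
    "follow_up_email": ActionStats(prior=0.45),
    "cold_call": ActionStats(prior=0.20),
    "targeted_demo": ActionStats(prior=0.35),
}

# Steps 2 and 3: imagined responses to the email and their estimated values
# (positive reply, reply with a question, no reply).
for value in (0.8, 0.4, -0.2):
    actions["follow_up_email"].backup(value)

total_visits = sum(s.visits for s in actions.values())
next_action = max(actions, key=lambda name: puct_score(actions[name], total_visits))
print(next_action, round(actions["follow_up_email"].q, 3))
```

Under these made-up numbers the email ends up with a positive average value, yet the next action to evaluate is the demo: it has a reasonable prior and has not been tried, so its exploration bonus outweighs the email's current estimate. That is the balance the Selection step is meant to strike.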