Hiranmay Darshane

Squint enough, and RL'ing CoT reasoners is approximable as Monte Carlo Tree Search policy learning.

February 2026
Meta: Initial version written and shared privately: August 2025. Expansion and final release: February 26, 2026. Speculative synthesis. I could be wrong about many things.

Corporate needs you to find the difference in these two pictures. They're the same picture.

Having the ability to (learn to) succeed at any auto-regressive, continuous task can be formalised as having: 1) a good pre-trained policy to generate a finite set of legal moves (policy model), 2) an outcome estimator or verifier to pick moves (value model), 3) a search method that can recover from and outdo greedy choices by exploring alternative trajectories (tree search), and 4) a method to upweight good trajectories (credit assignment). Squint hard, and one realises that RL'ing CoT reasoners satisfies all four.


Introduction: Why scale "inference-time compute" at all?

At some point in 2024, it became obvious to many that the next paradigm for training frontier LLMs, after the overwhelming success of internet-scale next-token prediction and human-preference post-training, would be having models spend more "resources" (some surrogate measures: wall-clock time at inference, tokens, compute, etc.) on problems adaptively. It's intuitive and obvious: a question like "Solve the Riemann hypothesis" warrants much more expensive inference-time decoding or "thinking" than "What's the capital of France?"

The best precedent for this comes from board games. AlphaGo3 and AlphaZero4, instead of being models tasked to one-shot a move, were systems that spent compute at inference time running MCTS, i.e. simulating thousands of trajectories from the current board state before committing to a single move. The insight, in my opinion, is that choosing the likeliest completion is a greedy operation, and the local correctness of a greedy choice correlates poorly with long-horizon correctness. Some form of exploration, i.e. looking into the future at how well moves fare, is definitely necessary. One can argue this is what leads to Move 37-like phenomena, where the model is planning over a long effective horizon.

Andy L. Jones (2021) quantified the train/test compute tradeoff directly in board games.1 The result: each additional 10x of train-time compute was substitutable by roughly 15x of test-time compute, and the relation was linear in log-compute, meaning it held across widely varying compute budgets.

The question then becomes: what is the right approach to spend more compute at inference-time for LLMs? Especially when LLMs operate within open-ended language workloads that are far from the perfect-information zero-sum games that made MCTS so cleanly tractable?

Naively, there are multiple ways. One could use model-level measures like entropy or varentropy2 to resample uncertain tokens until they fall below certain thresholds. One could use best-of-n with a majority-voting scheme to parallelise sampling across multiple instances. But these are partial fulfilments.
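For concreteness, best-of-n with majority voting is only a few lines. This is a minimal sketch; `noisy_sampler` is a hypothetical, deterministic stand-in for an LLM sampling call:

```python
import itertools
from collections import Counter

def best_of_n_majority(sample_fn, prompt, n=32):
    """Sample n completions independently and return the most common
    final answer together with its vote share."""
    answers = [sample_fn(prompt) for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n

# Deterministic stand-in sampler: "correct" 3 times out of every 5 draws.
_draws = itertools.cycle(["42", "42", "7", "42", "3"])
def noisy_sampler(prompt):
    return next(_draws)

answer, share = best_of_n_majority(noisy_sampler, "What is 6*7?")
# answer == "42", share == 0.625: the majority vote recovers the right
# answer even though individual samples are wrong 40% of the time.
```

Note that every branch here is independent and depth-1, which is exactly why this is a "partial fulfilment" in the sense discussed later.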


The red herring that is MCTS, and what is MCTS anyway?

MCTS worked amazingly well for the aforementioned games. Let's be precise about what an MCTS setup actually consists of. The rough mapping matters later.

The policy network takes a board state as its input and outputs a probability distribution over legal moves. Intuitively, its job is to constrain the search space: instead of branching into every possible legal move, you only seriously consider the ones the policy assigns high probability to. In effect, this prunes the search tree's breadth.

The value network takes a board state as input and outputs a simple scalar: the estimated probability of winning from that position. Its job is to evaluate positions without having to play out the full game to a terminal state. This is just pruning the search tree's depth.

The search procedure itself (MCTS) uses the policy network to decide where to look and the value network to decide how good the moves that it found happen to be, thus integrating both signals across many simulated trajectories to converge on a single move for each board state.

Finally, credit assignment closes the loop. When a simulated trajectory reaches a terminal state, that signal propagates back up the tree, updating the value estimates of every node that was visited. Good subtrees get higher value estimates, bad subtrees get lower ones, and future search is biased accordingly. It's worth noting that credit assignment within the tree at inference time and gradient backprop in the networks are distinct operations.
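The four components can be made concrete with a toy sketch: a generic PUCT-style MCTS over a deliberately trivial single-player game (reach exactly 10 by adding 1, 2, or 3), with cheap stand-ins for the policy and value networks. This is illustrative only, not any of the AlphaGo-family implementations:

```python
import math

# Toy single-player game: from a state, add 1, 2, or 3 per move, trying
# to land exactly on TARGET. The point is to label the four MCTS roles.
TARGET, MOVES = 10, (1, 2, 3)

def policy(state):
    # Policy-network stand-in: a prior over legal moves (uniform here).
    # Its job is to prune breadth: only moves it weights get explored.
    legal = [m for m in MOVES if state + m <= TARGET]
    return {m: 1.0 / len(legal) for m in legal}

def value(state):
    # Value-network stand-in: a cheap scalar estimate of a position, so
    # simulations need not play out to a terminal state. Prunes depth.
    return state / TARGET

class Node:
    def __init__(self, state):
        self.state, self.children = state, {}
        self.visits, self.total = 0, 0.0

def select(node, prior, c=1.4):
    # PUCT-style selection: the prior guides breadth, the running value
    # statistics (q) guide depth, c trades off exploration.
    def score(m):
        child = node.children[m]
        q = child.total / child.visits if child.visits else 0.0
        u = c * prior[m] * math.sqrt(node.visits + 1) / (1 + child.visits)
        return q + u
    return max(node.children, key=score)

def mcts(root_state, simulations=200):
    root = Node(root_state)
    for _ in range(simulations):
        node, path = root, [root]
        while node.state < TARGET:
            prior = policy(node.state)
            for m in prior:
                node.children.setdefault(m, Node(node.state + m))
            node = node.children[select(node, prior)]
            path.append(node)
            if node.visits == 0:
                break  # expand one new node per simulation, then evaluate
        leaf = 1.0 if node.state == TARGET else value(node.state)
        for n in path:  # credit assignment: back the evaluation up the tree
            n.visits += 1
            n.total += leaf
    # Commit to the most-visited move at the root.
    return max(root.children, key=lambda m: root.children[m].visits)
```

The backup loop at the end is the in-tree credit assignment described above; training the networks themselves would be a separate gradient step.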

Note that AlphaGo and AlphaZero still operate on the actual game tree, i.e. the rules are known already; the "world model," or the state, is fully explicit. MuZero5, meanwhile, got rid of that explicit scaffold: it learnt a latent world model and ran the entire search in representation space, never touching the real game state directly. The tree search happened over learned representations of positions, not symbolic board states. This is probably a closer analogue to what CoT reasoners do: there does exist an oracle to learn from at each rollout, but at inference time there is no ground-truth game tree. Worth noting that MuZero, unlike LLMs, never had pre-trained priors and learnt everything via self-play.

In the context of LLMs, one can recall that people in 2024, up until o1-preview or even until r1-preview, suggested that the way ahead for scaling inference-time compute in LLMs is something similar to explicit MCTS: i.e. you task the LLM to generate candidate completions, train a reward model to evaluate position strength, and search to converge on singular moves. PRMs and ORMs were speculated to be this reward model. OpenAI had a paper in early 2023 called "Let's Verify Step by Step"6 that made use of process supervision.

As an aside, I want to specially mention entropix, which had particular memetic appeal and capture just around that time (August–September 2024). entropix took the approach of parallelising token-level completions at each state whose logits showed varentropy above a threshold.
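For intuition, the two quantities entropix thresholds on are easy to compute from raw logits. This sketch shows only the quantities themselves, not entropix's actual decoding rules or threshold values:

```python
import math

def entropy_varentropy(logits):
    """Entropy and varentropy of the categorical distribution implied by
    raw logits. Varentropy is the variance of the token surprisal -log p."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # numerically stable softmax
    z = sum(exps)
    probs = [e / z for e in exps]
    surprisals = [-math.log(p) for p in probs]
    h = sum(p * s for p, s in zip(probs, surprisals))             # entropy
    v = sum(p * (s - h) ** 2 for p, s in zip(probs, surprisals))  # varentropy
    return h, v

# Peaked distribution: low entropy -> commit to the argmax token.
h1, v1 = entropy_varentropy([10.0, 0.0, 0.0, 0.0])
# Flat distribution: high entropy -> the kind of uncertain state where an
# entropix-style rule would resample or branch instead of committing.
h2, v2 = entropy_varentropy([1.0, 1.0, 1.0, 1.0])
```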

Coming back to PRMs and ORMs, though, this is really where the problem begins. Consider a set of 500 words that would make perfect semantic sense following a given prompt: there is just no way to run 500 completions at every token-selection step until termination, compared to, say, ~35 legal moves in chess. Take away the "semantically sensible" filter and LLMs have a vocabulary's worth of possible completions. The depth and breadth are just too big. So the question is: can you get MCTS-like behavior without the explicit tree?


LLMs and the CoT-MCTS Equivalence

A good realisation, obvious in hindsight, is that many people's mental models overfitted on literal MCTS, i.e. one pictures a decision tree splitting, branches being tried till termination, etc., without realising that, taken at the limit, a good chain-of-thought does all of this: scheming moves from the current state, evaluating probable options, committing to a move, assigning credit, etc., with a much terser policy that also happens to exist as an extremely strong prior.

Consider this CoT reasoning pattern:

"Let me think about this step by step. First, I could try approach A… [expansion]. Actually, that seems problematic because… [evaluation]. Let me try approach B instead… [backtracking/selection]. This looks more promising because… [value estimation]. Continuing with B… [commitment]."

Let's try and see how the four MCTS components fit here.

Policy network → the model generating the next reasoning step. The model's token-completion distribution, shaped by pretraining, SFT, RLHF, etc., constrains which reasoning moves get explored. Strong priors here are essential, and this is why outcome-based RL didn't work before r19 and o1: the base model had to be strong enough to occasionally sample coherent reasoning traces before the RL signal had anything to reinforce. You're prospecting for circuits that already exist. A big factor here was probably SFTing on a crazy number of expert-generated STEM CoTs.

Value network → the model recognising dead ends and evaluating its options. "Actually, that seems problematic because…" is intuitively the model evaluating position strength in natural language rather than as an explicit scalar. Worth noting that unlike in AlphaGo, where the value network is a separate component, in a CoT reasoner it's the same model doing a different thing at a different moment in the sequence, dynamically switching between policy and value roles. One wonders if there's some advantageous mutual-information type of thing going on here.

As an aside, this connects cleanly to the "generator-verifier gap" that people talk about all the time. The simple intuition is that LLMs are better verifiers than they are generators. Thus one may argue that the value network was much easier to bootstrap from existing language priors, unlike the policy network, which required painstaking work on stronger base models.

MCTS → the overall CoT process orchestrating exploration, evaluation, and commitment across the reasoning trace. Like MuZero, the search happens over learned representations of reasoning states (which map onto the tokens the model is producing), not over any explicit symbolic structure. This is the unexpected behavior that most people were surprised to see emerge from simple, clean RLVR. It is a satisfying realisation that models are naturally incentivised to do all of this when tasked with maximising correctness on STEM tasks.

Credit assignment → GRPO.7 The core of the RLVR part is Group Relative Policy Optimisation (GRPO). For a minimum viable intuition, one may think of GRPO as an algorithm that reinforces rollouts (completions) to a given prompt, upweighting each relative to how the other rollouts in the same group fared. Just to contrast, AlphaGo was trained via self-play after an initial supervised learning phase, with self-play updating the policy network. AlphaZero and MuZero, on the other hand, were trained entirely via self-play, jointly updating the value and policy networks.
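The group-relative part reduces to a few lines. This is a toy sketch of GRPO's advantage term only, not the full objective with its clipping and KL machinery:

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: normalise each rollout's reward against
    the other rollouts sampled for the same prompt. Every token in a
    rollout then shares that rollout's single scalar advantage."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# A group of 4 rollouts for one prompt: two correct (reward 1), two not.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
# Correct rollouts get an advantage near +1, incorrect near -1: every
# token in a correct rollout is upweighted by the same amount, which is
# the uniform per-rollout credit property that draws the usual objection.
```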

The unreasonable effectiveness of GRPO, and closing the RL loop

GRPO propagates the outcome reward back through the token sequence to update the model. The function here is roughly summarisable as: turn terminal outcomes into updated beliefs about which intermediate steps were good. This is kind of unintuitive. A common objection to GRPO is that all tokens in a correct rollout get an identical gradient: the token sequence that says "hey, this doesn't really work" gets treated the same as the token sequence that says "the final answer is xyz." This seems obviously wrong; one step definitely mattered more.

The two big possible confusions here are: how does 1) GRPO update specific strategy circuits, and 2) why does a tree search like CoT emerge naturally anyway?

On question 1, i.e. how does GRPO update specific strategy circuits? The gradient at the token level is just the surface reflection of the LLM's entire forward pass. Intuitively, what's actually happening is that GRPO is running a comparison across a group of rollouts and asking: which circuits are systematically more active in the high-reward rollouts than in the low-reward ones? Those circuits get upweighted. A rollout that switched strategies early and succeeded has the "recognise dead end, pivot" circuit active. Even when rollouts diverge early, what differentiates better from worse ones are higher-order properties like strategies, reasoning subsequences, etc. Across thousands of such comparisons, GRPO ends up finding the circuits that best correlate with success and reinforces them, not by actually labelling individual tokens or strategies as good or bad, but by correlating activation patterns with outcomes.10 There is recent work like GSPO which takes a different approach, but that remains out of scope for this discussion.

On question 2, i.e. why does tree search emerge naturally? I think this is simply because the model is being trained to maximise the probability of correct outcomes over a distribution of problems, and for hard problems, single-attempt linear reasoning reliably fails. The model discovers through outcome signal alone that traces which contain exploration, self-evaluation, and backtracking tend to score better than traces which don't. It's just a more comprehensive search or exploration over the entire solution space. These comprehensive searches end up succeeding, and via GRPO the model gets told that certain rollouts were better, and what resembles a tree search is what those rollouts had in common. The behavior emerges because it's crucial for the reward.

It thus becomes intuitive that taken at the limit, and if one knows how to squint well, given a model with strong enough priors to meaningfully constrain the search space and a sense for what approaches are promising, a CoT quickly starts mirroring an MCTS-like search process.


If you squint further, the analogy breaks (and that probably doesn't matter)

In board games, the policy distribution over legal moves is dense and well-constrained; MCTS works precisely because the branching factor is manageable. LLM token distributions have long tails. The point of RL on CoT reasoners is exactly to upweight the right long tails, collapsing them toward the dense, well-constrained distributions that the model benefits from containing.

The value network, in my opinion, is the most visibly different piece. The policy model and the actual search map very cleanly, but this does not. In MCTS the value network is explicitly trained on ground-truth position evaluations; here it's implicit. People have all sorts of things to say here, like "the value network is within the model as god intended," signalling that it's implicit and that's good enough. Obviously, if you believe in those generator-verifier gap discussions, this is appealing to you. I just want to say it is visibly different, even if you try to map it to the LLM evaluating its own strategies. Sometimes these evaluations can be externalised to an online verifier-type environment: an LLM can verify its calculations for a sub-problem using a deterministic math environment before proceeding, instead of relying on self-evaluation alone. I don't think this is a very big deal; I just think the difference is biggest here.

Additionally, MCTS converges provably in perfect-information zero-sum games. With LLMs, a CoT search procedure doesn't.

I want to stress the one place where the implicit version of the tree search procedure is strictly stronger than the explicit one. In MCTS, each branch is informationally isolated: when you abandon branch A and explore branch B, branch B has no access to what branch A discovered; the failed attempt influences the tree only through the value update on the parent node. In CoT reasoning, the entire trace is in context. Approach B is conditioned on everything approach A tried and failed at. Models being good at in-context learning obviously helps here, even if the forward pass is just over their past thoughts. The model can avoid the same dead ends, reuse partial insights, and notice structural similarities. This is a strict informational advantage that no explicit tree search has, and it compounds on problems where early failed attempts contain useful signal about the solution space.


What the Four Components Inform, in a Research-Selection Sense

Policy model → the quality of your prior determines the ceiling of your search. If the base model can't generate coherent reasoning moves, GRPO has nothing to reinforce. This predicts that SFT on high-quality, diverse reasoning traces should have outsized returns. We obviously know this to be true; the best RLVR practitioners often stress "SFT warmups" before beginning RLVR. A model that has seen one way to approach a problem will search one subtree; a model that has seen ten will search ten. Very naively and loosely, all of this also predicts that using fixed special tokens to mark search operations, say backtracks, splits, exploration transitions, etc., could bind specific circuits strongly to those operations, producing cleaner and more compositional representations of search behavior. I am unsure how well this would work in reality.

Value model → how do you get better at evaluation? We should believe that greedy selections at the token level don't always correlate well with correctness. So say a model generates good reasoning moves in the general sense of having good heuristics about branching, exploration, etc., but spits out inaccurate steps very often and can't tell which searches are promising. This predicts that any intervention improving intermediate evaluation will have compounding returns: the model might take n further steps after the step in question, and those may go well or badly depending on its evaluation of that step. This means it is probably worth training models to use online feedback at inference time: something like looking up a historical fact to verify it thoroughly before using it to reason about something else, or verifying the symbolic equivalence or the axiomatic validity of an obtained equation before modifying it into other forms. This remains an important direction, and ties in very cleanly with tool-use efforts.
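As a toy illustration of the equation-verification idea above, here is a minimal sketch of a cheap numerical stand-in for a symbolic-equivalence check (all names hypothetical): the kind of deterministic signal a model could query before committing to a rewritten expression.

```python
import math
import random

def numerically_equivalent(f, g, trials=200, lo=-5.0, hi=5.0, tol=1e-9):
    """Cheap stand-in for a symbolic-equivalence verifier: compare two
    expressions at random sample points. Agreement at every sampled point
    is strong (not conclusive) evidence they are the same function."""
    rng = random.Random(0)  # fixed seed: deterministic verification
    for _ in range(trials):
        x = rng.uniform(lo, hi)
        try:
            fx, gx = f(x), g(x)
        except ValueError:
            continue  # skip points outside a shared domain
        if not math.isclose(fx, gx, rel_tol=tol, abs_tol=tol):
            return False
    return True

# (x + 1)^2 vs x^2 + 2x + 1: a rewrite worth certifying before reuse.
ok = numerically_equivalent(lambda x: (x + 1) ** 2,
                            lambda x: x * x + 2 * x + 1)
# sin^2 x + cos^2 x vs 0.99: a subtly wrong step the check catches.
bad = numerically_equivalent(lambda x: math.sin(x) ** 2 + math.cos(x) ** 2,
                             lambda x: 0.99)
```

A real deployment would hand this role to a proper CAS or proof environment; the point is only that the evaluation signal can be external and deterministic rather than self-generated.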

This also ties in with the research direction I am personally most excited about: the deeper, training-time fix is to teach the model verifier abilities directly by training it on (action → evaluator feedback) sequences, so it learns to better evaluate intermediate positions, not just generate moves. A model trained this way internalises the evaluator: at test time, its own self-critique becomes feedback it has already learnt to use during training. The model that learned to read evaluator outputs learns to generate and use its own. Inference-time feedback becomes more valuable the better the model is at using it. It is sub-optimal to just get feedback; one can go further and try maximising P(success | feedback) by training models to succeed at verifiable tasks with pre-conditioned and interleaved feedback. Meta had a paper (CWM: An Open-Weights LLM for Research on Code Generation with World Models)12 that went under the radar, training a model on large-volume observation-action trajectories from Python environments and realising large benchmark gains.

At the risk of digressing too much, it does seem intuitive that incentivising a model to have two explicit "try" and "reflect" modes would help cover solution spaces better at both training and inference time, while also benefitting from high mutual information between the two modes. I'm looking too speculatively into the future, but one may imagine a day where multi-agent systems become the norm and the "Bayesian-ness" of how you update on new external feedback starts mattering more.

Tree search → the value of inference-time compute scales with search quality. Naive best-of-n is a flat, unstructured tree with all branches at depth 1 and no deeper exploration. This predicts that structured search, where later rollouts are conditioned on what earlier ones found, should dramatically outperform naive sampling at the same compute budget. The open problem is the integration mechanism: how do you take what one reasoning trace discovered and use it to seed a better one? I saw another interesting paper that went under the radar a few months ago on adaptive parallel reasoning models13 that tries to tackle this.
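The structured-vs-flat distinction can be sketched minimally: later attempts get seeded with notes from earlier failures, unlike best-of-n where every branch starts blind. `attempt_fn` and `toy_solver` here are hypothetical stand-ins for an LLM call and one possible integration mechanism:

```python
def structured_search(attempt_fn, problem, budget=8):
    """Each attempt is conditioned on notes from every previous failure.
    attempt_fn is a hypothetical LLM call returning (answer, solved, note)."""
    notes = []
    for _ in range(budget):
        answer, solved, note = attempt_fn(problem, notes)
        if solved:
            return answer, len(notes)  # answer plus number of failed tries
        notes.append(note)  # failed-attempt signal seeds the next rollout
    return None, len(notes)

# Toy stand-in: the "solver" only finds approach C after being told that
# A and B are dead ends, mimicking reuse of failure information.
def toy_solver(problem, notes):
    ruled_out = set(notes)
    for approach in ("A", "B", "C"):
        if approach not in ruled_out:
            if approach == "C":
                return "solved-via-C", True, approach
            return None, False, approach
    return None, False, "exhausted"

result, failures = structured_search(toy_solver, "hard problem")
# result == "solved-via-C" after exactly 2 failed, note-producing attempts.
```

The hard research question is, of course, what `note` should actually be for a reasoning trace, a summary, a verified partial result, or the raw trace itself.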

Credit assignment → the precision of your reward signal determines what the model learns. Outcome-only RL tells the model that some trajectories were better than others. We made the case as to why that does lead to circuit-level reinforcement, but this is implicit and one would imagine there is some structure to be exploited. MRT (Qu et al., 2025)11 takes this structure seriously. It introduces differential credit within a single rollout by segmenting the reasoning trace into episodes and asking at each prefix: if I stopped thinking right now, what's my probability of getting the answer correct? The delta between episode j and j+1 is the progress bonus. After episode 1, force-terminate thinking and sample completions, say 20% success rate. After episode 2, same thing, 60% success rate. The strategy switch gets +0.4, the initial approach gets +0.2. The model learns that recognising a dead end and switching is the high value move, not just that the rollout eventually succeeded.
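The episode-level bookkeeping above reduces to a few lines. This is a toy reconstruction of the progress-bonus idea only, not MRT's actual implementation; the sampling-based estimation of the success rates is elided, and the 0% pre-thinking baseline is a hypothetical addition:

```python
def progress_bonuses(success_rates):
    """success_rates[j] is the estimated P(correct) if thinking is
    force-terminated after episode j (estimated in practice by sampling
    completions from that prefix, elided here). The bonus for episode
    j+1 is the progress it added over episode j."""
    return [b - a for a, b in zip(success_rates, success_rates[1:])]

# The numbers from the text, with a hypothetical 0% pre-thinking baseline
# prepended as episode 0: 20% after episode 1, 60% after episode 2.
bonuses = progress_bonuses([0.0, 0.2, 0.6])
# Episode 1 earns ~+0.2; the strategy switch in episode 2 earns ~+0.4,
# so the pivot itself, not just the eventual success, gets credited.
```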

The implementation is also telling. To estimate the value of an intermediate state, MRT runs rollouts from that state, reminiscent of how AlphaGo trained its value network. To train a model to do implicit tree search, you run explicit tree search inside each training step. The explicit and implicit keep collapsing back into each other.

Bootstrapping stronger priors → a trajectory can be upweighted only if it exists, and thus a non-zero pass rate must exist for a given task. Developing strong priors on extremely difficult tasks remains an important direction. Aviral Kumar's lab has interesting work in this vein.14


The Endgame

The advantages of explicit tree search and implicit CoT search are complementary and cover each other's weaknesses almost perfectly. CoT has: information mixing across attempts, continuous branching, compression via language, self-conditioning on uncertainty. But it's a single sequential trace with no true parallelism and no systematic exploration. MCTS has: systematic exploration, parallelism, explicit value estimates at each node. But branches are informationally isolated, and the world model has to be given or learned separately.

The endgame is a system that searches with CoT-like representations carrying the full history of what's been tried (so branches aren't informationally isolated), explores systematically and in parallel across discrete candidates (say, something like Lean tactics, at least in verifiable domains), and uses deterministic environmental signals wherever possible, signals it is ideally trained to maximise its gains from.

Corporate needs you to find the difference in these two pictures. They're the same picture.

Further Reading & Extra Context

References

  1. Jones, A.L. (2021). Scaling Scaling Laws with Board Games. arXiv:2103.05045.
  2. xjdr-alt. entropix. github.com/xjdr-alt/entropix.
  3. Silver, D. et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature.
  4. Silver, D. et al. (2017). Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm.
  5. Schrittwieser, J. et al. (2020). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. (MuZero).
  6. Lightman, H. et al. (2023). Let's Verify Step by Step. OpenAI. arXiv:2305.20050.
  7. Shao, Z. et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300.
  8. Coulom, R. (2006). Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search. Computers and Games.
  9. Guo, D. et al. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948.
  10. Anthropic (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. transformer-circuits.pub.
  11. Qu, C. et al. (2025). Meta Reinforcement Fine-Tuning. arXiv:2503.07572.
  12. FAIR CodeGen Team, Meta (2025). CWM: An Open-Weights LLM for Research on Code Generation with World Models. arXiv:2510.02387.
  13. Pan, R. et al. (2025). Adaptive Parallel Reasoning. arXiv:2504.15466.
  14. Aviral Kumar. aviralkumar2907.github.io.