Revisiting Tree Search for LLMs: Gumbel and Sequential Halving for Budget-Scalable Reasoning

cs.AI cs.LG Leonid Ugadiarov, Yuri Kuratov, Aleksandr Panov, Alexey Skrynnik · Mar 22, 2026

What it does

Why it matters

This paper introduces ReSCALE, which replaces PUCT selection and Dirichlet noise with Gumbel sampling and Sequential Halving—a best-arm identification technique from multi-armed bandits. The key insight is that root action-selection design...

Main concern

Community signal

0 up · 0 down

AI Review AI reviewed

Plain-language introduction

Standard AlphaZero-style tree search for LLM reasoning suffers from a scaling failure: on GSM8K and Game24, accuracy actually drops as the search budget increases beyond moderate levels. This paper introduces ReSCALE, which replaces PUCT selection and Dirichlet noise with Gumbel sampling and Sequential Halving—a best-arm identification technique from multi-armed bandits. The key insight is that root action-selection design is critical for budget-scalable reasoning without any changes to the model or its training.

Critical review

Verdict

Bottom line

The paper presents a well-motivated and technically sound solution to an important problem in LLM reasoning. The scaling pathology identified in AlphaZero-style MCTS is significant, and the proposed ReSCALE method effectively restores monotonic scaling through principled algorithmic changes. The ablation studies provide clear evidence that Sequential Halving is the primary driver of improvement, making the contribution focused and actionable.

“On GSM8K, AlphaZero accuracy drops at Large budget (53.6%) while ReSCALE continues to improve (58.4%)”

Ugadiarov et al., Table 1 · Section 3

“Removing Sequential Halving reduces accuracy by 4.7%, while removing Gumbel noise reduces it by 1.7%”

Ugadiarov et al., Ablation Studies · Section 3

What holds up

The diagnosis of the AlphaZero scaling failure is compelling. The authors identify that standard PUCT-based selection overexploits a small subset of actions, creating visit count imbalances that worsen with larger budgets. The solution elegantly adapts Gumbel AlphaZero components—Sequential Halving for root selection and improved policy with mixed value approximation for non-root nodes. The mathematical formulation is precise: root actions are scored by $f^{i}=g(a^{i})+\log p(a^{i})$, with budget distributed via $N/\lceil\log_{2}M\rceil$ simulations per round. The non-root selection using $\pi^{\prime}(a)=\mathrm{softmax}(\log p(a)+\sigma(v_{c}(s_{t}\oplus a)))$ with mixed value approximation $v_{\text{mix}}(s_{t}\oplus a)=\frac{v_{\phi}(s_{t}\oplus a)+N_{\Sigma}\cdot\bar{v}_{t}}{1+N_{\Sigma}}$ shows careful attention to exploration-exploitation tradeoffs.

“We analyzed MCTS trees for examples where AlphaZero MCTS produced incorrect answers and found that it overexploits a small subset of actions, leading to a large imbalance in visit counts between explored and under-explored nodes”

Ugadiarov et al., Method · Section 2

“$v_{\text{mix}}(s_{t}\oplus a)=\frac{v_{\phi}(s_{t}\oplus a)+N_{\Sigma}\cdot\bar{v}_{t}}{1+N_{\Sigma}}$”

Ugadiarov et al., Method · Equation sequence

Main concerns

The experimental scope is narrow: only two tasks (GSM8K and Game24) on a single model (Llama2-7B). Whether the scaling failure generalizes to other reasoning domains, larger models, or different architectures remains unverified. The Best-of-N baseline performs surprisingly poorly (54.1% on Game24 vs 85.3% for ReSCALE), suggesting it may not be a well-tuned comparison—parallel work on PRMs and majority voting often achieves stronger results. The paper does not establish whether Sequential Halving helps because it reduces variance in root selection or because it interacts specifically with sentence-level action spaces in LLMs. The theoretical analysis of why lookahead pathology occurs in this specific setting (presence/absence of a learned value model, sparsity of rewards) is underdeveloped compared to the empirical fixes.

“Best-of-N... 54.1±1.0%”

Ugadiarov et al., Table 1 · Section 3

“Ablations confirm Sequential Halving as the primary contributor”

Ugadiarov et al., Conclusion · Section 4

Evidence and comparison

The evidence supports the core claim that ReSCALE restores monotonic scaling, but the comparison to related work is limited. The authors frame their work against Wan et al. [2024b]'s AlphaZero approach, which they replicate faithfully using the same experimental setup. However, the evaluation omits comparisons to contemporary tree-search methods for LLMs like Tree of Thoughts (Yao et al. 2023) with comparable budgets, or more recent inference-time compute scaling approaches. The Game24 task is a relatively small dataset (1,362 problems), making variance estimates potentially unstable despite multiple seeds. The value network training procedure—retaining only 17 unique completions per checkpoint—raises questions about value estimation quality and whether the observed improvements simply mask value approximation errors.

“To construct the value-training dataset, we sample 100 completions per example... remove duplicates, and retain 17 unique completions per checkpoint”

Ugadiarov et al., Experiments · Section 3

“On GSM8K and Game24, ReSCALE exhibits a steady increase in accuracy as the compute budget grows. In contrast, AlphaZero's accuracy plateaus and even degrades”

Ugadiarov et al., Table 1 caption · Table 1

Reproducibility

The authors provide code at an anonymous GitHub repository (https://github.com/CognitiveAISystems/ReSCALE), which facilitates reproduction. However, several key hyperparameters are underspecified: the temperature used for LLM sampling during tree expansion, the exact architecture of the value head added to Llama2-7B, and whether the Sequential Halving parameter $M$ (number of candidate actions at root) is fixed or adaptive. The value network training uses Monte Carlo return targets with MSE loss, but the learning rate and optimization details are omitted. The paper reports standard deviations across "hyperparameter configurations" but does not clarify which hyperparameters were varied or their ranges, making it difficult to assess whether the improvements are robust or depend on specific tuning for each budget level.

“Code — https://github.com/CognitiveAISystems/ReSCALE”

Ugadiarov et al., Abstract · Code link

“For each budget level, mean accuracy ± std is computed across hyperparameter configurations”

Ugadiarov et al., Experiments · Section 3

Abstract

Neural tree search is a powerful decision-making algorithm widely used in complex domains such as game playing and model-based reinforcement learning. Recent work has applied AlphaZero-style tree search to enhance the reasoning capabilities of Large Language Models (LLMs) during inference, but we find that this approach suffers from a scaling failure: on GSM8K and Game24, accuracy drops as the search budget increases. In this paper, we present ReSCALE, an adaptation of Gumbel AlphaZero MCTS that replaces Dirichlet noise and PUCT selection with Gumbel sampling and Sequential Halving, restoring monotonic scaling without changes to the model or its training. ReSCALE reaches 58.4\% on GSM8K and 85.3\% on Game24 at budgets where the baseline degrades. Ablations confirm that Sequential Halving is the primary driver of the improvement.

Challenge the Review

Pick a starting point or write your own. Challenges run in the background, so you can keep reading while the AI investigates.

Challenges are public to read, but only signed-in members can post them. Your challenge text is stored with your account for moderation, but usernames are not shown in the public thread.

No challenges yet. Disagree with the review? Ask the AI to revisit a specific claim.