Revisiting Tree Search for LLMs: Gumbel and Sequential Halving for Budget-Scalable Reasoning
Standard AlphaZero-style tree search for LLM reasoning suffers from a scaling failure: on GSM8K and Game24, accuracy actually drops as the search budget increases beyond moderate levels. This paper introduces ReSCALE, which replaces PUCT selection and Dirichlet noise with Gumbel sampling and Sequential Halving—a best-arm identification technique from multi-armed bandits. The key insight is that root action-selection design is critical for budget-scalable reasoning without any changes to the model or its training.
The paper presents a well-motivated and technically sound solution to an important problem in LLM reasoning. The scaling pathology identified in AlphaZero-style MCTS is significant, and the proposed ReSCALE method effectively restores monotonic scaling through principled algorithmic changes. The ablation studies provide clear evidence that Sequential Halving is the primary driver of improvement, making the contribution focused and actionable.
The diagnosis of the AlphaZero scaling failure is compelling. The authors identify that standard PUCT-based selection overexploits a small subset of actions, creating visit count imbalances that worsen with larger budgets. The solution elegantly adapts Gumbel AlphaZero components—Sequential Halving for root selection and improved policy with mixed value approximation for non-root nodes. The mathematical formulation is precise: root actions are scored by $f^{i}=g(a^{i})+\log p(a^{i})$, with budget distributed via $N/\lceil\log_{2}M\rceil$ simulations per round. The non-root selection using $\pi^{\prime}(a)=\mathrm{softmax}(\log p(a)+\sigma(v_{c}(s_{t}\oplus a)))$ with mixed value approximation $v_{\text{mix}}(s_{t}\oplus a)=\frac{v_{\phi}(s_{t}\oplus a)+N_{\Sigma}\cdot\bar{v}_{t}}{1+N_{\Sigma}}$ shows careful attention to exploration-exploitation tradeoffs.
The experimental scope is narrow: only two tasks (GSM8K and Game24) on a single model (Llama2-7B). Whether the scaling failure generalizes to other reasoning domains, larger models, or different architectures remains unverified. The Best-of-N baseline performs surprisingly poorly (54.1% on Game24 vs 85.3% for ReSCALE), suggesting it may not be a well-tuned comparison—parallel work on PRMs and majority voting often achieves stronger results. The paper does not establish whether Sequential Halving helps because it reduces variance in root selection or because it interacts specifically with sentence-level action spaces in LLMs. The theoretical analysis of why lookahead pathology occurs in this specific setting (presence/absence of a learned value model, sparsity of rewards) is underdeveloped compared to the empirical fixes.
The evidence supports the core claim that ReSCALE restores monotonic scaling, but the comparison to related work is limited. The authors frame their work against Wan et al. [2024b]'s AlphaZero approach, which they replicate faithfully using the same experimental setup. However, the evaluation omits comparisons to contemporary tree-search methods for LLMs like Tree of Thoughts (Yao et al. 2023) with comparable budgets, or more recent inference-time compute scaling approaches. The Game24 task is a relatively small dataset (1,362 problems), making variance estimates potentially unstable despite multiple seeds. The value network training procedure—retaining only 17 unique completions per checkpoint—raises questions about value estimation quality and whether the observed improvements simply mask value approximation errors.
The authors provide code at an anonymous GitHub repository (https://github.com/CognitiveAISystems/ReSCALE), which facilitates reproduction. However, several key hyperparameters are underspecified: the temperature used for LLM sampling during tree expansion, the exact architecture of the value head added to Llama2-7B, and whether the Sequential Halving parameter $M$ (number of candidate actions at root) is fixed or adaptive. The value network training uses Monte Carlo return targets with MSE loss, but the learning rate and optimization details are omitted. The paper reports standard deviations across "hyperparameter configurations" but does not clarify which hyperparameters were varied or their ranges, making it difficult to assess whether the improvements are robust or depend on specific tuning for each budget level.
Neural tree search is a powerful decision-making algorithm widely used in complex domains such as game playing and model-based reinforcement learning. Recent work has applied AlphaZero-style tree search to enhance the reasoning capabilities of Large Language Models (LLMs) during inference, but we find that this approach suffers from a scaling failure: on GSM8K and Game24, accuracy drops as the search budget increases. In this paper, we present ReSCALE, an adaptation of Gumbel AlphaZero MCTS that replaces Dirichlet noise and PUCT selection with Gumbel sampling and Sequential Halving, restoring monotonic scaling without changes to the model or its training. ReSCALE reaches 58.4\% on GSM8K and 85.3\% on Game24 at budgets where the baseline degrades. Ablations confirm that Sequential Halving is the primary driver of the improvement.
Pick a starting point or write your own. Challenges run in the background, so you can keep reading while the AI investigates.
No challenges yet. Disagree with the review? Ask the AI to revisit a specific claim.