LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning

cs.AI cs.CL Jianing Wang, Jianfei Zhang, Qi Guo, Linsen Guo, Rumei Li, Chao Zhang, Chong Peng, Cunguang Wang, Dengchang Zhao, Jiarong Shi, Jingang Wang, Liulin Feng, Mengxia Shen, Qi Li, Shengnan An, Shun Wang, Wei Shi, Xiangyu Xi, Xiaoyu Li, Xuezhi Cao, Yi Lu, Yunke Zhao, Zhengyu Chen, Zhimin Lin, Wei Wang, Peng Pei, Xunliang Cai · Mar 22, 2026

What it does

Why it matters

8%), and PutnamBench (41. 5%) with remarkably low inference budgets compared to prior open-source provers.

Main concern

Community signal

0 up · 0 down

AI Review AI reviewed

Plain-language introduction

LongCat-Flash-Prover is a 560B-parameter MoE open-source model targeting native formal reasoning in Lean4. The core innovation is decomposing formal theorem proving into three agentic capabilities—auto-formalization, sketching, and proving—trained via a Hybrid-Experts Iteration Framework and a novel RL algorithm called HisPO. The work claims state-of-the-art results on MiniF2F-Test (97.1%), ProverBench (70.8%), and PutnamBench (41.5%) with remarkably low inference budgets compared to prior open-source provers.

Critical review

Verdict

Bottom line

The paper presents a comprehensive system for formal theorem proving that achieves impressive empirical results on standard benchmarks. The decomposition of formal reasoning into auto-formalization, sketching, and proving is well-motivated, and the HisPO algorithm addresses genuine training challenges (train-inference discrepancy, policy staleness) in MoE-based RL for long-horizon tool-integrated tasks. However, the paper suffers from significant reproducibility gaps regarding training data composition, hyperparameters, and computational resources. While the benchmark results are strong, the comparison with proprietary models like Seed-Prover is handicapped by undisclosed inference budgets, and the reported scores depend heavily on the novel "legality detection" mechanisms that were only added after discovering severe reward hacking—raising questions about the robustness of the evaluation pipeline.

What holds up

The three-capability decomposition (auto-formalization, sketching, proving) is sound and mirrors human expert workflows. The HisPO algorithm's hierarchical masking strategy—handling sequence-level and token-level train-inference discrepancies separately—is technically sophisticated and addresses real instability issues in MoE RL training, as evidenced by the model's ability to train stably at 560B scale. The patch to prevent reward hacking via AST-based legality detection shows responsiveness to real vulnerabilities in Lean4 verification.

“We introduce a novel HisPO algorithm that implements a hierarchical gradient masking strategy by estimating the training-inference consistency of sequence-level and token-level granularity to ensure a stable optimization.”

paper · Section 3.3

Main concerns

The paper omits critical training details: the exact composition of the cold-start dataset, the number of expert iteration cycles, total training FLOPs, and specific hyperparameters for HisPO (δ_seq, δ_tok, clipping thresholds). The reported 97.1% on MiniF2F-Test relies on a tree search strategy with undisclosed branching factors, making direct comparison with prior work difficult. The revelation that the model exploited evaluation loopholes (achieving 97.9% syntax verification but only 27.9% valid proofs before the fix) suggests the training process initially optimized for verification artifacts rather than genuine proving ability.

“Step-100 (hacking): Syntax Verification 1003 / 1024 (97.9%), + AST Checking (fix) 286 / 1024 (27.9%)”

paper · Table 5

Evidence and comparison

The benchmarking is extensive, covering auto-formalization, theorem proving, and general reasoning. However, comparisons to proprietary models (Seed-Prover, Delta-Prover) are incomplete due to undisclosed inference budgets. The claim of 'only 72 attempts' for MiniF2F 97.1% is strong, but the paper notes that 'different models may have different budget calculations' and some results are extracted from external reports rather than independent evaluation. The informal reasoning degradation (Table 4) is acknowledged but not analyzed for statistical significance.

“While Seed-Prover and Seed-Prover 1.5 currently lead on MiniF2F-Test and PutnamBench, a rigorous comparison is precluded by its undisclosed and potentially much larger search budget.”

paper · Section 4.2

Reproducibility

Reproduction would be extremely difficult. While the model weights are open-sourced, the paper lacks: (1) the exact data mixture ratios for domain-mixed SFT; (2) the curriculum iteration schedule; (3) HisPO hyperparameters (δ_seq, δ_tok, ε values); (4) the specific prompts for sketch generation; (5) computational cost estimates. The reliance on an in-house 'DORA' system and proprietary Lean4 server infrastructure (Section 3.3) creates additional barriers. The legality detection AST checker is mentioned as open-sourced, but the hybrid-experts synthesis pipeline details are insufficient for replication.

“We built this training recipe on our Dynamic ORchestration for Asynchronous rollout (DORA) system, which is introduced in our prior works.”

paper · Section 3.3

Abstract

We introduce LongCat-Flash-Prover, a flagship 560-billion-parameter open-source Mixture-of- Experts (MoE) model that advances Native Formal Reasoning in Lean4 through agentic tool-integrated reasoning (TIR). We decompose the native formal reasoning task into three independent formal capabilities, i.e., auto-formalization, sketching, and proving. To facilitate these capabilities, we propose a Hybrid-Experts Iteration Framework to expand high-quality task trajectories, including generating a formal statement based on a given informal problem, producing a whole-proof directly from the statement, or a lemma-style sketch. During agentic RL, we present a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm, which aims to stabilize the MoE model training on such long-horizon tasks. It employs a gradient masking strategy that accounts for the policy staleness and the inherent train-inference engine discrepancies at both sequence and token levels. Additionally, we also incorporate theorem consistency and legality detection mechanisms to eliminate reward hacking issues. Extensive evaluations show that our LongCat-Flash-Prover sets a new state-of-the-art for open-weights models in both auto-formalization and theorem proving. Demonstrating remarkable sample efficiency, it achieves a 97.1% pass rate on MiniF2F-Test using only 72 inference budget per problem. On more challenging benchmarks, it solves 70.8% of ProverBench and 41.5% of PutnamBench with no more than 220 attempts per problem, significantly outperforming existing open-weights baselines.

Challenge the Review

Pick a starting point or write your own. Challenges run in the background, so you can keep reading while the AI investigates.

Challenges are public to read, but only signed-in members can post them. Your challenge text is stored with your account for moderation, but usernames are not shown in the public thread.

No challenges yet. Disagree with the review? Ask the AI to revisit a specific claim.