Data-Free Layer-Adaptive Merging via Fisher Information for Long-to-Short Reasoning LLMs

cs.LG Tian Xia · Mar 23, 2026
AI Review
Plain-language introduction

This paper tackles the Long-to-Short (L2S) model merging problem: combining a base LLM with a long-chain-of-thought reasoning model to preserve accuracy while drastically reducing output length. The core contribution is a theoretical framework proving that merging error is bounded by the per-layer Hessian norm (Proposition 1), which motivates using the diagonal Fisher Information Matrix (FIM) as a data-free proxy for assigning layer-adaptive merging coefficients. The resulting FIM-TIES method achieves state-of-the-art results on 5 of 6 benchmarks without requiring any domain-specific calibration data.
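
To make "layer-adaptive merging coefficients" concrete, here is a minimal sketch, assuming the merge is a per-layer linear interpolation $\theta^l = \theta_0^l + \alpha^l \delta^l$ with task vector $\delta^l = \theta_{\text{long}}^l - \theta_0^l$. The layer names, weights, and coefficient values are hypothetical, and the trimming and sign-election steps of TIES are deliberately left out, so this is not the paper's FIM-TIES procedure.

```python
def merge_layerwise(base, reasoning, alphas):
    """Interpolate each layer separately: base_l + alpha_l * (reasoning_l - base_l)."""
    merged = {}
    for name, base_w in base.items():
        alpha = alphas[name]
        merged[name] = [b + alpha * (r - b) for b, r in zip(base_w, reasoning[name])]
    return merged


# Hypothetical two-layer models with flattened weights.
base = {"layer0": [0.0, 1.0], "layer1": [2.0, -2.0]}
reasoning = {"layer0": [1.0, 3.0], "layer1": [4.0, 0.0]}

# Hypothetical coefficients; a uniform merge would use one value for every layer.
alphas = {"layer0": 0.25, "layer1": 0.75}

merged = merge_layerwise(base, reasoning, alphas)
```

A uniform method such as Task Arithmetic corresponds to passing the same coefficient for every layer; the paper's contribution lies in how the per-layer values are chosen, which this sketch does not attempt to reproduce.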

Critical review
Verdict
Bottom line

The paper makes a compelling case for layer-adaptive merging via a principled theoretical framework. The connection between Hessian bounds and Fisher Information provides the first rigorous justification for why uniform merging fails in L2S settings with large parameter distances. The empirical results are strong—particularly the 6.2 point gain on MATH500 and 92.6% length reduction—though some theoretical assumptions (local optima, small task vectors) sit uneasily with the reality of merging large, heterogeneously fine-tuned models.

“merging error is bounded by a term proportional to the per-layer Hessian norm”
paper · Proposition 1
“FIM-TIES achieves state-of-the-art performance on five out of six evaluation benchmarks, including a +6.2 point gain on MATH500 over ACM-TIES”
paper · Abstract
What holds up

The theoretical proposition (Proposition 1) is correctly derived and formally establishes that $\mathcal{E}(\alpha) \leq \frac{\alpha(1-\alpha)}{2} \cdot \|\delta\|_2^2 \cdot \sup_{t \in [0,1]} \|H_f(\theta_0 + t\delta)\|_2$, directly motivating the need for layer-adaptive coefficients. The empirical discovery that random-token FIM captures merging difficulty is well-validated: the max/min FIM ratio exceeds $1700\times$ across transformer layers, providing strong discrimination without calibration data. The ablation study supports the design choice: the task-vector norm alone fails as an importance signal (producing near-uniform $\alpha^l \approx 0.53$ and underperforming Task Arithmetic), while the FIM-weighted product succeeds.

“the ratio of maximum to minimum FIM across transformer layers exceeds $1700\times$”
paper · Section 5.2
“Using the Frobenius norm of the task vector $\|\delta^l\|$ alone as importance signal produces near-uniform layer coefficients ($\alpha^l \approx 0.53$ for all $l$) and underperforms Task Arithmetic”
paper · Section 5.5
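The Proposition 1 bound can be checked numerically on a toy problem. The sketch below assumes that $\mathcal{E}(\alpha)$ measures the gap between the function at the merged point and the linear interpolation of its endpoint values (the review does not restate the paper's exact definition), and it uses a hypothetical quadratic with a diagonal Hessian so that the spectral norm is simply the largest diagonal entry.

```python
def quad(theta, diag_h):
    """f(theta) = 0.5 * sum_i h_i * theta_i^2, whose Hessian is diag(h)."""
    return 0.5 * sum(h * t * t for h, t in zip(diag_h, theta))


def linearity_gap(theta0, delta, diag_h, alpha):
    """|f(theta0 + alpha*delta) - [(1-alpha) f(theta0) + alpha f(theta0 + delta)]| (assumed error definition)."""
    merged = [t + alpha * d for t, d in zip(theta0, delta)]
    end = [t + d for t, d in zip(theta0, delta)]
    interp = (1 - alpha) * quad(theta0, diag_h) + alpha * quad(end, diag_h)
    return abs(quad(merged, diag_h) - interp)


def hessian_bound(delta, diag_h, alpha):
    """alpha(1-alpha)/2 * ||delta||^2 * ||H||_2 for a constant diagonal Hessian."""
    sq_norm = sum(d * d for d in delta)
    spectral_norm = max(abs(h) for h in diag_h)
    return 0.5 * alpha * (1 - alpha) * sq_norm * spectral_norm


# Hypothetical values chosen only for illustration.
theta0 = [1.0, -2.0, 0.5]
delta = [0.3, 0.1, -0.4]
diag_h = [4.0, 1.0, 0.25]

gap = linearity_gap(theta0, delta, diag_h, alpha=0.5)
bound = hessian_bound(delta, diag_h, alpha=0.5)
```

For a quadratic the gap equals $\frac{\alpha(1-\alpha)}{2}\,\delta^\top H \delta$ exactly, so it can only reach the bound when $\delta$ aligns with the highest-curvature direction. Both the gap and the bound scale with $\|\delta\|^2$ and with curvature, which is the intuition behind giving high-curvature layers different coefficients.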
Main concerns

First, the Fisher-Hessian equivalence $\mathcal{F}(\theta^*) = -\mathbb{E}_x[H_{\log p}(\theta^*)]$ assumes the base model is at a local minimum (Sec. 3.2), which may not hold for pre-trained models. Second, the bound assumes small $\|\delta\|$ (Appendix A: 'valid for $\|\delta\| \ll 1$'), yet the paper notes L2S involves 'large parameter distances' (Introduction) with task vectors varying over $5\times$ across layers. Third, while random-input FIM works empirically, the paper provides no theoretical justification for why random tokens should approximate the data distribution at the base model—this is presented as an empirical discovery without principled motivation. Finally, the greedy decoding results on AIME24 actually trail ACM-TIES at 7B (26.7% vs 33.3%), with gains only materializing after self-consistency decoding, suggesting the merged model sacrifices some reasoning coherence for efficiency.

“At a local minimum $\theta^*$ of the negative log-likelihood, the Fisher Information Matrix $\mathcal{F}(\theta^*)$ equals the expected Hessian”
paper · Section 3.2
“FIM-TIES (greedy) ... 26.7 ... ACM-TIES ... 33.3”
paper · Table 3
“Dropping the higher-order term (valid for $\|\delta\| \ll 1$”
paper · Appendix A
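The first concern can be illustrated with a toy model. The sketch below is not the paper's setting: it uses a hypothetical one-parameter Bernoulli model with $p = \sigma(\theta)$, for which the expected squared score matches the negative expected Hessian of the log-likelihood when the expectation is taken under the model's own distribution, and stops matching when the data come from a different distribution.

```python
import math


def sigmoid(theta):
    return 1.0 / (1.0 + math.exp(-theta))


def expected_sq_score(theta, q):
    """E_{x ~ Bernoulli(q)}[(d/dtheta log p(x|theta))^2] with p = sigmoid(theta)."""
    p = sigmoid(theta)
    # The score is (x - p): (1 - p) when x = 1 and (-p) when x = 0.
    return q * (1 - p) ** 2 + (1 - q) * p ** 2


def neg_expected_hessian(theta):
    """-E[d^2/dtheta^2 log p(x|theta)]; for this model it is p(1-p) for every x."""
    p = sigmoid(theta)
    return p * (1 - p)


theta = 0.8  # hypothetical parameter value
p_model = sigmoid(theta)

matched = expected_sq_score(theta, q=p_model)  # data follow the model
mismatched = expected_sq_score(theta, q=0.2)   # data follow another distribution
curvature = neg_expected_hessian(theta)
```

Here `matched` agrees with `curvature` up to floating-point error, while `mismatched` does not. How large the corresponding gap is for a pre-trained LLM probed with random tokens is an empirical question that this toy example cannot answer.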
Evidence and comparison

The evidence strongly supports the claims relative to baselines. Comparisons to Task Arithmetic, TIES-Merging, AIM, Sens-Merging, and ACM are fair: all use identical model pairs (Qwen2.5-Math and DeepSeek-R1-Distill at 1.5B/7B) and evaluation protocols from the official Qwen2.5-Math toolkit. The FIM-TIES improvements over ACM-TIES (+3.9 average accuracy at 1.5B, +6.2 on MATH500 at 7B) are large relative to the reported seed variance (std <0.3 across 4 seeds). The claim of being 'data-free' is substantiated against ACM's requirement for domain-specific calibration data. However, the computational-overhead comparison ('Low: 8 random forward+backward' vs ACM's 'High: corpus forward passes') glosses over the fact that FIM requires backward passes, which are more memory-intensive than ACM's forward-only activation statistics.

“Computational overhead ... Low (8 random forward+backward) ... High (corpus forward passes)”
paper · Table 1
“FIM-TIES results are averaged over 4 random seeds (std <0.3 on all benchmarks)”
paper · Table 2
Reproducibility

Experimental details are thorough: hyperparameters ($N=8$ random inputs, sequence length 64, temperature 0.3, seed 42), threshold ratios (0.2 for 1.5B, 0.4 for 7B), and the adaptive sharpness parameter $\theta_{\text{adapt}}$ are all specified. Error bars are reported for the 1.5B results. However, reproducibility is currently blocked by the unavailability of code—the checklist states 'Code will be released upon acceptance' and the paper mentions no public repository. Compute requirements (RTX 3090/4090, ~20-30 minutes CPU time for FIM computation on 7B) are documented, which would allow independent reproduction if the code were released.

“FIM hyperparameters. $N=8$ random inputs, sequence length 64, random seed 42”
paper · Section 5.1
“Code will be released upon acceptance”
paper · Checklist
“Experiments use NVIDIA RTX 3090 and RTX 4090 GPUs. FIM computation requires 8 forward+backward passes on the base model (approximately 20–30 minutes on CPU for 7B models)”
paper · Checklist
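Until the code is released, the general shape of a random-token diagonal FIM estimate can be sketched as follows. This is a toy stand-in, not the authors' implementation: the "model" is a hypothetical lookup table of next-token logits, labels are sampled from the model's own predictions (an assumption, since the review does not say how the paper picks targets), and only the defaults $N=8$, sequence length 64, and seed 42 come from the reported hyperparameters.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def diag_fisher_random_tokens(table, n_inputs=8, seq_len=64, seed=42):
    """Monte Carlo diagonal Fisher for a toy model whose logits for token t are table[t]."""
    rng = random.Random(seed)
    vocab = len(table)
    fisher = [[0.0] * vocab for _ in range(vocab)]
    for _ in range(n_inputs):
        for _ in range(seq_len):
            tok = rng.randrange(vocab)  # random input token, no calibration data
            probs = softmax(table[tok])
            # Assumption: the label is sampled from the model's own distribution.
            y = rng.choices(range(vocab), weights=probs)[0]
            for j in range(vocab):
                # Gradient of log p(y | tok) with respect to table[tok][j].
                grad = (1.0 if j == y else 0.0) - probs[j]
                fisher[tok][j] += grad * grad
    count = n_inputs * seq_len
    return [[v / count for v in row] for row in fisher]


# Hypothetical 4-token "model": row t holds the next-token logits for input t.
init_rng = random.Random(0)
table = [[init_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(4)]

fisher = diag_fisher_random_tokens(table)
# Summing the diagonal entries of one parameter group gives a per-group score.
group_score = sum(sum(row) for row in fisher)
```

In a real transformer the squared gradients would come from backward passes through the base model and be aggregated per layer, which is where the memory cost noted under "Evidence and comparison" arises.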
Abstract

Model merging has emerged as a practical approach to combine capabilities of specialized large language models (LLMs) without additional training. In the Long-to-Short (L2S) scenario, merging a base model with a long-chain-of-thought reasoning model aims to preserve reasoning accuracy while reducing output length. Existing methods rely on Task Arithmetic and its variants, which implicitly assume that model outputs vary linearly with the merging coefficient, an assumption we show is systematically violated in L2S settings. We provide the first theoretical justification for layer-adaptive merging: we prove that merging error is bounded by a term proportional to the per-layer Hessian norm (Proposition 1), and establish that the Fisher Information Matrix (FIM) is a principled, computable proxy for this bound via the Fisher-Hessian equivalence at local optima. Building on this theory, we propose FIM-Merging, which computes diagonal FIM using only random token inputs (no domain-specific calibration data required) and uses it to assign per-layer merging coefficients. On the 7B L2S benchmark, FIM-TIES achieves state-of-the-art performance on five out of six evaluation benchmarks, including a +6.2 point gain on MATH500 over ACM-TIES (90.2 vs. 84.0), while requiring no calibration data. On the 1.5B benchmark, FIM-TIES achieves an average accuracy of 47.3, surpassing the previous best ACM-TIES (43.3) by +3.9 points, while reducing average response length by 91.9% relative to the long-CoT model. Our framework also provides a unified theoretical explanation for why existing layer-adaptive methods such as ACM empirically outperform uniform merging.
