Structural Sensitivity in Compressed Transformers: Error Propagation, Lyapunov Stability, and Formally Verified Bounds
This paper investigates why compressing different weight matrices in transformers leads to wildly different outcomes—from negligible impact to 20,000× perplexity increases. The authors map this structural sensitivity across five architectures, revealing that early-layer MLP up-projections are catastrophically fragile while value projections are nearly free to compress. Using Lyapunov stability theory, they explain how residual connections contract errors, and they provide machine-checked formal bounds in Lean 4 to guarantee per-matrix approximation quality.
The paper offers a compelling integration of empirical sensitivity analysis, dynamical systems theory, and formal verification. The claim that "stability is necessary but not sufficient" for compression robustness is well-supported: Qwen3-8B ($\rho_{\max}=1.33$) collapses at 42,922× degradation while Mistral-7B ($\rho_{\max}=1.14$) degrades only 11×, yet GPT-2 Small ($\rho_{\max}=0.96$) degrades 120× despite fully contracting dynamics. However, the formal guarantees are intentionally scoped to linear matrix multiplications, leaving the full nonlinear dynamics (GELU, softmax) analyzed only empirically.
The structural sensitivity hierarchy is robustly demonstrated: "early-layer MLP up-projections are catastrophically sensitive while value projections compress nearly for free" with a consistent ordering (mlp_fc $\gg$ Q $>$ K $>$ mlp_proj $>$ attn_proj $>$ V) spanning five orders of magnitude. The Lyapunov analysis is novel and well-supported, showing that residual connections contract errors via hidden state growth ($\alpha$) exceeding error growth ($\beta$), yielding $\rho_{\max}=0.96$ for GPT-2 Small. The formal verification is rigorous: ten Lean 4 theorems with no sorry markers validate per-matrix bounds, and the additive composition "no multiplicative interaction between stages" enables independent tuning.
The primary limitation is the scope of formal guarantees. As noted in Section 7, the framework "proves bounds on individual matrix multiplications (which are exactly linear), not on full transformer layers including GELU, softmax, and LayerNorm." While the empirical validation shows zero violations across 14,040+ configurations, the contraction factor $\rho$ is "measured empirically rather than derived analytically from the weight matrices alone." Additionally, the practical compression results (24% FLOPS savings at +27% perplexity) lag behind state-of-the-art methods like SparseGPT, though the authors correctly position their contribution as structural characterization rather than raw compression performance.
The evidence strongly supports the central claims. The five-orders-of-magnitude sensitivity range is demonstrated by the finding that "compressing layer 0's MLP up-projection alone—one matrix out of 468—increases perplexity by 20,000×." Cross-architecture validation across 117M–8B parameters confirms hierarchy stability across WikiText-103 and C4 at scales from 2K to 51K tokens. Comparisons to related work are fair: the authors acknowledge that SparseGPT achieves better absolute compression but note their contribution is explaining "why non-uniform allocation is essential and which components to protect." The formally verified bounds distinguish this work from prior empirical compression methods.
Reproducibility is strong on the formal side but demanding on the empirical side. The "ten machine-checked Lean 4 theorems formalize per-matrix error bounds with no sorry markers" and all produce "zero violations across 14,040+ configurations," confirming conservative bounds. The supplementary material includes LivingInference.lean and Python implementations. However, reproducing the full cross-architecture sweep (including 8B models like Mistral-7B and Qwen3-8B) requires substantial compute. Hyperparameters are detailed (sparsity levels 1–50%, ranks 4–256, token counts), though the exact硬件 specifications and runtime are not specified.
A single matrix out of 468 in GPT-2 Small can increase perplexity by 20,000x when compressed, revealing that transformer compression sensitivity spans five orders of magnitude. We map this sensitivity landscape across five architectures (117M-8B parameters), finding a consistent hierarchy: early-layer MLP up-projections are catastrophically sensitive while value projections compress nearly for free. This hierarchy is stable across compression levels, evaluation scales (2K-51K tokens), and datasets (WikiText-103, C4). Using Lyapunov stability theory, we show that residual connections contract compression errors by growing the hidden state faster than the error. Error contraction is necessary but not sufficient for compression tolerance: architecture-specific redundancy plays an equally important role, as demonstrated by the hybrid LFM2-2.6B degrading only 7x despite higher amplification than the fully-contracting GPT-2 Small (120x). Ten machine-checked Lean 4 theorems formalize per-matrix error bounds with no sorry markers; all bounds produce zero violations across 14,040+ configurations. We validate with downstream task evaluation (HellaSwag, ARC-Easy, Winogrande), activation-aware pruning on two architectures, and a Compression Fragility Index that rank-orders model robustness.
Pick a starting point or write your own. Challenges run in the background, so you can keep reading while the AI investigates.
No challenges yet. Disagree with the review? Ask the AI to revisit a specific claim.