Beyond Strict Pairing: Arbitrarily Paired Training for High-Performance Infrared and Visible Image Fusion
This paper challenges the long-held assumption that infrared and visible image fusion (IVIF) requires strictly paired training data. The authors propose UnPaired and Arbitrarily Paired Training Paradigms (UPTP and APTP), demonstrating that pixel-level self-supervision enables training on unaligned cross-modal combinations. By reformulating the maximum likelihood objective to treat infrared and visible images as independent variables, they show that a base dataset of $N$ pairs can be expanded to $N^2$ trainable combinations, potentially reducing collection costs while improving generalization.
The paper presents a compelling theoretical and empirical case for relaxing the strict pairing constraint in IVIF. The formulation of APTP as a superset containing both SPTP and UPTP (Eq. 14) is analytically elegant, and experiments across CNN, Transformer, and GAN architectures demonstrate that arbitrarily paired training achieves comparable metrics to strictly paired training on 100× larger datasets. The adaptive weighting mechanism $W(a;a,b)$ effectively handles cross-modal pixel selection without requiring spatial alignment.
The core insight, that pixel-level supervision (intensity, gradient, and SSIM losses) can be computed from arbitrarily paired sources without content consistency, is validated convincingly. The adaptive loss formulation using $W(a;a,b)=\frac{a}{a+b}$ allows dynamic blending ratios without hand-crafted fusion rules. Cross-dataset experiments (Table 2) provide strong evidence that the model learns content-independent relationships, achieving superior MI and VIF scores when training on mismatched datasets (M3FD infrared + MSRS visible) compared to single-dataset SPTP baselines.
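To make the adaptive weighting concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes $a$ and $b$ are per-pixel intensities in $[0,1]$ from the infrared and visible inputs, that the complementary weight is $W(b;a,b)=\frac{b}{a+b}$, and that the weights scale an L1 distance to each source. The function names `adaptive_weights` and `adaptive_intensity_loss` and the `eps` stabiliser are hypothetical.

```python
import numpy as np

def adaptive_weights(a, b, eps=1e-8):
    """Element-wise W(a;a,b) = a/(a+b) and the assumed complement b/(a+b)."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    total = a + b + eps  # eps (an assumption) avoids 0/0 where both pixels are zero
    return a / total, b / total

def adaptive_intensity_loss(fused, ir, vis):
    """Hypothetical weighted L1 loss; the paper's exact loss may differ."""
    fused = np.asarray(fused, dtype=np.float64)
    ir = np.asarray(ir, dtype=np.float64)
    vis = np.asarray(vis, dtype=np.float64)
    w_ir, w_vis = adaptive_weights(ir, vis)
    # Each term is computed per pixel position, so the two sources
    # only need matching shapes, not matching scene content.
    per_pixel = w_ir * np.abs(fused - ir) + w_vis * np.abs(fused - vis)
    return float(per_pixel.mean())

# Two tiny "images" that need not depict the same scene.
ir = np.array([[0.8, 0.2]])
vis = np.array([[0.2, 0.2]])
w_ir, w_vis = adaptive_weights(ir, vis)
loss = adaptive_intensity_loss(ir, ir, vis)  # fused output equal to the infrared input
```

Because the weights are computed independently at each pixel position, nothing in this sketch requires the two inputs to be spatially aligned, which is the property the review highlights.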
The claim of 100× data reduction is misleading: APTP with 150 base pairs generates 15000 combinations through recombination, but the model still only accesses 150 unique images per modality. This is data augmentation via permutation, not collection reduction. The theoretical independence assumption $p(x_i^{ir},x_j^{vis})=p(x_i^{ir})\cdot p(x_j^{vis})$ (Eq. 12) treats positional correspondence as irrelevant, yet the paper provides no analysis of failure modes when scenes are drastically different (e.g., indoor vs outdoor). Table 4 shows the proposed method outperforms SOTA, but uses cross-dataset training (MSRS+M3FD) while competitors use single datasets, confounding the comparison. The GAN baseline shows minimal improvement over SPTP (Table 1), suggesting the paradigm may not benefit all architectures equally.
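The counting argument behind this objection can be checked with a short Python sketch. It assumes only the figure of 150 base pairs quoted above; the index ranges stand in for image files and are purely illustrative.

```python
from itertools import product

n = 150  # base pairs, as quoted in the review

# Every infrared index combined with every visible index.
combos = list(product(range(n), range(n)))

n_combos = len(combos)                            # exhaustive N^2 grid
n_matched = sum(1 for i, j in combos if i == j)   # original strict pairs
n_mismatched = n_combos - n_matched               # cross-scene combinations
unique_ir = len({i for i, _ in combos})           # distinct infrared images seen
unique_vis = len({j for _, j in combos})          # distinct visible images seen

print(n_combos, n_matched, n_mismatched, unique_ir, unique_vis)
```

The exhaustive grid for $N=150$ contains $150^2 = 22500$ combinations, so a figure of 15000 would correspond to a subset of that grid; how such a subset is chosen is not stated here. Either way, the number of distinct images per modality stays at 150.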
The comparison to SOTA methods in Table 4 demonstrates competitive quantitative metrics (EN, MI, VIF) but lacks fair controls: the proposed baselines are trained on combined cross-dataset data while competitors use their original protocols. The paper correctly identifies that UPTP achieves 'nearly identical' performance to SPTP with 150 pairs, but this sample size is already substantial; the claim of handling 'severely limited' data would require validation on far fewer than 50 pairs. The theoretical framing that APTP encompasses SPTP is sound, but the experiments do not isolate whether performance gains stem from increased training iterations on synthetic pairs or from genuine diversity benefits.
The authors provide code at a GitHub repository and specify hyperparameters ($\alpha=1$, $\beta=0.2$, $\lambda=0.01$), enabling reproduction of the core experiments. However, the 'arbitrary pairing' protocol lacks implementation details: whether pairings are generated exhaustively ($N^2$), randomly sampled, or selected via specific strategies remains unspecified. Computational cost analysis comparing training time for 15000 SPTP pairs versus 15000 APTP combinations is absent. The supplementary material contains network architectures but does not detail the specific pairing generation algorithm or random seeds used for the expanded datasets.
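For illustration, the two simplest pairing protocols mentioned above (exhaustive and randomly sampled) could look like the sketch below. This is an assumption about what a reproducible protocol might be, not the procedure in the authors' repository; `exhaustive_pairs`, `sample_arbitrary_pairs`, and the `seed` parameter are hypothetical names.

```python
import random

def exhaustive_pairs(n_ir, n_vis):
    """All (ir_index, vis_index) combinations, n_ir * n_vis in total."""
    return [(i, j) for i in range(n_ir) for j in range(n_vis)]

def sample_arbitrary_pairs(n_ir, n_vis, k, seed=0):
    """Draw k distinct (ir_index, vis_index) combinations reproducibly."""
    total = n_ir * n_vis
    if k > total:
        raise ValueError(f"cannot draw {k} distinct pairs from {total}")
    rng = random.Random(seed)           # private RNG so the seed fully fixes the draw
    flat = rng.sample(range(total), k)  # sampling without replacement
    # Map each flat index back to a (row, column) = (ir, vis) index pair.
    return [divmod(f, n_vis) for f in flat]

pairs = sample_arbitrary_pairs(150, 150, 15000, seed=42)
```

Reporting the strategy together with the seed, as in this sketch, would be enough for a reader to regenerate the same expanded training set.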
Infrared and visible image fusion (IVIF) combines complementary modalities while preserving natural textures and salient thermal signatures. Existing solutions predominantly rely on extensive sets of rigidly aligned image pairs for training. However, acquiring such data is often impractical due to the costly and labour-intensive alignment process. Besides, maintaining a rigid pairing setting during training restricts the volume of cross-modal relationships, thereby limiting generalisation performance. To this end, this work challenges the necessity of Strictly Paired Training Paradigm (SPTP) by systematically investigating UnPaired and Arbitrarily Paired Training Paradigms (UPTP and APTP) for high-performance IVIF. We establish a theoretical objective of APTP, reflecting the complementary nature between UPTP and SPTP. More importantly, we develop a practical framework capable of significantly enriching cross-modal relationships even with severely limited and unaligned training data. To validate our propositions, three end-to-end lightweight baselines, alongside a set of innovative loss functions, are designed to cover three classic frameworks (CNN, Transformer, GAN). Comprehensive experiments demonstrate that the proposed APTP and UPTP are feasible and capable of training models on a severely limited and content-inconsistent infrared and visible dataset, achieving performance comparable to that of a dataset 100× larger in SPTP. This finding fundamentally alleviates the cost and difficulty of data collection while enhancing model robustness from the data perspective, delivering a feasible solution for IVIF studies. The code is available at https://github.com/yanglinDeng/IVIF_unpair.