Beyond Strict Pairing: Arbitrarily Paired Training for High-Performance Infrared and Visible Image Fusion

cs.CV Yanglin Deng, Tianyang Xu, Chunyang Cheng, Hui Li, Xiao-jun Wu, Josef Kittler · Mar 23, 2026

AI Review

Plain-language introduction

This paper challenges the long-held assumption that infrared and visible image fusion (IVIF) requires strictly paired training data. The authors propose UnPaired and Arbitrarily Paired Training Paradigms (UPTP and APTP), demonstrating that pixel-level self-supervision enables training on unaligned cross-modal combinations. By reformulating the maximum likelihood objective to treat infrared and visible images as independent variables, they show that a base dataset of $N$ pairs can be expanded to $N^2$ trainable combinations, potentially reducing collection costs while improving generalization.

Critical review

Verdict

Bottom line

The paper presents a compelling theoretical and empirical case for relaxing the strict pairing constraint in IVIF. The formulation of APTP as a superset containing both SPTP and UPTP (Eq. 14) is analytically elegant, and experiments across CNN, Transformer, and GAN architectures demonstrate that arbitrarily paired training achieves comparable metrics to strictly paired training on 100× larger datasets. The adaptive weighting mechanism $W(a;a,b)$ effectively handles cross-modal pixel selection without requiring spatial alignment.

“$\mathcal{D}_{\text{APTP}}=\mathcal{D}_{\text{SPTP}}\cup\mathcal{D}_{\text{UPTP}}\quad\text{with}\quad\begin{cases}\mathcal{D}_{\text{SPTP}}=\{(x_{i},x_{j})\mid i=j\},\\ \mathcal{D}_{\text{UPTP}}=\{(x_{i},x_{j})\mid i\neq j\}.\end{cases}$”

paper · Eq. 14
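
The set relation in Eq. 14 can be illustrated with a minimal index-level sketch. It assumes one infrared and one visible image per index and enumerates every combination; the function name `partition_pairs` is hypothetical and not taken from the authors' code.

```python
from itertools import product


def partition_pairs(n):
    """Split all (ir_index, vis_index) pairs over n images per modality.

    Returns (sptp, uptp, aptp), mirroring D_APTP = D_SPTP ∪ D_UPTP.
    """
    aptp = list(product(range(n), repeat=2))     # every (i, j) combination
    sptp = [(i, j) for i, j in aptp if i == j]   # strictly paired: i == j
    uptp = [(i, j) for i, j in aptp if i != j]   # unpaired: i != j
    return sptp, uptp, aptp


sptp, uptp, aptp = partition_pairs(4)
print(len(sptp), len(uptp), len(aptp))  # 4 12 16
```

The two subsets are disjoint and together cover all $N^2$ combinations, which is why SPTP and UPTP appear as special cases of APTP.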

“achieving performance comparable to that of a dataset 100$\times$ larger in SPTP”

paper · Abstract

What holds up

The core insight, that pixel-level loss supervision (intensity, gradient, SSIM) can be computed from arbitrarily paired sources without content consistency, is validated convincingly. The adaptive loss formulation using $W(a;a,b)=\frac{a}{a+b}$ allows dynamic blending ratios without hand-crafted fusion rules. Cross-dataset experiments (Table 2) provide strong evidence that the model learns content-independent relationships, achieving superior MI and VIF scores when training on mismatched datasets (M3FD infrared + MSRS visible) compared to single-dataset SPTP baselines.

“$W(a;a,b)=\frac{a}{a+b}$”

paper · Section 3.1
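
A minimal NumPy sketch of the weighting function quoted above may help. It assumes `a` and `b` are non-negative per-pixel maps of the same shape (for example intensities); the review does not say which quantities the paper feeds into $W$. The `eps` guard against division by zero and the blended target are illustrative additions, not the paper's loss.

```python
import numpy as np


def adaptive_weight(a, b, eps=1e-8):
    """Elementwise W(a; a, b) = a / (a + b).

    eps is an assumption added here to avoid 0/0 where both inputs are zero.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return a / (a + b + eps)


# Toy 2x2 "images"; they only need matching shapes, not matching content.
ir = np.array([[0.8, 0.1], [0.5, 0.0]])
vis = np.array([[0.2, 0.3], [0.5, 0.0]])

w_ir = adaptive_weight(ir, vis)    # weight of the infrared pixel
w_vis = adaptive_weight(vis, ir)   # weight of the visible pixel
blend = w_ir * ir + w_vis * vis    # illustrative per-pixel weighted target
```

Because the weights are computed per pixel from the two inputs alone, nothing in this step depends on the two images showing the same scene.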

“M3FD infrared + MSRS visible: 6.57 MI vs MSRS+MSRS SPTP: 2.57 MI”

paper · Table 2

Main concerns

The claim of 100× data reduction is misleading: APTP with 150 base pairs generates 15,000 combinations through recombination, but the model still only accesses 150 unique images per modality. This is data augmentation via permutation, not collection reduction. The theoretical independence assumption $p(x_i^{ir},x_j^{vis})=p(x_i^{ir})\cdot p(x_j^{vis})$ (Eq. 12) treats positional correspondence as irrelevant, yet the paper provides no analysis of failure modes when scenes are drastically different (e.g., indoor vs. outdoor). Table 4 shows the proposed method outperforming SOTA, but the proposed method uses cross-dataset training (MSRS+M3FD) while competitors use single datasets, confounding the comparison. The GAN baseline shows minimal improvement over SPTP (Table 1), suggesting the paradigm may not benefit all architectures equally.

“$p(x_{i}^{ir},x_{j}^{vis})=p(x^{ir}_{i})\cdot p(x^{vis}_{j})$”

paper · Eq. 12

“trained on 150 paired images from MSRS and M3FD”

paper · Table 4 caption

“GAN baseline: SPTP (15000 pairs) vs APTP (150 pairs expanded) shows nearly identical performance”

paper · Table 1
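
As a rough check on the counts discussed in this section, the set sizes can be worked out under the assumption of $N=150$ images per modality drawn from one base set (the paper's own counting may differ):

$$
\begin{aligned}
|\mathcal{D}_{\text{SPTP}}| &= N = 150,\\
|\mathcal{D}_{\text{UPTP}}| &= N^{2}-N = 22{,}350,\\
|\mathcal{D}_{\text{APTP}}| &= N^{2} = 22{,}500,\\
100\times N &= 15{,}000 = \tfrac{2}{3}\,N^{2}.
\end{aligned}
$$

Under this assumption, the 100× expanded set would cover two thirds of all possible combinations while still drawing on the same 150 unique images per modality.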

Evidence and comparison

The comparison to SOTA methods in Table 4 demonstrates competitive quantitative metrics (EN, MI, VIF) but lacks fair controls: the proposed baselines are trained on combined cross-dataset sources while competitors use their original protocols. The paper correctly identifies that UPTP achieves 'nearly identical' performance to SPTP with 150 pairs, but this sample size is already substantial; the claim of handling 'severely limited' data would require validation on far fewer than 50 pairs. The theoretical framing that APTP encompasses SPTP is sound, but experiments do not isolate whether performance gains stem from increased training iterations on synthetic pairs versus genuine diversity benefits.

“the fusion results under SPTP and UPTP are nearly identical across five evaluation metrics”

paper · Section 4.2.1

“using 150 paired images from MSRS and M3FD as visible and infrared source images, and expand the training set to 15,000 pairs”

paper · Section 4.3

Reproducibility

The authors provide code at a GitHub repository and specify hyperparameters ($\alpha=1$, $\beta=0.2$, $\lambda=0.01$), enabling reproduction of the core experiments. However, the 'arbitrary pairing' protocol lacks implementation details: whether pairings are generated exhaustively ($N^2$), randomly sampled, or selected via specific strategies remains unspecified. Computational cost analysis comparing training time for 15,000 SPTP pairs versus 15,000 APTP combinations is absent. The supplementary material contains network architectures but does not detail the specific pairing generation algorithm or random seeds used for the expanded datasets.

“The code is available at https://github.com/yanglinDeng/IVIF_unpair”

paper · Abstract

“hyperparameter settings for all experiments are the same, including $\alpha=1$, $\beta=0.2$, and $\lambda=0.01$”

paper · Section 4.1

“recombining source images... to 50$\times$ and 100$\times$ of its initial size”

paper · Section 3.4
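
To make the reproducibility gap concrete, here is one way a pairing protocol could be pinned down: seeded sampling without replacement from the $N \times N$ index grid. This is a sketch under stated assumptions and not the authors' procedure; `sample_pairs` and its parameters are hypothetical and do not come from the linked repository.

```python
import random


def sample_pairs(n, k, seed=0, include_strict=True):
    """Draw k distinct (ir_index, vis_index) pairs from the n*n grid.

    A fixed seed makes the expanded training set reproducible.
    include_strict=False drops the aligned pairs (i == j), giving a
    purely unpaired sample.
    """
    rng = random.Random(seed)
    if include_strict:
        candidates = range(n * n)                 # flat index over the grid
    else:
        candidates = [c for c in range(n * n) if c // n != c % n]
    if k > len(candidates):
        raise ValueError("k exceeds the number of available combinations")
    flat = rng.sample(candidates, k)              # sampling without replacement
    return [(c // n, c % n) for c in flat]        # decode to (ir, vis) indices


# Example: expand 150 images per modality to a 100x set of 15,000 pairs.
pairs = sample_pairs(150, 15_000, seed=0)
```

Reporting the seed and the choice between exhaustive and sampled pairing would be enough to regenerate the exact expanded set.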

Abstract

Infrared and visible image fusion (IVIF) combines complementary modalities while preserving natural textures and salient thermal signatures. Existing solutions predominantly rely on extensive sets of rigidly aligned image pairs for training. However, acquiring such data is often impractical due to the costly and labour-intensive alignment process. Besides, maintaining a rigid pairing setting during training restricts the volume of cross-modal relationships, thereby limiting generalisation performance. To this end, this work challenges the necessity of Strictly Paired Training Paradigm (SPTP) by systematically investigating UnPaired and Arbitrarily Paired Training Paradigms (UPTP and APTP) for high-performance IVIF. We establish a theoretical objective of APTP, reflecting the complementary nature between UPTP and SPTP. More importantly, we develop a practical framework capable of significantly enriching cross-modal relationships even with severely limited and unaligned training data. To validate our propositions, three end-to-end lightweight baselines, alongside a set of innovative loss functions, are designed to cover three classic frameworks (CNN, Transformer, GAN). Comprehensive experiments demonstrate that the proposed APTP and UPTP are feasible and capable of training models on a severely limited and content-inconsistent infrared and visible dataset, achieving performance comparable to that of a dataset 100$\times$ larger in SPTP. This finding fundamentally alleviates the cost and difficulty of data collection while enhancing model robustness from the data perspective, delivering a feasible solution for IVIF studies. The code is available at https://github.com/yanglinDeng/IVIF_unpair.
