ADaFuSE: Adaptive Diffusion-generated Image and Text Fusion for Interactive Text-to-Image Retrieval
This paper addresses interactive text-to-image retrieval (I-TIR) where diffusion models generate visual proxies from dialogue, but static additive fusion of text and generated images introduces harmful noise. The core idea is ADaFuSE, a lightweight plug-in module combining adaptive gating (to dynamically weight modalities per instance) with a semantic-aware mixture-of-experts branch (to capture fine-grained cross-modal cues). The work matters because it challenges the assumption that diffusion-augmented retrieval always benefits from generated images, showing that up to 55.62% of queries suffer degradation under static fusion.
ADaFuSE presents a compelling solution to the overlooked problem of diffusion noise in multi-modal fusion. The dual-branch design—adaptive gating plus semantic-aware MoE—effectively addresses the limitation of fixed-weight fusion. Empirical results demonstrating a reduction in average rank drop from approximately 7500 to 20 for degraded queries are striking, though they warrant careful scrutiny given the magnitude. The method's plug-and-play nature and minimal parameter overhead (5.29% increase) make it a practical enhancement to existing pipelines.
The motivation is well-founded: the authors convincingly demonstrate that DAR's static fusion suffers from a degradation rate exceeding 50% from round 2 onwards. The adaptive gating mechanism ($\lambda_{n,i} \in (0,1)$) is theoretically sound for balancing text and diffusion-generated image reliability, while the MoE branch provides necessary capacity for non-linear cross-modal interactions that pure re-weighting cannot express. The consistent gains across four benchmarks (in-distribution and out-of-distribution) support generalizability, and the visualization in Figure 4 confirms that ADaFuSE assigns higher weights to generated images when their semantic alignment (cosine similarity) with text is high.
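To make the contrast between static and gated fusion concrete, here is a minimal sketch in plain Python. The provided text only states that $\lambda_{n,i} \in (0,1)$, so everything else is an illustrative assumption rather than the authors' implementation: the gate is taken to be a sigmoid over a linear projection of the concatenated embeddings, $\lambda$ is placed on the text view with $1-\lambda$ on the image view, and the function names and toy weights are hypothetical. The MoE branch is not sketched.

```python
import math


def sigmoid(x):
    """Logistic function; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def static_fusion(text_emb, image_emb):
    """Static additive fusion: both views always contribute with fixed weight."""
    return [t + v for t, v in zip(text_emb, image_emb)]


def gated_fusion(text_emb, image_emb, gate_weights, gate_bias=0.0):
    """Per-instance gated fusion (assumed form, not taken from the paper).

    lam   = sigmoid(w . [text; image] + b)
    fused = lam * text + (1 - lam) * image
    """
    features = list(text_emb) + list(image_emb)
    logit = sum(w * f for w, f in zip(gate_weights, features)) + gate_bias
    lam = sigmoid(logit)
    fused = [lam * t + (1.0 - lam) * v for t, v in zip(text_emb, image_emb)]
    return fused, lam


# Toy 2-d embeddings: the generated image points away from the text.
text_emb = [0.6, 0.8]
noisy_image_emb = [-0.8, 0.6]

# Hypothetical gate parameters; in practice these would be learned.
gate_weights = [1.0, 1.0, -1.0, 1.0]

static = static_fusion(text_emb, noisy_image_emb)
fused, lam = gated_fusion(text_emb, noisy_image_emb, gate_weights)

print("static fusion:", static)
print("gated fusion :", fused, "lambda =", lam)
```

Because the gate output lies strictly between 0 and 1, the fused vector is always a convex combination of the two views, so a single unreliable image cannot dominate the query the way it can under plain addition.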
The claimed reduction in average rank drop from ~7500 to ~20 raises questions about metric computation or outlier sensitivity: it is a 375-fold improvement, which seems disproportionate to the modest 3.49% absolute gain in Hits@10. The paper also lacks ablations isolating the MoE component from the gating mechanism. DAR serves as a non-adaptive baseline, but the intermediate configurations (gating alone, MoE alone) are missing. Additionally, the hyperparameters ($K$ experts, hidden dimensions $d'$ and $d_{hidden}$) are unspecified in the provided text. Finally, the DA-VisDial training set comes from the same author group as the hallucination analysis paper (Zhang et al., 2026), which raises a potential data leakage concern even though the test sets differ.
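The outlier concern can be illustrated with a short calculation. The sketch below first reproduces the ratio implied by the two approximate figures quoted above, then shows on a purely hypothetical list of rank drops (invented for illustration, not taken from the paper) how a mean can be dominated by a couple of extreme queries while the median stays small.

```python
from statistics import mean, median

# Approximate values quoted in the review text.
reported_before = 7500
reported_after = 20
ratio = reported_before / reported_after
print("implied improvement factor:", ratio)

# HYPOTHETICAL toy data: eight mildly degraded queries and two extreme ones.
toy_rank_drops = [4, 9, 15, 22, 30, 45, 60, 80, 29735, 45000]

toy_mean = mean(toy_rank_drops)
toy_median = median(toy_rank_drops)
print("toy mean rank drop  :", toy_mean)
print("toy median rank drop:", toy_median)
```

If the paper's distribution looks anything like this toy case, reporting the median or a trimmed mean alongside the average would make the robustness claim easier to interpret.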
The evidence supports the central claim that static fusion is suboptimal and that adaptive weighting helps. Figure 4 shows a clear correlation between semantic similarity and the assigned image weight. However, the comparison to Composed Image Retrieval (CIR) is somewhat misleading, since CIR uses real reference images while I-TIR uses synthetically generated ones, though the paper correctly notes this distinction. The comparison to DAR is fair (same backbone, same encoders), but the lack of comparison against other adaptive fusion mechanisms (e.g., cross-attention, FiLM layers without MoE) makes it hard to tell which component drives the gains.
The authors provide an anonymous code repository, which is commendable. However, critical training details are absent from the provided text: the number of experts $K$, projection dimensions ($d'$, $d_{hidden}$), batch size, learning rate, and training epochs are not specified. The method relies on the DA-VisDial dataset for training, which may not be publicly available (described in a 2026 paper). The training uses symmetric InfoNCE loss with pre-trained BLIP initialization, which is standard, but without exact hyperparameters, independent reproduction would require significant tuning.
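For readers unfamiliar with the objective, the following is a minimal sketch of a symmetric InfoNCE loss in plain Python. It assumes L2-normalized embeddings, in-batch negatives, and a temperature of 0.07; the temperature, the function names, and the toy batches are assumptions made for illustration, since the provided text does not give the training configuration.

```python
import math


def log_softmax_row(row):
    """Numerically stable log-softmax over one list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(query_embs, target_embs, temperature=0.07):
    """Average of query-to-target and target-to-query InfoNCE.

    Row i of query_embs is assumed to match row i of target_embs;
    every other row in the batch acts as a negative.
    """
    n = len(query_embs)
    # Scaled dot-product similarity matrix (n x n).
    sims = [
        [sum(a * b for a, b in zip(q, t)) / temperature for t in target_embs]
        for q in query_embs
    ]
    # Query -> target direction: softmax over each row.
    q2t = -sum(log_softmax_row(sims[i])[i] for i in range(n)) / n
    # Target -> query direction: softmax over each column.
    sims_t = [list(col) for col in zip(*sims)]
    t2q = -sum(log_softmax_row(sims_t[i])[i] for i in range(n)) / n
    return 0.5 * (q2t + t2q)


# Toy batch of two unit-norm embeddings.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_targets = [[1.0, 0.0], [0.0, 1.0]]
swapped_targets = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = symmetric_info_nce(queries, aligned_targets)
swapped_loss = symmetric_info_nce(queries, swapped_targets)
print("aligned loss:", aligned_loss)
print("swapped loss:", swapped_loss)
```

The loss is near zero when each query is closest to its own target and grows when the pairing is wrong. The result also depends on the temperature and batch size, which is one reason the missing hyperparameters matter for reproduction.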
Recent advances in interactive text-to-image retrieval (I-TIR) use diffusion models to bridge the modality gap between the textual information need and the images to be searched, resulting in increased effectiveness. However, existing frameworks fuse multi-modal views of user feedback by simple embedding addition. In this work, we show that this static and undifferentiated fusion indiscriminately incorporates generative noise produced by the diffusion model, leading to performance degradation for up to 55.62% of samples. We further propose ADaFuSE (Adaptive Diffusion-Text Fusion with Semantic-aware Experts), a lightweight fusion model designed to align and calibrate multi-modal views for diffusion-augmented I-TIR, which can be plugged into existing frameworks without modifying the backbone encoder. Specifically, we introduce a dual-branch fusion mechanism that employs an adaptive gating branch to dynamically balance modality reliability, alongside a semantic-aware mixture-of-experts branch to capture fine-grained cross-modal nuances. Via thorough evaluation over four standard I-TIR benchmarks, ADaFuSE achieves state-of-the-art performance, surpassing DAR by up to 3.49% in Hits@10 with only a 5.29% parameter increase, while exhibiting stronger robustness to noisy and longer interactive queries. These results show that generative augmentation coupled with principled fusion provides a simple, generalizable alternative to fine-tuning for interactive retrieval.