ADaFuSE: Adaptive Diffusion-generated Image and Text Fusion for Interactive Text-to-Image Retrieval

cs.IR cs.CV Zhuocheng Zhang, Xingwu Zhang, Kangheng Liang, Guanxuan Li, Richard Mccreadie, Zijun Long · Mar 23, 2026

Why it matters

The work matters because it challenges the assumption that diffusion-augmented retrieval always benefits from generated images, showing that up to 55.62% of queries suffer degradation under static fusion.

Plain-language introduction

This paper addresses interactive text-to-image retrieval (I-TIR) where diffusion models generate visual proxies from dialogue, but static additive fusion of text and generated images introduces harmful noise. The core idea is ADaFuSE, a lightweight plug-in module combining adaptive gating (to dynamically weight modalities per instance) with a semantic-aware mixture-of-experts branch (to capture fine-grained cross-modal cues). The work matters because it challenges the assumption that diffusion-augmented retrieval always benefits from generated images, showing that up to 55.62% of queries suffer degradation under static fusion.

Critical review

Bottom line

ADaFuSE presents a compelling solution to the overlooked problem of diffusion noise in multi-modal fusion. Its dual-branch design, which pairs adaptive gating with a semantic-aware MoE, addresses the main limitation of fixed-weight fusion. The reported reduction in average rank drop for degraded queries, from around 7,500 to around 20, is striking, though a change of that magnitude warrants careful scrutiny. The method's plug-and-play nature and small parameter overhead (a 5.29% increase) make it a practical enhancement to existing pipelines.

“ADaFuSE markedly reduces the negative impact of noisy images, with the average rank drop for instances degraded by including the image averaging around 20 ranks, down from around 7,500 for DAR”

ADaFuSE paper · Section 4.3

“with only a 5.29% parameter increase”

ADaFuSE paper · Abstract

What holds up

The motivation is well-founded: the authors convincingly demonstrate that DAR's static fusion suffers from a degradation rate exceeding 50% from round 2 onwards. The adaptive gating mechanism ($\lambda_{n,i} \in (0,1)$) is theoretically sound for balancing text and diffusion-generated image reliability, while the MoE branch provides necessary capacity for non-linear cross-modal interactions that pure re-weighting cannot express. The consistent gains across four benchmarks (in-distribution and out-of-distribution) support generalizability, and the visualization in Figure 4 confirms that ADaFuSE assigns higher weights to generated images when their semantic alignment (cosine similarity) with text is high.

“DAR results in a degradation rate exceeding 50% starting from round 2”

ADaFuSE paper · Section 3.1
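
As a rough illustration of how a degradation rate like the one quoted above could be computed, the sketch below counts the queries whose target image ranks worse after the generated image is fused in. It assumes the reference point is text-only retrieval and that ties do not count as degradation; the paper's exact definition may differ. The function name and the toy ranks are hypothetical.

```python
def degradation_rate(ranks_text_only, ranks_fused):
    """Fraction of queries whose target rank gets worse (larger) after fusion.

    Assumption: rank 1 is best, and a tie is not counted as degradation.
    """
    if len(ranks_text_only) != len(ranks_fused):
        raise ValueError("rank lists must have the same length")
    if not ranks_text_only:
        raise ValueError("rank lists must be non-empty")
    degraded = sum(1 for t, f in zip(ranks_text_only, ranks_fused) if f > t)
    return degraded / len(ranks_text_only)


# Toy ranks for four hypothetical queries (not data from the paper).
text_only = [3, 10, 1, 50]
fused = [5, 2, 1, 400]
rate = degradation_rate(text_only, fused)
print(f"degradation rate: {rate:.2%}")
```

Computing this per dialogue round would be one way to check the "from round 2 onwards" pattern independently.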

“$\lambda_{n,i}=\sigma(\mathbf{W}_{2}\cdot\delta(\mathbf{W}_{1}\mathbf{h}_{u}+\mathbf{b}_{1})+\mathbf{b}_{2})$ ... $\mathbf{z}_{n,i}^{base}=\lambda_{n,i}\cdot z_{n,i}^{T}+(1-\lambda_{n,i})\cdot z_{n,i}^{D}$”

ADaFuSE paper · Section 3.2, Eq. 4-5
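
The gating equations quoted above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the review text does not say how $\mathbf{h}_{u}$ is built or which activation $\delta$ is, so the sketch assumes $\mathbf{h}_{u}$ is the concatenation of the text and generated-image embeddings and that $\delta$ is ReLU. All dimensions and weights are made-up placeholders.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(z_text, z_img, W1, b1, W2, b2):
    """Per-instance convex combination of text and generated-image embeddings.

    Assumptions (not confirmed by the source): h_u is the concatenation of
    the two embeddings, and delta is ReLU.
    """
    h_u = np.concatenate([z_text, z_img], axis=-1)      # (batch, 2d)
    hidden = np.maximum(0.0, h_u @ W1.T + b1)           # (batch, d_hidden)
    lam = sigmoid(hidden @ W2.T + b2)                   # (batch, 1), in (0, 1)
    z_base = lam * z_text + (1.0 - lam) * z_img         # (batch, d)
    return lam, z_base


# Placeholder sizes and random weights, purely for illustration.
rng = np.random.default_rng(0)
d, d_hidden, batch = 8, 4, 3
z_text = rng.normal(size=(batch, d))
z_img = rng.normal(size=(batch, d))
W1 = rng.normal(size=(d_hidden, 2 * d)) * 0.1
b1 = np.zeros(d_hidden)
W2 = rng.normal(size=(1, d_hidden)) * 0.1
b2 = np.zeros(1)

lam, z_base = gated_fusion(z_text, z_img, W1, b1, W2, b2)
print(lam.ravel())
```

Because $\lambda_{n,i}$ is a scalar per instance, this branch can only interpolate between the two embeddings, which is the limitation the MoE branch is said to address.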

Main concerns

The claimed reduction in average rank drop from ~7,500 to ~20 raises questions about metric computation or outlier sensitivity: it is a roughly 375-fold improvement, which seems disproportionate to the gain of up to 3.49% in Hits@10. The paper lacks ablations that isolate the MoE component from the gating mechanism; DAR serves as a non-adaptive baseline, but intermediate configurations (gating alone, MoE alone) are missing. Additionally, the hyperparameters ($K$ experts, hidden dimensions $d'$ and $d_{hidden}$) are unspecified in the provided text. Finally, the DA-VisDial training set comes from the same author group as the hallucination analysis paper (Zhang et al., 2026), which raises a potential data leakage concern even though the test sets differ.

“Diffusion-Augmented Interactive Text-to-Image Retrieval (DAI-TIR) ... these generative views can be incorrect because diffusion generation may introduce hallucinated visual cues”

Zhang et al., 2026 · arXiv:2601.20391
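
The outlier-sensitivity concern about the average rank drop can be made concrete with a small example. The numbers below are invented purely for illustration and are not data from the paper; they show how a single catastrophic query can push a mean rank drop to 7,500 while the median stays in single digits.

```python
from statistics import mean, median

# Hypothetical rank drops for ten degraded queries (NOT from the paper):
# nine mild drops plus one query that falls almost to the bottom of the pool.
rank_drops = [2, 3, 5, 4, 6, 3, 2, 5, 4, 74966]

avg = mean(rank_drops)
med = median(rank_drops)
print(f"mean rank drop:   {avg}")
print(f"median rank drop: {med}")

# Ratio implied by the reported figures (~7,500 vs ~20).
reported_ratio = 7500 / 20
print(f"reported ratio:   {reported_ratio:.0f}x")
```

Reporting the median or a trimmed mean alongside the average would help readers judge how much of the reported gap comes from a few extreme queries.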

Evidence and comparison

The evidence supports the central claim that static fusion is suboptimal and that adaptive weighting helps. Figure 4 shows a clear correlation between semantic similarity and the assigned image weight. However, the comparison to Composed Image Retrieval (CIR) is somewhat misleading, since CIR uses real reference images while I-TIR uses synthetically generated ones, although the paper correctly notes this distinction. The comparison to DAR is fair (same backbone, same encoders), but the lack of comparison against other adaptive fusion mechanisms (e.g., cross-attention, FiLM layers without MoE) makes it hard to tell which component drives the gains.

“no prior work examines how to better fuse multi-modal query views for diffusion-augmented I-TIR”

ADaFuSE paper · Section 2

“as the semantic alignment (cosine similarity) between the text and generated image feature increases, ADaFuSE adaptively amplifies the contribution of the generated image”

ADaFuSE paper · Figure 4 caption and Section 4.4

Reproducibility

The authors provide an anonymous code repository, which is commendable. However, critical training details are absent from the provided text: the number of experts $K$, projection dimensions ($d'$, $d_{hidden}$), batch size, learning rate, and training epochs are not specified. The method relies on the DA-VisDial dataset for training, which may not be publicly available (described in a 2026 paper). The training uses symmetric InfoNCE loss with pre-trained BLIP initialization, which is standard, but without exact hyperparameters, independent reproduction would require significant tuning.

“We conduct our training on the DA-VisDial dataset ... optimize ADaFuSE in an end-to-end manner via the symmetric InfoNCE loss”

ADaFuSE paper · Section 4.1
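
For readers unfamiliar with the training objective, the following is a minimal NumPy sketch of a symmetric InfoNCE loss over a batch of fused query embeddings and target image embeddings. It shows the standard formulation, not the authors' code; the temperature of 0.07 is an assumed placeholder, since the review text does not state the value used.

```python
import numpy as np


def log_softmax(logits, axis):
    # Subtract the max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(queries, targets, temperature=0.07):
    """Average of query-to-target and target-to-query contrastive losses.

    Row i of `queries` is assumed to match row i of `targets`; every other
    row in the batch acts as a negative. The temperature is an assumption.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = q @ t.T / temperature                       # (batch, batch)
    idx = np.arange(len(q))
    loss_q2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2q = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_q2t + loss_t2q)


# Toy embeddings: aligned targets are noisy copies of the queries,
# shuffled targets pair each query with the wrong row.
rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 16))
aligned = queries + 0.01 * rng.normal(size=(4, 16))
shuffled = np.roll(aligned, 1, axis=0)

loss_aligned = symmetric_info_nce(queries, aligned)
loss_shuffled = symmetric_info_nce(queries, shuffled)
print(loss_aligned, loss_shuffled)
```

The loss is lower when matching pairs sit on the diagonal of the similarity matrix, which is the behavior the fusion module is trained toward.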

“The code used in this paper is publicly available at: https://anonymous.4open.science/r/ADaFuSE-E149/README.md”

ADaFuSE paper · Section 4.1

Abstract

Recent advances in interactive text-to-image retrieval (I-TIR) use diffusion models to bridge the modality gap between the textual information need and the images to be searched, resulting in increased effectiveness. However, existing frameworks fuse multi-modal views of user feedback by simple embedding addition. In this work, we show that this static and undifferentiated fusion indiscriminately incorporates generative noise produced by the diffusion model, leading to performance degradation for up to 55.62% samples. We further propose ADaFuSE (Adaptive Diffusion-Text Fusion with Semantic-aware Experts), a lightweight fusion model designed to align and calibrate multi-modal views for diffusion-augmented I-TIR, which can be plugged into existing frameworks without modifying the backbone encoder. Specifically, we introduce a dual-branch fusion mechanism that employs an adaptive gating branch to dynamically balance modality reliability, alongside a semantic-aware mixture-of-experts branch to capture fine-grained cross-modal nuances. Via thorough evaluation over four standard I-TIR benchmarks, ADaFuSE achieves state-of-the-art performance, surpassing DAR by up to 3.49% in Hits@10 with only a 5.29% parameter increase, while exhibiting stronger robustness to noisy and longer interactive queries. These results show that generative augmentation coupled with principled fusion provides a simple, generalizable alternative to fine-tuning for interactive retrieval.
