GIDE: Unlocking Diffusion LLMs for Precise Training-Free Image Editing

cs.CV Zifeng Zhu, Jiaming Han, Jiaxiang Zhao, Minnan Luo, Xiangyu Yue · Mar 22, 2026
AI Review
Plain-language introduction

GIDE addresses a key challenge in image editing: applying training-free editing techniques to Diffusion Large Language Models (DLLMs). Unlike continuous diffusion models where DDIM inversion is well-established, DLLMs use discrete tokenization that prevents direct application of standard noise inversion. GIDE introduces a three-stage framework (grounding, inversion, refinement) that enables precise localized editing via points, boxes, or text prompts while preserving background content. The significance lies in bridging discrete token spaces with high-fidelity inversion without additional training.

Critical review
Verdict
Bottom line

GIDE is a technically sound contribution that delivers on its core claims. The decomposition into Grounding (where to edit), Discrete Inversion (how to preserve structure), and Refinement (quality smoothing) forms a coherent pipeline. The discrete inversion mechanism with stochastic logit rectification ($\tilde{\boldsymbol{y}}_{t}=\hat{\boldsymbol{y}}_{t}+\lambda\cdot\boldsymbol{z}_{t}+(1-\lambda)\cdot\boldsymbol{g}$) is novel and addresses the non-deterministic nature of DLLM generation. Experiments on GIDE-Bench (805 cases) and ImgEdit-Bench demonstrate clear advantages over training-free baselines, with the authors claiming a 51.83% improvement in Semantic Correctness over DICE.

“GIDE significantly outperforms prior training-free methods, improving Semantic Correctness by 51.83% and Perceptual Quality by 50.39%”
GIDE paper · Abstract
“\tilde{\boldsymbol{y}}_{t}=\hat{\boldsymbol{y}}_{t}+\lambda\cdot\boldsymbol{z}_{t}+(1-\lambda)\cdot\boldsymbol{g}”
GIDE paper · Algorithm 1
“surpassing DICE by 51.83% in SC and 50.39% in PQ”
GIDE paper · Section 5.2
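
To make the rectification formula concrete, the following is a minimal sketch that only evaluates the quoted expression elementwise. It is not the authors' implementation. The review does not define $\boldsymbol{z}_{t}$ or $\boldsymbol{g}$, so the sketch assumes $\boldsymbol{z}_{t}$ is noise recorded during inversion and $\boldsymbol{g}$ is freshly sampled Gumbel noise; the function names and example values are hypothetical.

```python
import math
import random


def sample_gumbel(n, rng):
    """Draw n standard Gumbel(0, 1) samples via inverse-CDF sampling."""
    samples = []
    for _ in range(n):
        u = rng.random()
        # Clamp away from 0 and 1 so both logarithms stay finite.
        u = min(max(u, 1e-12), 1.0 - 1e-12)
        samples.append(-math.log(-math.log(u)))
    return samples


def rectify_logits(y_hat, z_t, g, lam):
    """Elementwise y_tilde = y_hat + lam * z_t + (1 - lam) * g."""
    if not (len(y_hat) == len(z_t) == len(g)):
        raise ValueError("y_hat, z_t and g must have the same length")
    return [y + lam * z + (1.0 - lam) * n for y, z, n in zip(y_hat, z_t, g)]


rng = random.Random(0)
y_hat = [2.0, 0.5, -1.0]   # hypothetical model logits for one token position
z_t = [0.3, -0.2, 0.1]     # ASSUMPTION: noise recorded during inversion
g = sample_gumbel(len(y_hat), rng)  # ASSUMPTION: fresh Gumbel noise
y_tilde = rectify_logits(y_hat, z_t, g, lam=0.2)
print(y_tilde)
```

Under this reading, $\lambda$ interpolates between the two noise sources: $\lambda=1$ uses only $\boldsymbol{z}_{t}$, and $\lambda=0$ uses only $\boldsymbol{g}$.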
What holds up

The modular three-stage design (Grounding → Inversion → Refinement) is well-justified and enables flexible support for diverse editing instructions. The grounding-aware masking strategy explicitly constrains edits to specified regions by setting scores outside the grounding mask $\mathbf{M}$ to $-\infty$, which directly addresses background leakage. The sinusoidal masking schedule ($n_{t}=\lfloor N\cdot\sin(\frac{\pi t}{2T})\rfloor$) is theoretically motivated for structure preservation. The ablation study (Table 3) provides strong evidence for each component's contribution, with spatial grounding showing the largest impact (589.55% MSE increase when removed).

“scores for tokens outside $\mathbf{M}$ are set to $-\infty$”
GIDE paper · Section 3.1
“n_{t}=\lfloor N\cdot\sin(\frac{\pi t}{2T})\rfloor”
GIDE paper · Section 3.1
“w/o spatial grounding: 8446.17 MSE (590%↑)”
GIDE paper · Table 3
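
The two mechanisms quoted above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's code: the schedule function evaluates $n_{t}=\lfloor N\cdot\sin(\frac{\pi t}{2T})\rfloor$ directly, and the masking function sets scores outside $\mathbf{M}$ to $-\infty$ before a top-k selection. The helper names, the token count `N = 1024`, and the top-k selection step are hypothetical; only `T = 64` echoes the timestep count reported in Appendix D.

```python
import math


def sinusoidal_schedule(num_tokens, total_steps):
    """Return n_t = floor(N * sin(pi * t / (2 * T))) for t = 0..T."""
    return [
        math.floor(num_tokens * math.sin(math.pi * t / (2 * total_steps)))
        for t in range(total_steps + 1)
    ]


def apply_grounding_mask(scores, mask):
    """Keep scores inside the grounding mask; set all others to -inf."""
    return [s if inside else -math.inf for s, inside in zip(scores, mask)]


def select_top_tokens(scores, k):
    """Indices of the k highest finite scores, best first."""
    candidates = [i for i, s in enumerate(scores) if s != -math.inf]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    return candidates[:k]


N, T = 1024, 64  # N is a hypothetical token count; T matches timesteps=64
schedule = sinusoidal_schedule(N, T)
print(schedule[0], schedule[T // 2], schedule[T])

scores = [0.9, 0.1, 0.5, 0.7]          # hypothetical per-token scores
mask = [False, True, True, False]      # hypothetical grounding mask M
masked = apply_grounding_mask(scores, mask)
print(select_top_tokens(masked, k=2))
```

Because masked-out positions carry $-\infty$, they can never be selected regardless of how high their raw score is, which is how the constraint keeps edits inside the grounded region.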
Main concerns

The evaluation relies heavily on GPT-4o and Gemini for the semantic correctness and perceptual quality metrics, which introduces potential evaluator bias. The authors acknowledge this limitation but present the scores without human validation. The claimed 51.83% improvement over DICE is dramatic, and Section 5.2 also reports DICE's MSE as 8323.89 vs GIDE's 1224.89. Yet DICE was self-implemented ("Since DICE does not open-source its code, we implement it by ourself"), which raises reproducibility concerns for this baseline comparison. The reliance on off-the-shelf segmentation models (SAM 3 for text/box, SAM 2 for points) introduces a dependency on external model quality; failures in grounding cascade to the inversion stage. The VQModel reconstruction artifacts (Appendix C) are an inherent limitation that the authors downplay as "temporary" but that materially affects high-frequency details.

“Since DICE does not open-source its code, we implement it by ourself based on a DLLM Lumina-DiMOO”
GIDE paper · Footnote 3
“DICE+Lumina-DiMOO: MSE 8323.89”
GIDE paper · Table 1
“subtle texture and color deviations become visible... These gentle variations illustrate how the current VQModel handles fine-grained details”
GIDE paper · Appendix 0.C
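
As a quick arithmetic check on the MSE figures quoted in this review, the sketch below recomputes the relative increases. It assumes the ablation percentage in Table 3 is measured against the full GIDE MSE of 1224.89; the function name is hypothetical.

```python
def relative_increase_pct(baseline, value):
    """Percentage increase of value over baseline."""
    return (value - baseline) / baseline * 100.0


GIDE_MSE = 1224.89          # full GIDE, as quoted from Section 5.2
NO_GROUNDING_MSE = 8446.17  # "w/o spatial grounding", Table 3
DICE_MSE = 8323.89          # DICE+Lumina-DiMOO, Table 1

ablation_pct = relative_increase_pct(GIDE_MSE, NO_GROUNDING_MSE)
dice_ratio = DICE_MSE / GIDE_MSE

print(f"w/o grounding: {ablation_pct:.2f}% higher MSE")
print(f"DICE MSE is {dice_ratio:.2f}x GIDE's")
```

Under that assumption the recomputed ablation figure agrees with the reported 589.55%, and the self-implemented DICE baseline has roughly 6.8 times GIDE's MSE.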
Evidence and comparison

The evidence supports GIDE's superiority over training-free baselines, but comparisons to end-to-end trained models require nuance. GIDE+Lumina-DiMOO achieves SC 4.47 vs Nano-Banana-1's 4.48 on GIDE-Bench, effectively matching this trained baseline. However, GPT-Image-1 (trained) achieves 4.71 SC despite worse background preservation (5080.22 MSE), a trade-off the authors acknowledge. The GIDE-Bench benchmark itself is a contribution, addressing compositional editing with 805 cases spanning point, box, and text modalities. The benchmark's dependence on "dynamic grounding" via GPT-4o for mask generation during evaluation may raise circularity concerns, since GIDE uses similar grounding mechanisms.

“GIDE+Lumina-DiMOO: 4.47 SC; Nano-Banana-1: 4.48 SC”
GIDE paper · Table 1
“GPT-Image-1: 4.71 SC, 5080.22 MSE”
GIDE paper · Table 1
“GIDE-Bench comprising 805 compositional editing scenarios guided by diverse multi-modal inputs”
GIDE paper · Section 4
Reproducibility

Reproducibility is mixed. The authors state that data and code are available ("Data and code are available at https://github.com/Zivenzhu/GIDE"), which should enable independent verification. Hyperparameters are specified in Appendix D (cfg_scale=4.0, timesteps=64, $\lambda=0.2$). However, critical implementation details for the DICE baseline, which underpins the primary quantitative claim, are unavailable because DICE has no open-source code. The authors' own DICE implementation may not reflect the method's full potential. The grounding module depends on SAM 3 and SAM 2, which requires external API access or model weights. The VLM-based evaluation (GPT-4o, Gemini) requires API access, and model version changes may affect reproducibility.

“Data and code are available at https://github.com/Zivenzhu/GIDE”
GIDE paper · Abstract footnote
“cfg_scale = 4.0, timesteps = 64, temperature = 1.0”
GIDE paper · Appendix 0.D
“Since DICE does not open-source its code, we implement it by ourself”
GIDE paper · Section 5.1 footnote 3
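
One way to make a replication attempt auditable is to pin the reported settings in a single record. The sketch below is a hypothetical structure, not taken from the GIDE repository: the class and field names are assumptions, the numeric defaults are the values cited in this review from Appendix D, and `evaluator_snapshot` is a placeholder for whichever GPT-4o or Gemini version an evaluator actually uses.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class GideRunConfig:
    """Hypothetical record of the settings cited from Appendix D."""
    cfg_scale: float = 4.0
    timesteps: int = 64
    temperature: float = 1.0
    lam: float = 0.2  # the lambda in the logit rectification formula
    # ASSUMPTION: placeholder only; the review names no model snapshot.
    evaluator_snapshot: str = "UNSPECIFIED"


config = GideRunConfig()
# Serialize with sorted keys so two runs can be diffed line by line.
print(json.dumps(asdict(config), sort_keys=True, indent=2))
```

Recording the evaluator snapshot alongside the sampling hyperparameters addresses the version sensitivity noted above, since VLM-judged scores are only comparable when the judge is the same.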
Abstract

While Diffusion Large Language Models (DLLMs) have demonstrated remarkable capabilities in multi-modal generation, performing precise, training-free image editing remains an open challenge. Unlike continuous diffusion models, the discrete tokenization inherent in DLLMs hinders the application of standard noise inversion techniques, often leading to structural degradation during editing. In this paper, we introduce GIDE (Grounded Inversion for DLLM Image Editing), a unified framework designed to bridge this gap. GIDE incorporates a novel Discrete Noise Inversion mechanism that accurately captures latent noise patterns within the discrete token space, ensuring high-fidelity reconstruction. We then decompose the editing pipeline into grounding, inversion, and refinement stages. This design enables GIDE supporting various editing instructions (text, point and box) and operations while strictly preserving the unedited background. Furthermore, to overcome the limitations of existing single-step evaluation protocols, we introduce GIDE-Bench, a rigorous benchmark comprising 805 compositional editing scenarios guided by diverse multi-modal inputs. Extensive experiments on GIDE-Bench demonstrate that GIDE significantly outperforms prior training-free methods, improving Semantic Correctness by 51.83% and Perceptual Quality by 50.39%. Additional evaluations on ImgEdit-Bench confirm its broad applicability, demonstrating consistent gains over trained baselines and yielding photorealistic consistency on par with leading models.
