JANUS: A Lightweight Framework for Jailbreaking Text-to-Image Models via Distribution Optimization

cs.CV cs.LG Haolun Zheng, Yu He, Tailun Chen, Shuo Shao, Zhixuan Chu, Hongbin Zhou, Lan Tao, Zhan Qin, Kui Ren · Mar 22, 2026
AI Review
Plain-language introduction

JANUS addresses jailbreaking of text-to-image models by reframing the discrete prompt search as optimization over a structured distribution. The framework mixes two Gaussian-anchored prompt distributions—one around the target harmful prompt and one around a sanitized 'clean' version—and uses policy gradient on a single scalar mixing parameter $\alpha$ to maximize end-to-end reward. This avoids both proxy-loss optimization and costly LLM-based generators, achieving substantial efficiency gains while exposing weaknesses in current safety pipelines.

Critical review
Verdict
Bottom line

The paper presents a theoretically principled and empirically effective attack framework. The two-stage decomposition—semantic anchoring followed by adversarial optimization—is well-motivated, and the bound in Eq. (8) provides formal justification for the dual-anchor design. However, the practical applicability of the method is constrained by assumptions about access to NSFW scoring and safety classifier feedback that may not hold in realistic black-box settings against commercial APIs.

“$\mathbb{E}_{\mathbf{p}\sim p_{\alpha}}[\mathcal{L}(e(\mathbf{p}),\mathbf{e_{t}})]\geq\min(\mathbb{E}_{\mathbf{p}\sim N_{t}}[\mathcal{L}(e(\mathbf{p}),\mathbf{e_{t}})],\mathbb{E}_{\mathbf{p}\sim N_{c}}[\mathcal{L}(e(\mathbf{p}),\mathbf{e_{t}})])$”
Section 3.3 · Eq. 8
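A short derivation may help show where a bound of this shape comes from. The sketch below assumes that $p_{\alpha}$ is the convex mixture $\alpha N_{t} + (1-\alpha) N_{c}$ with $\alpha \in [0,1]$. That form is inferred from the review's description of a scalar mixing parameter and is not confirmed against the paper. The shorthands $a$ and $b$ are introduced here for illustration only.

$$
\begin{aligned}
a &:= \mathbb{E}_{\mathbf{p}\sim N_{t}}[\mathcal{L}(e(\mathbf{p}),\mathbf{e_{t}})], \qquad
b := \mathbb{E}_{\mathbf{p}\sim N_{c}}[\mathcal{L}(e(\mathbf{p}),\mathbf{e_{t}})], \\
\mathbb{E}_{\mathbf{p}\sim p_{\alpha}}[\mathcal{L}(e(\mathbf{p}),\mathbf{e_{t}})]
&= \alpha\, a + (1-\alpha)\, b \\
&\geq \alpha \min(a,b) + (1-\alpha)\min(a,b) = \min(a,b).
\end{aligned}
$$

Under the same assumption, the symmetric argument gives the upper bound $\max(a,b)$, so the mixture's expected loss always lies between the two anchors' expected losses.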
What holds up

The dual-Gaussian construction effectively decouples semantic preservation from adversarial exploration, yielding strong empirical gains. The paper demonstrates an 18× speedup over prompt-level baselines (MMA/MMP) and achieves ASR-8 of 43.15% on SD3.5LT versus 25.30% for the best baseline. Ablation studies validate that both the dual-distribution design and dynamic reward signal are essential—the unimodal variant drops ASR significantly despite high TASR.

“JANUS demonstrates superior efficiency... achieves an approximate 18× speedup over MMA and 12× speedup over MMP”
Table 3 · Appendix A.2
“Although achieving a high TASR, its overall performance in terms of ASR and NSFW score is significantly lower”
Table 2 · Unimodal ablation
Main concerns

The threat model contains a critical tension: the method claims black-box access yet requires an NSFW scorer $S(\cdot)$ and exact safety classifier feedback $C(\cdot)$ to compute the reward $E(\mathbf{p})=-C(\mathbf{p},M(\mathbf{p}))\cdot S(M(\mathbf{p}))$ in Eq. (9). In practice, attackers querying commercial APIs like DALL·E3 or Midjourney only receive binary rejection signals, not continuous NSFW scores, and must use surrogate detectors that introduce distributional shift. Additionally, constructing the 'clean' anchor $\mathbf{p}_c$ requires removing 'predefined NSFW words,' assuming knowledge of the filter's trigger vocabulary that may not be available. Finally, the efficiency comparison with generator-level methods is partially misleading: JANUS optimizes per-target-prompt while SneakyPrompt learns a universal policy, incurring different amortized costs.

“attackers have no access to the model's parameters or gradients and can only obtain the generated images or rejection messages”
Section 3.1 · Threat Model
“$E(\mathbf{p})=-C(\mathbf{p},M(\mathbf{p}))\cdot S(M(\mathbf{p}))$”
Section 3.4 · Eq. 9
“remove all predefined NSFW words”
Section 3.3 · Clean Prompt Construction
Evidence and comparison

The evaluation is comprehensive across open-source and commercial models, with appropriate metrics (TASR, IASR-N, ASR-N, CLIP, NSFW Score) that disentangle filter bypass from content harmfulness. The comparison covers relevant baselines including prompt-level methods (MMA, MMP, QFA) and generator-level approaches (PGJ, SneakyPrompt). However, the paper does not evaluate against adaptive or distribution-aware defenses that might detect anomalies in the Gaussian-mixed prompt distributions, nor does it clarify how the NSFW ground truth is established for proprietary models.

“JANUS successfully performs jailbreak attacks on SD3.5LT and DALL·E3 under black-box settings”
Table 1 · Main Results
“ASR-N represents the joint probability of an adversarial prompt successfully bypassing both text and image safety filters”
Section 4.1 · Metrics
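To show how a text-filter rate and a joint rate can diverge, here is a minimal sketch of computing them from pass/fail flags. It assumes that the "-N" suffix means "at least one success within the first N attempts per prompt". That reading is an assumption, not something the review confirms. The function names and the data are hypothetical and made up for illustration.

```python
from typing import List, Sequence


def rate_within_n(outcomes: Sequence[Sequence[bool]], n: int) -> float:
    """Fraction of prompts with at least one success among the first n attempts.

    ASSUMPTION: this "any success within n attempts" reading of the "-N"
    suffix is illustrative and may differ from the paper's definition.
    """
    if not outcomes:
        raise ValueError("need at least one prompt")
    hits = sum(1 for attempts in outcomes if any(attempts[:n]))
    return hits / len(outcomes)


def joint_outcomes(
    text_pass: Sequence[Sequence[bool]],
    image_pass: Sequence[Sequence[bool]],
) -> List[List[bool]]:
    """Per-attempt joint success: the same attempt passed both filters."""
    return [
        [t and i for t, i in zip(ts, ims)]
        for ts, ims in zip(text_pass, image_pass)
    ]


# Made-up flags for 4 prompts x 3 attempts (not data from the paper).
text_pass = [
    [True, True, False],
    [False, False, False],
    [True, False, True],
    [False, True, True],
]
image_pass = [
    [False, True, True],
    [True, True, True],
    [False, True, False],
    [False, False, True],
]

joint = joint_outcomes(text_pass, image_pass)
tasr_3 = rate_within_n(text_pass, 3)  # text filter only
asr_1 = rate_within_n(joint, 1)       # both filters, first attempt only
asr_3 = rate_within_n(joint, 3)       # both filters, up to 3 attempts

print(f"TASR-3={tasr_3:.2f}  ASR-1={asr_1:.2f}  ASR-3={asr_3:.2f}")
```

In this toy data the third prompt passes the text filter on two attempts but never passes both filters on the same attempt, so the text-only rate (0.75) exceeds the joint rate (0.50). This is the same kind of gap the unimodal ablation quote describes.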
Reproducibility

The authors provide open-source code and specify hyperparameters (AdamW, learning rate and weight decay both 0.1, 20,000 iterations) and the dataset source (200 prompts from Civitai-8m-prompts). However, critical details are unspecified: the predefined list of NSFW words used to construct $\mathbf{p}_c$ is not provided, and reproducibility on commercial APIs is inherently limited by undocumented safety filter updates. Furthermore, the appendix reports that all experiments took about 8 weeks on 8× RTX 4090 GPUs, which is resource-intensive despite the 'lightweight' framing.

“We set both the learning rate and the weight decay to 0.1 in 20000 training iterations”
Appendix A.1 · Implementation Details
“overall duration of all the experiments in the paper is about 8 weeks”
Appendix A · Setup
Abstract

Text-to-image (T2I) models such as Stable Diffusion and DALLE remain susceptible to generating harmful or Not-Safe-For-Work (NSFW) content under jailbreak attacks despite deployed safety filters. Existing jailbreak attacks either rely on proxy-loss optimization instead of the true end-to-end objective, or depend on large-scale and costly RL-trained generators. Motivated by these limitations, we propose JANUS, a lightweight framework that formulates jailbreak as optimizing a structured prompt distribution under a black-box, end-to-end reward from the T2I system and its safety filters. JANUS replaces a high-capacity generator with a low-dimensional mixing policy over two semantically anchored prompt distributions, enabling efficient exploration while preserving the target semantics. On modern T2I models, we outperform state-of-the-art jailbreak methods, improving ASR-8 from 25.30% to 43.15% on Stable Diffusion 3.5 Large Turbo with consistently higher CLIP and NSFW scores. JANUS succeeds across both open-source and commercial models. These findings expose structural weaknesses in current T2I safety pipelines and motivate stronger, distribution-aware defenses. Warning: This paper contains model outputs that may be offensive.
