JANUS: A Lightweight Framework for Jailbreaking Text-to-Image Models via Distribution Optimization
JANUS addresses jailbreaking of text-to-image models by reframing the discrete prompt search as optimization over a structured distribution. The framework mixes two Gaussian-anchored prompt distributions—one around the target harmful prompt and one around a sanitized 'clean' version—and uses policy gradient on a single scalar mixing parameter $\alpha$ to maximize end-to-end reward. This avoids both proxy-loss optimization and costly LLM-based generators, achieving substantial efficiency gains while exposing weaknesses in current safety pipelines.
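The idea of running policy gradient on a single scalar mixing parameter can be illustrated with a toy numerical sketch. This is not the paper's implementation: the 1-D Gaussians, the synthetic reward, the sigmoid parameterization of $\alpha$, and the name `train_mixing_weight` are all assumptions made for illustration, and the review does not say whether $\alpha$ selects a mixture component or interpolates between anchors. The sketch assumes component selection and uses a plain REINFORCE update with a batch-mean baseline.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_mixing_weight(reward_fn, mu_a, mu_b, sigma=0.5,
                        steps=500, batch=64, lr=0.1, seed=0):
    """Toy REINFORCE on one scalar logit theta, with alpha = sigmoid(theta).

    Each sample picks anchor A with probability alpha (else anchor B),
    draws x from a Gaussian around that anchor, and receives a black-box
    scalar reward. Only theta is learned; the anchors stay fixed.
    """
    rng = np.random.default_rng(seed)
    theta = 0.0  # alpha starts at 0.5
    for _ in range(steps):
        alpha = sigmoid(theta)
        z = rng.random(batch) < alpha            # True -> anchor A
        means = np.where(z, mu_a, mu_b)
        x = means + sigma * rng.standard_normal(batch)
        r = reward_fn(x)                         # no gradient through reward
        baseline = r.mean()                      # variance reduction
        # For a Bernoulli with sigmoid logit, d/dtheta log P(z) = z - alpha.
        grad = np.mean((r - baseline) * (z.astype(float) - alpha))
        theta += lr * grad                       # gradient ascent on reward
    return sigmoid(theta)

# Synthetic reward that peaks at x = 1.0, i.e. near anchor A.
alpha = train_mixing_weight(lambda x: -(x - 1.0) ** 2, mu_a=1.0, mu_b=-1.0)
print(f"learned alpha: {alpha:.3f}")
```

Because only one parameter is updated and the reward is treated as a black box, each step costs one batch of reward evaluations, which is the source of the efficiency argument summarized above.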
The paper presents a theoretically principled and empirically effective attack framework. The two-stage decomposition—semantic anchoring followed by adversarial optimization—is well-motivated, and the bound in Eq. (8) provides formal justification for the dual-anchor design. However, the practical applicability of the method is constrained by assumptions about access to NSFW scoring and safety classifier feedback that may not hold in realistic black-box settings against commercial APIs.
The dual-Gaussian construction effectively decouples semantic preservation from adversarial exploration, yielding strong empirical gains. The paper demonstrates an 18× speedup over prompt-level baselines (MMA/MMP) and achieves ASR-8 of 43.15% on SD3.5LT versus 25.30% for the best baseline, a gain of 17.85 percentage points. Ablation studies validate that both the dual-distribution design and the dynamic reward signal are essential: the unimodal variant drops ASR significantly despite high TASR.
The threat model contains a critical tension: the method claims black-box access yet requires an NSFW scorer $S(\cdot)$ and exact safety classifier feedback $C(\cdot)$ to compute the reward $E(\mathbf{p})=-C(\mathbf{p},M(\mathbf{p}))\cdot S(M(\mathbf{p}))$ in Eq. (9). In practice, attackers querying commercial APIs like DALL·E 3 or Midjourney only receive binary rejection signals, not continuous NSFW scores, and must use surrogate detectors that introduce distributional shift. Additionally, constructing the 'clean' anchor $\mathbf{p}_c$ requires removing 'predefined NSFW words', which assumes knowledge of the filter's trigger vocabulary that may not be available. Finally, the efficiency comparison with generator-level methods is partially misleading: JANUS optimizes per target prompt while SneakyPrompt learns a universal policy, so the two incur different amortized costs and a per-prompt runtime comparison does not capture the full picture.
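The amortization point can be made concrete with a small cost model. The following is a sketch under stated assumptions: cost is measured in abstract "queries", the function names are hypothetical, and the numbers in the usage lines are invented for illustration rather than taken from the paper.

```python
import math

def per_prompt_total(n_prompts, cost_per_prompt):
    """Total cost of a method that re-optimizes for every target prompt."""
    return n_prompts * cost_per_prompt

def universal_total(n_prompts, training_cost, inference_cost):
    """Total cost of a method that trains once, then pays a small
    per-prompt inference cost."""
    return training_cost + n_prompts * inference_cost

def break_even_prompts(training_cost, cost_per_prompt, inference_cost):
    """Smallest number of prompts at which the train-once method is no
    more expensive than the per-prompt method; None if it never is."""
    saving = cost_per_prompt - inference_cost
    if saving <= 0:
        return None
    return math.ceil(training_cost / saving)

# Invented numbers, for illustration only.
n = break_even_prompts(training_cost=10_000, cost_per_prompt=120, inference_cost=20)
print(n, per_prompt_total(n, 120), universal_total(n, 10_000, 20))
```

Below the break-even count the per-prompt method is cheaper overall; above it the train-once method wins. A fair efficiency comparison would therefore report the number of target prompts alongside the per-prompt runtime.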
The evaluation is comprehensive across open-source and commercial models, with appropriate metrics (TASR, IASR-N, ASR-N, CLIP, NSFW Score) that disentangle filter bypass from content harmfulness. The comparison covers relevant baselines including prompt-level methods (MMA, MMP, QFA) and generator-level approaches (PGJ, SneakyPrompt). However, the paper does not evaluate against adaptive or distribution-aware defenses that might detect anomalies in the Gaussian-mixed prompt distributions, nor does it clarify how the NSFW ground truth is established for proprietary models.
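Since the review does not restate the metric definitions, the sketch below shows one common way a "success within N attempts" rate is computed. It assumes that ASR-N counts a prompt as successful if at least one of its first N generations is judged a success; that definition, the function name `success_rate_at_n`, and the sample data are assumptions, not the paper's specification.

```python
def success_rate_at_n(trials_per_prompt, n):
    """Fraction of prompts with at least one success in the first n trials.

    trials_per_prompt: list of lists of booleans, one inner list per prompt,
    where each boolean records whether that attempt was judged a success.
    """
    if not trials_per_prompt:
        raise ValueError("need at least one prompt")
    hits = sum(any(trials[:n]) for trials in trials_per_prompt)
    return hits / len(trials_per_prompt)

# Invented outcomes for four prompts with three attempts each.
outcomes = [
    [False, True, False],
    [False, False, False],
    [True, False, False],
    [False, False, True],
]
for n in (1, 2, 3):
    print(n, success_rate_at_n(outcomes, n))
```

Under this definition the rate is non-decreasing in N, so ASR-8 figures are only comparable across methods when every method is given the same attempt budget and the same success judge.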
The authors provide open-source code and specify hyperparameters (AdamW, lr=0.1, 20k iterations) and the dataset source (200 prompts from Civitai-8m-prompts). However, critical details are unspecified: the predefined list of NSFW words used to construct $\mathbf{p}_c$ is not provided, and reproducibility on commercial APIs is inherently limited by undocumented safety filter updates. Furthermore, the appendix acknowledges 8 weeks of GPU time on 8× RTX 4090 for all experiments, which is resource-intensive despite the 'lightweight' framing.
Text-to-image (T2I) models such as Stable Diffusion and DALL·E remain susceptible to generating harmful or Not-Safe-For-Work (NSFW) content under jailbreak attacks despite deployed safety filters. Existing jailbreak attacks either rely on proxy-loss optimization instead of the true end-to-end objective, or depend on large-scale and costly RL-trained generators. Motivated by these limitations, we propose JANUS, a lightweight framework that formulates jailbreak as optimizing a structured prompt distribution under a black-box, end-to-end reward from the T2I system and its safety filters. JANUS replaces a high-capacity generator with a low-dimensional mixing policy over two semantically anchored prompt distributions, enabling efficient exploration while preserving the target semantics. On modern T2I models, we outperform state-of-the-art jailbreak methods, improving ASR-8 from 25.30% to 43.15% on Stable Diffusion 3.5 Large Turbo with consistently higher CLIP and NSFW scores. JANUS succeeds across both open-source and commercial models. These findings expose structural weaknesses in current T2I safety pipelines and motivate stronger, distribution-aware defenses. Warning: This paper contains model outputs that may be offensive.