Reward Sharpness-Aware Fine-Tuning for Diffusion Models
This paper addresses reward hacking in reward-centric diffusion reinforcement learning (RDRL), where diffusion models exploit non-robust reward models to achieve high scores without actual perceptual quality improvements. The authors propose RSA-FT (Reward Sharpness-Aware Fine-Tuning), which mitigates hacking by flattening the reward landscape through joint perturbations in both image space (adversarial training) and parameter space (Sharpness-Aware Minimization). The method is plug-and-play, compatible with existing RDRL frameworks like ReFL and DRaFT, and shows consistent gains across SD1.5, SDXL, SD3, and Flux backbones.
RSA-FT presents a principled and empirically effective solution to reward hacking in diffusion models. By framing reward hacking as an adversarial robustness problem and leveraging the duality between SAM and AT, the authors deliver a method that consistently improves performance across SD1.5, SDXL, SD3, and Flux backbones when integrated with ReFL, DRaFT, AlignProp, and DRTune. The strong negative correlation between reward sharpness and human preference metrics (Pearson $r_{corr} = -0.802$) provides compelling empirical justification for the core hypothesis.
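For readers who want to run the same kind of check on their own measurements, the reported statistic is an ordinary Pearson correlation between per-model sharpness values and preference scores. The sketch below is a minimal stdlib implementation; the function name `pearson_r` and the two input lists are illustrative placeholders, not the paper's code or data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Made-up placeholder values, NOT measurements from the paper:
# preference falls linearly as sharpness rises, so r is -1 up to rounding.
sharpness = [0.1, 0.2, 0.3, 0.4]
preference = [0.9, 0.7, 0.5, 0.3]
r = pearson_r(sharpness, preference)
```

A value near -1, as with the placeholder lists here, means that higher sharpness goes with lower preference almost linearly; the paper's reported -0.802 is weaker than that but still a strong negative relationship.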
The theoretical framing connecting reward hacking to adversarial examples is sound and well supported by the observation that "reward models tend to be non-robust in regions where their loss landscape is sharp" (Sec. 2). The empirical validation in Figure 4 demonstrates a strong negative correlation between sharpness $S_1$ and proxy human preference metrics. The method's simplicity and broad compatibility are major strengths; Algorithm 1 shows a clean two-step perturbation process that requires no architectural changes or reward model retraining. Ablation studies in Table 6 confirm that both image-space (AT) and parameter-space (SAM) perturbations independently improve results, with joint application yielding synergistic gains.
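To make the two-step idea concrete, the following is a toy sketch of one plausible reading of a joint parameter-space and image-space perturbation. It is not the paper's Algorithm 1. Everything in it is an assumption made for illustration: the two-pixel `reward`, the `generate` map, the finite-difference gradients, the order of the two steps, and the names `rho_param` and `rho_image` (no mapping onto the paper's $\rho$ and $\rho_\omega$ is implied).

```python
import math

def reward(image):
    # Hypothetical toy reward over a 2-pixel "image": a smooth bump centred at (1, 1).
    return math.exp(-sum((p - 1.0) ** 2 for p in image))

def generate(theta):
    # Hypothetical toy generator mapping parameters to pixels.
    return [2.0 * math.tanh(t) for t in theta]

def num_grad(f, z, h=1e-6):
    # Central finite-difference gradient of a scalar function f at the list z.
    grad = []
    for i in range(len(z)):
        up, down = list(z), list(z)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad

def scaled_unit(g, radius):
    # Rescale g to Euclidean norm `radius` (zero vector if g is zero).
    norm = math.sqrt(sum(v * v for v in g))
    return [radius * v / norm for v in g] if norm > 0.0 else [0.0] * len(g)

def sharpness_aware_grad(theta, rho_param=1e-2, rho_image=1e-2):
    def loss(th):
        return -reward(generate(th))

    # Step 1 (parameter space, SAM-like): first-order worst-case shift of theta.
    eps = scaled_unit(num_grad(loss, theta), rho_param)
    theta_adv = [t + e for t, e in zip(theta, eps)]

    # Step 2 (image space, AT-like): worst-case shift of the generated sample.
    image = generate(theta_adv)
    delta = scaled_unit(num_grad(lambda x: -reward(x), image), rho_image)

    # Loss at the doubly perturbed point, with eps and delta held fixed.
    def perturbed_loss(th):
        shifted = generate([t + e for t, e in zip(th, eps)])
        return -reward([p + d for p, d in zip(shifted, delta)])

    return num_grad(perturbed_loss, theta), eps, delta

theta = [0.1, -0.2]
grad, eps, delta = sharpness_aware_grad(theta)
# Apply the perturbed-point gradient to the ORIGINAL parameters, as in SAM.
theta_next = [t - 0.1 * g for t, g in zip(theta, grad)]
```

The design point the sketch tries to capture is that neither the generator architecture nor the reward model changes: only the point at which the reward gradient is evaluated moves. In a real pipeline the finite differences would be replaced by backpropagation through the sampler and the reward model.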
The evaluation relies predominantly on automated reward metrics (HPSv2, PickScore, ImageReward), which may share the same adversarial vulnerabilities as the training reward model, creating a circularity risk. The human study mentioned in Section 6 involves only 17 annotators and is explicitly noted as not "statistically powered for definitive conclusions," leaving the core claim of improved human alignment under-verified. The theoretical justification assumes that flat minima generalize better, a heuristic from supervised learning that may not fully transfer to reward optimization in generative models. Additionally, hyperparameter selection for the perturbation radii $\rho$ and $\rho_\omega$ appears limited to a coarse grid search over three values ($10^{-1}, 10^{-2}, 10^{-3}$).
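To give a sense of how coarse that search is, the snippet below enumerates the grid under the assumption that the two radii were swept independently over the same three values; the review cannot confirm whether they were swept jointly or one at a time, and the variable names are illustrative.

```python
from itertools import product

# The three candidate radii quoted above, one per decade.
radii = [1e-1, 1e-2, 1e-3]

# Assumption: both radii are swept independently over the same values.
grid = list(product(radii, repeat=2))

num_configs = len(grid)                # joint settings if fully crossed
decade_span = max(radii) / min(radii)  # ratio between largest and smallest radius
```

Even when fully crossed this is only nine settings spaced a full decade apart, so the sensitivity of the results to intermediate radii is not visible from the search itself.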
The paper fairly positions itself against concurrent work like Flow-GRPO, RewardDance, and GARDO (Appendix C), distinguishing RSA-FT as requiring no auxiliary modules or reward scaling. Comparisons across four RDRL frameworks (ReFL, DRaFT-K, AlignProp, DRTune) and four backbones demonstrate consistent improvements, with Table 1 showing AlignProp+HPSv2.1 improving from 24.93 to 32.02 while ImageReward rises from 0.032 to 0.528, indicating genuine multi-metric gains rather than overfitting to a single reward. The method's effectiveness diminishes somewhat on stronger baselines (ReFL shows smaller gains than AlignProp), suggesting it may primarily help when the base RDRL method is prone to hacking.
The paper provides detailed experimental configurations in Appendix D, including optimizer settings ($\beta_1=0.9, \beta_2=0.999$), learning rates ($2\times 10^{-5}$), batch sizes, and LoRA ranks. Algorithm 1 clearly specifies the forward pass and gradient computation steps. However, the paper does not explicitly mention code availability or release, which would be necessary for full reproduction. The perturbation hyperparameters $\rho = \rho_\omega = 10^{-2}$ are reported as selected via search, though the limited range tested raises questions about sensitivity.
Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models with human preferences, inspiring the development of reward-centric diffusion reinforcement learning (RDRL) to achieve similar alignment and controllability. While diffusion models can generate high-quality outputs, RDRL remains susceptible to reward hacking, where the reward score increases without corresponding improvements in perceptual quality. We demonstrate that this vulnerability arises from the non-robustness of reward model gradients, particularly when the reward landscape with respect to the input image is sharp. To mitigate this issue, we introduce methods that exploit gradients from a robustified reward model without requiring its retraining. Specifically, we employ gradients from a flattened reward model, obtained through parameter perturbations of the diffusion model and perturbations of its generated samples. Empirically, each method independently alleviates reward hacking and improves robustness, while their joint use amplifies these benefits. Our resulting framework, RSA-FT (Reward Sharpness-Aware Fine-Tuning), is simple, broadly compatible, and consistently enhances the reliability of RDRL.