Reward Sharpness-Aware Fine-Tuning for Diffusion Models

cs.LG cs.AI Kwanyoung Kim, Byeongsu Sim · Mar 22, 2026

What it does

Why it matters

The method is plug-and-play, compatible with existing RDRL frameworks like ReFL and DRaFT, and shows consistent gains across SD1.5, SDXL, SD3, and Flux backbones.

Main concern

AI Review

Plain-language introduction

This paper addresses reward hacking in reward-centric diffusion reinforcement learning (RDRL), where diffusion models exploit non-robust reward models to achieve high scores without actual perceptual quality improvements. The authors propose RSA-FT (Reward Sharpness-Aware Fine-Tuning), which mitigates hacking by flattening the reward landscape through joint perturbations in both image space (adversarial training) and parameter space (Sharpness-Aware Minimization). The method is plug-and-play, compatible with existing RDRL frameworks like ReFL and DRaFT, and shows consistent gains across SD1.5, SDXL, SD3, and Flux backbones.

Critical review

Verdict

Bottom line

RSA-FT presents a principled and empirically effective approach to mitigating reward hacking in diffusion models. By framing reward hacking as an adversarial robustness problem and leveraging the duality between SAM and AT, the authors deliver a method that consistently improves performance across SD1.5, SDXL, SD3, and Flux backbones when integrated with ReFL, DRaFT, AlignProp, and DRTune. The strong negative correlation between reward sharpness and human preference metrics (Pearson $r_{corr} = -0.802$ for PickScore) provides compelling empirical support for the core hypothesis.

“Pearson $r_{corr} = -0.802$ for PickScore”

paper · Figure 4 caption
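
For readers who want to check a correlation of this kind on their own measurements, the following is a minimal sketch of the Pearson coefficient in plain Python. The `sharpness` and `preference` lists are synthetic placeholders, not values from the paper, and the function name `pearson_r` is chosen here for illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, all left unnormalized (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Synthetic placeholder data (NOT from the paper): one sharpness value and one
# preference score per hypothetical fine-tuned checkpoint.
sharpness = [0.10, 0.25, 0.40, 0.55, 0.70]
preference = [0.90, 0.80, 0.65, 0.60, 0.40]
r = pearson_r(sharpness, preference)
print(f"Pearson r = {r:.3f}")
```

A value close to -1 means that checkpoints with sharper reward landscapes tend to score lower on the preference metric, which is the pattern the review attributes to Figure 4.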

What holds up

The theoretical framing connecting reward hacking to adversarial examples is sound and well-supported by the observation that "reward models tend to be non-robust in regions where their loss landscape is sharp" (Sec. 2). The empirical validation in Figure 4 demonstrates a strong negative correlation between sharpness $S_1$ and proxy human preference metrics. The method's simplicity and broad compatibility are major strengths; Algorithm 1 shows a clean two-step perturbation process that requires no architectural changes or reward model retraining. Ablation studies in Table 6 confirm that both image-space (AT) and parameter-space (SAM) perturbations independently improve results, with joint application yielding synergistic gains.

“we observe that reward models tend to be non-robust in regions where their loss landscape is sharp”

paper · Section 2

“When combined, they yield the largest gains, demonstrating a clear synergistic effect”

paper · Section 5.2 / Table 6
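
To make the idea of combining a parameter-space (SAM-style) perturbation with an image-space (adversarial) perturbation concrete, here is a toy scalar sketch. It is not the paper's Algorithm 1: the one-parameter `generator`, the quadratic `reward`, the names `rho_param` and `rho_image`, and the order of the two perturbations are all assumptions made for illustration.

```python
def generator(theta, z):
    # Hypothetical toy "generator": maps a scalar parameter and latent to a sample.
    return theta * z


def reward(x):
    # Hypothetical toy reward, maximized at x = 1.
    return -(x - 1.0) ** 2


def loss_grad_x(x):
    # Gradient of loss = -reward(x) with respect to the sample x.
    return 2.0 * (x - 1.0)


def unit(g, eps=1e-12):
    # Normalize a scalar gradient to unit magnitude (0 stays 0).
    return g / (abs(g) + eps)


def robust_step(theta, z, lr=0.1, rho_param=1e-2, rho_image=1e-2):
    """One update using the gradient taken at a jointly perturbed point."""
    # Step 1 (parameter space): move theta in the direction that increases loss.
    g_theta = loss_grad_x(generator(theta, z)) * z
    theta_adv = theta + rho_param * unit(g_theta)
    # Step 2 (image space): nudge the generated sample toward higher loss.
    x_adv = generator(theta_adv, z)
    x_adv = x_adv + rho_image * unit(loss_grad_x(x_adv))
    # Gradient at the perturbed point (image offset treated as a constant),
    # applied to the ORIGINAL parameters, as in sharpness-aware minimization.
    g = loss_grad_x(x_adv) * z
    return theta - lr * g


theta = 0.0
for _ in range(200):
    theta = robust_step(theta, z=1.0)
print(f"theta after training: {theta:.4f}, reward: {reward(generator(theta, 1.0)):.6f}")
```

In this toy setting the parameter settles within a small band around the reward peak. In a real diffusion pipeline the gradients would come from automatic differentiation through the reward model and the sampler rather than from the closed-form expressions used here.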

Main concerns

The evaluation relies predominantly on automated reward metrics (HPSv2, PickScore, ImageReward), which may share the same adversarial vulnerabilities as the training reward model, creating a circularity risk. The human study mentioned in Section 6 involves only 17 annotators and is explicitly noted as not "statistically powered for definitive conclusions," leaving the core claim of improved human alignment under-verified. The theoretical justification assumes that flat minima generalize better, a heuristic from supervised learning that may not fully transfer to reward optimization in generative models. Additionally, hyperparameter selection for the perturbation radii $\rho$ and $\rho_\omega$ appears limited to a coarse grid search over three values ($10^{-1}, 10^{-2}, 10^{-3}$).

“it is limited in scale (17 evaluators) and not statistically powered for definitive conclusions”

paper · Section 6

“we searched $\rho, \rho_w \in \{10^{-1}, 10^{-2}, 10^{-3}\}$ and found both optimal at $10^{-2}$”

paper · Section 5.2
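
The coarseness concern is easier to see when the search is written out. The sketch below enumerates the nine combinations implied by three values per radius and shows how a denser log-spaced grid could be generated for a sensitivity check. The `toy_score` function is a placeholder standing in for a full fine-tune-and-evaluate run; it is not a result from the paper, and the names `rho_image` and `rho_param` are illustrative.

```python
import itertools
import math


def grid_search(score_fn, radii=(1e-1, 1e-2, 1e-3)):
    """Score every (rho_image, rho_param) pair and return the best pair."""
    results = {
        (rho_image, rho_param): score_fn(rho_image, rho_param)
        for rho_image, rho_param in itertools.product(radii, repeat=2)
    }
    best = max(results, key=results.get)
    return best, results


def log_grid(low_exp, high_exp, points_per_decade):
    """Log-spaced values from 10**low_exp to 10**high_exp, inclusive."""
    n = (high_exp - low_exp) * points_per_decade
    return [10 ** (low_exp + i / points_per_decade) for i in range(n + 1)]


def toy_score(rho_image, rho_param):
    # Placeholder objective (NOT a measured result): in practice this would
    # fine-tune with the given radii and return a held-out preference metric.
    return -abs(math.log10(rho_image) + 2) - abs(math.log10(rho_param) + 2)


best, results = grid_search(toy_score)
print(f"{len(results)} configurations evaluated, best = {best}")
print("denser grid:", [f"{v:.4g}" for v in log_grid(-3, -1, 2)])
```

With one point per decade the grid cannot distinguish an optimum at $10^{-2}$ from one at, say, $3 \times 10^{-3}$; adding intermediate points would show how flat the response is around the reported setting.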

Evidence and comparison

The paper fairly positions itself against concurrent work like Flow-GRPO, RewardDance, and GARDO (Appendix C), distinguishing RSA-FT as requiring no auxiliary modules or reward scaling. Comparisons across four RDRL frameworks (ReFL, DRaFT-K, AlignProp, DRTune) and four backbones demonstrate consistent improvements, with Table 1 showing AlignProp+HPSv2.1 improving from 24.93 to 32.02 while ImageReward rises from 0.032 to 0.528, indicating genuine multi-metric gains rather than overfitting to a single reward. The method's effectiveness diminishes somewhat on stronger baselines (ReFL shows smaller gains than AlignProp), suggesting it may primarily help when the base RDRL method is prone to hacking.

“Alignprop 24.93 20.21 0.032 + Ours 32.02 (+7.09) 21.53 (+1.32) 0.528 (+0.49)”

paper · Table 1 (HPD dataset)

Reproducibility

The paper provides detailed experimental configurations in Appendix D, including optimizer settings ($\beta_1=0.9, \beta_2=0.999$), learning rates ($2\times 10^{-5}$), batch sizes, and LoRA ranks. Algorithm 1 clearly specifies the forward pass and gradient computation steps. However, the paper does not explicitly mention code availability or release, which would be necessary for full reproduction. The perturbation hyperparameters $\rho = \rho_\omega = 10^{-2}$ are reported as discovered via search, though the limited range tested raises questions about sensitivity.

“We perform all fine-tuning experiments using the AdamW optimizer... $\beta_1 = 0.9$, $\beta_2 = 0.999$, weight decay $\lambda = 1\times 10^{-4}$”

paper · Appendix D
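
For anyone attempting a reproduction, the settings quoted in this section can be collected in one place. The following is a sketch of such a configuration object; the class and field names are hypothetical, and batch sizes and LoRA ranks are left out because the review does not state their values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    # Values as reported in the review of Appendix D and Section 5.2.
    learning_rate: float = 2e-5
    beta1: float = 0.9
    beta2: float = 0.999
    weight_decay: float = 1e-4
    # Both perturbation radii are reported as 1e-2; the field names are
    # illustrative and do not come from the paper.
    rho_image: float = 1e-2
    rho_param: float = 1e-2

    def adamw_kwargs(self):
        """Keyword arguments in the form accepted by torch.optim.AdamW."""
        return {
            "lr": self.learning_rate,
            "betas": (self.beta1, self.beta2),
            "weight_decay": self.weight_decay,
        }


config = FineTuneConfig()
print(config.adamw_kwargs())
```

Assuming PyTorch is the training framework, the dictionary could be passed as `torch.optim.AdamW(trainable_params, **config.adamw_kwargs())`, where `trainable_params` is whatever parameter set is being fine-tuned.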

Abstract

Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models with human preferences, inspiring the development of reward-centric diffusion reinforcement learning (RDRL) to achieve similar alignment and controllability. While diffusion models can generate high-quality outputs, RDRL remains susceptible to reward hacking, where the reward score increases without corresponding improvements in perceptual quality. We demonstrate that this vulnerability arises from the non-robustness of reward model gradients, particularly when the reward landscape with respect to the input image is sharp. To mitigate this issue, we introduce methods that exploit gradients from a robustified reward model without requiring its retraining. Specifically, we employ gradients from a flattened reward model, obtained through parameter perturbations of the diffusion model and perturbations of its generated samples. Empirically, each method independently alleviates reward hacking and improves robustness, while their joint use amplifies these benefits. Our resulting framework, RSA-FT (Reward Sharpness-Aware Fine-Tuning), is simple, broadly compatible, and consistently enhances the reliability of RDRL.
