MS-CustomNet: Controllable Multi-Subject Customization with Hierarchical Relational Semantics

cs.CV Pengxiang Cai, Mengyang Li · Mar 22, 2026


Plain-language introduction

MS-CustomNet tackles multi-subject customization for text-to-image diffusion models, where the challenge is to preserve multiple subject identities while controlling their compositional arrangement and spatial relationships. The authors propose a framework built on CustomNet that accepts multiple reference images plus a layout map $M_L$ specifying spatial arrangement, trained on a curated MSI dataset derived from COCO. The work aims to provide explicit deterministic control over subject placement and layering (e.g., "cake inside bowl" vs "cake behind bowl") rather than relying on implicit text-to-image generation.

Critical review

Verdict

Bottom line

MS-CustomNet presents a reasonable architectural extension of CustomNet to multi-subject scenarios with explicit layout control, but the evaluation has significant methodological weaknesses. The paper excludes any comparison with MS-Diffusion, the current state of the art for layout-guided multi-subject personalization, on the incorrect claim that it relies on "large-scale proprietary datasets." MS-Diffusion (Wang et al., arXiv:2406.07209) is publicly available and cites no proprietary training data. On the reported metrics, MS-CustomNet underperforms SSR-Encoder on YOLO-Subj (0.68 vs 0.71), and its multi-subject DINO-I score of 0.61, while better than λ-ECLIPSE, is notably lower than single-subject baselines. The abstract's claim of "superior capability" is therefore somewhat misleading: the method trades identity preservation for layout control.

“YOLO-Subj: 0.68 vs SSR-Encoder's 0.71”

MS-CustomNet paper · Section IV-A, Table I

“State-of-the-art methods such as MS-Diffusion are excluded from direct comparison in this study because they rely on large-scale proprietary datasets that are not publicly available”

MS-CustomNet paper · Introduction

“The project page is https://MS-Diffusion.github.io”

MS-Diffusion paper · Abstract

What holds up

The architectural design for explicit compositional control is sound. The category-aware projection network, which concatenates CLIP visual features $\mathbf{f}_{v,k}$ with category embeddings $\mathbf{e}_{c,k}$ via $\mathbf{f}_{s,k} = F_{\text{proj}}([\mathbf{f}_{v,k}; \mathbf{e}_{c,k}])$, provides a principled way to handle multi-subject conditioning. The Dual Stage Training (DST) and Curriculum Learning on Subject Quantity (CLSQ) strategies are reasonable training stabilizers, and the ablation study in Fig. 5 shows that CLSQ recovers CLIP-B scores after layout guidance initially degrades them. The YOLO-L score of 0.94 indicates stronger spatial localization than SSR-Encoder's 0.91, supporting the core claim of precise spatial control.

“$\mathbf{f}_{s,k} = F_{\text{proj}}([\mathbf{f}_{v,k}; \mathbf{e}_{c,k}])$”

MS-CustomNet paper · Section III-B, Eq. 8
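
To make Eq. 8 concrete, the following is a minimal sketch of a category-aware projection, assuming $F_{\text{proj}}$ is a single linear layer. The paper's actual projection network, its dimensions, and its category embedding table are not specified in this review, so the toy sizes (`VISUAL_DIM`, `CATEGORY_DIM`, `SUBJECT_DIM`), the one-hot category vectors, and the helper names are hypothetical.

```python
import random

# Toy sizes chosen for illustration only; the real model uses CLIP-sized features.
VISUAL_DIM = 8
CATEGORY_DIM = 4
SUBJECT_DIM = 6

def make_projection(in_dim, out_dim, seed=0):
    """Create a random linear layer: a list of weight rows plus a zero bias."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias

def project_subject(f_v, e_c, proj):
    """Compute f_s = F_proj([f_v ; e_c]), with F_proj assumed to be linear."""
    weights, bias = proj
    concat = list(f_v) + list(e_c)  # concatenation [f_v ; e_c]
    return [sum(w * x for w, x in zip(row, concat)) + b
            for row, b in zip(weights, bias)]

proj = make_projection(VISUAL_DIM + CATEGORY_DIM, SUBJECT_DIM)
f_v = [0.5] * VISUAL_DIM            # stand-in for a CLIP visual feature
e_cake = [1.0, 0.0, 0.0, 0.0]       # stand-in category embeddings
e_bowl = [0.0, 1.0, 0.0, 0.0]

f_s_cake = project_subject(f_v, e_cake, proj)
f_s_bowl = project_subject(f_v, e_bowl, proj)
```

Even with identical visual features, the two subject embeddings differ because the category embedding enters the projection, which is the property that lets the conditioning distinguish subjects by class.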

“The staged learning process of CLSQ enables the model to first master foundational aspects before progressing to more nuanced tasks”

MS-CustomNet paper · Section IV-C, Fig. 5

Main concerns

The primary methodological flaw is the exclusion of MS-Diffusion from comparisons on false premises. The claim that MS-Diffusion uses "proprietary datasets" is factually incorrect: the paper is publicly available on arXiv with a project page. This omission lets MS-CustomNet avoid comparison against a method that achieves state-of-the-art results with similar layout guidance. Additionally, the abstract champions a DINO-I score of 0.61 as demonstrating "superior capability," but Table I shows this is significantly lower than single-subject CustomNet (0.77), so it represents a trade-off rather than a pure improvement. The training regime of only 10 epochs on 14,537 COCO-derived images is extremely lightweight by diffusion-model standards, raising questions about generalization beyond COCO categories. Finally, on YOLO-Subj MS-CustomNet underperforms SSR-Encoder (0.68 vs 0.71), indicating weaker subject category accuracy despite better localization.

“MS-CustomNet YOLO-Subj: 0.68, SSR-Encoder YOLO-Subj: 0.71”

MS-CustomNet paper · Table I

“MS-Diffusion are excluded... because they rely on large-scale proprietary datasets”

MS-CustomNet paper · Section IV
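
The size of these gaps is easier to judge as relative differences. The short calculation below uses only the scores quoted in this review (DINO-I 0.61 vs 0.77, YOLO-Subj 0.68 vs 0.71, YOLO-L 0.94 vs 0.91); the function name and dictionary layout are illustrative.

```python
def relative_gap_percent(ours, other):
    """Relative difference of `ours` with respect to `other`, in percent."""
    return 100.0 * (ours - other) / other

# (MS-CustomNet score, comparison score), as quoted in the review.
comparisons = {
    "DINO-I vs single-subject CustomNet": (0.61, 0.77),
    "YOLO-Subj vs SSR-Encoder": (0.68, 0.71),
    "YOLO-L vs SSR-Encoder": (0.94, 0.91),
}

gaps = {name: round(relative_gap_percent(ours, other), 1)
        for name, (ours, other) in comparisons.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f}%")
```

The identity-preservation drop relative to single-subject CustomNet (about 21%) is several times larger than the localization gain over SSR-Encoder (about 3%), which is why the result reads as a trade-off.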

Evidence and comparison

The evidence supports layout control efficacy (YOLO-L 0.94) but not overall superiority. The comparison to SSR-Encoder and λ-ECLIPSE is fair but omits the most relevant competitor. The MSI dataset construction from COCO using an area threshold $\beta=0.015$ and a subject count $N(y_s) \geq \alpha$ is transparent, though the resulting set of roughly 14.5k images is small compared to typical diffusion training sets. The claim that MS-CustomNet offers "explicit, reproducible control" is supported by the layout map $M_L$ formulation, but the paper does not demonstrate this on out-of-distribution subjects beyond standard COCO categories.

“$\frac{\text{Area}(s_k)}{\text{Area}(y_s)} > \beta$ where $\beta=0.015$, and $N(y_s) \geq \alpha$ where $\alpha=2$”

MS-CustomNet paper · Section III-A, Eq. 1-2
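
The two filtering rules can be sketched as below. This is an illustration under stated assumptions, not the authors' pipeline: it assumes $\text{Area}(y_s)$ is the full image area, that $N(y_s)$ counts the subjects remaining after the area filter, and that per-subject areas are already available as numbers (COCO instance annotations carry an `area` field that could supply them). The function names are hypothetical.

```python
BETA = 0.015   # minimum subject-to-image area ratio (from the paper)
ALPHA = 2      # minimum number of qualifying subjects (from the paper)

def keep_subjects(image_area, subject_areas, beta=BETA):
    """Return the subject areas whose ratio to the image area exceeds beta."""
    return [a for a in subject_areas if a / image_area > beta]

def keep_image(image_area, subject_areas, alpha=ALPHA, beta=BETA):
    """Keep an image only if at least alpha subjects pass the area filter.

    Assumption: the count is taken after area filtering.
    """
    return len(keep_subjects(image_area, subject_areas, beta)) >= alpha

# Example: a 640x480 image, so a subject needs more than 4608 px^2 to qualify.
image_area = 640 * 480
two_large_one_tiny = [50000, 6000, 1200]
one_large_one_tiny = [50000, 1200]
```

Here `keep_image(image_area, two_large_one_tiny)` holds because two subjects clear the threshold, while `one_large_one_tiny` leaves only one qualifying subject and the image is dropped.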

“SSR-Encoder may struggle to disambiguate... cake-in-bowl versus cake-behind-bowl scenarios”

MS-CustomNet paper · Section IV-B2

Reproducibility

Reproducibility is mixed. The paper states that the implementation builds on a "publicly available checkpoint of CustomNet" and was trained on two V100 GPUs with AdamW, FP16 automatic mixed precision, and an effective batch size of 4. Hyperparameters are specified: 10 total epochs (7 at $\eta_1=1\times10^{-4}$, 3 at $\eta_2=5\times10^{-5}$) and a curriculum running from $K_{\text{min}}=2$ to $K_{\text{max}}=5$ with pace $\gamma=1.0$. However, the MSI dataset preparation code is not released, and because the dataset depends on COCO annotations, anyone reproducing it must replicate the filtering pipeline exactly. No code repository link is provided in the text. The MSIBench benchmark construction, which uses GPT for background prompts, is described but not released.

“Our implementation is built upon the publicly available checkpoint of CustomNet... trained using a Dual Stage Training regimen for a total of 10 epochs”

MS-CustomNet paper · Section IV-A

“$K_{\text{min}}=2$ to $K_{\text{max}}=5$, governed by a curriculum pace of $\gamma=1.0$”

MS-CustomNet paper · Section IV-A
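
The reported hyperparameters are enough to sketch the schedule, with one important caveat: the review does not give the functional form of the CLSQ curriculum. The learning-rate step below follows the stated 7 + 3 epoch split; the curriculum function is an assumption (a power-law pacing over all 10 epochs, which is linear at $\gamma=1.0$), and both function names are hypothetical.

```python
def learning_rate(epoch, stage1_epochs=7, eta1=1e-4, eta2=5e-5):
    """Dual Stage Training step schedule: eta1 for the first 7 epochs, then eta2."""
    return eta1 if epoch < stage1_epochs else eta2

def max_subjects(epoch, total_epochs=10, k_min=2, k_max=5, gamma=1.0):
    """Assumed curriculum: the subject cap grows from k_min to k_max.

    The pacing formula (progress ** gamma over the whole run) is a guess
    made for illustration; the paper's exact rule may differ.
    """
    progress = epoch / (total_epochs - 1)
    return k_min + round((k_max - k_min) * progress ** gamma)

# Per-epoch view of the two schedules (epochs are 0-indexed).
schedule = [(epoch, learning_rate(epoch), max_subjects(epoch)) for epoch in range(10)]
```

Under these assumptions training starts with at most two subjects per sample and reaches five by the final epoch, with the learning rate halving at epoch 7. A released training script would be needed to confirm how the curriculum and the two stages actually interleave.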

Abstract

Diffusion-based text-to-image generation has advanced significantly, yet customizing scenes with multiple distinct subjects while maintaining fine-grained control over their interactions remains challenging. Existing methods often struggle to provide explicit user-defined control over the compositional structure and precise spatial relationships between subjects. To address this, we introduce MS-CustomNet, a novel framework for multi-subject customization. MS-CustomNet allows zero-shot integration of multiple user-provided objects and, crucially, empowers users to explicitly define these hierarchical arrangements and spatial placements within the generated image. Our approach ensures individual subject identity preservation while learning and enacting these user-specified inter-subject compositions. We also present the MSI dataset, derived from COCO, to facilitate training on such complex multi-subject compositions. MS-CustomNet offers enhanced, fine-grained control over multi-subject image generation. Our method achieves a DINO-I score of 0.61 for identity preservation and a YOLO-L score of 0.94 for positional control in multi-subject customization tasks, demonstrating its superior capability in generating high-fidelity images with precise, user-directed multi-subject compositions and spatial control.
