MS-CustomNet: Controllable Multi-Subject Customization with Hierarchical Relational Semantics
MS-CustomNet tackles multi-subject customization for text-to-image diffusion models, where the challenge is to preserve multiple subject identities while controlling their compositional arrangement and spatial relationships. The authors propose a framework built on CustomNet that accepts multiple reference images plus a layout map $M_L$ specifying spatial arrangement, trained on a curated MSI dataset derived from COCO. The work aims to provide explicit deterministic control over subject placement and layering (e.g., "cake inside bowl" vs "cake behind bowl") rather than relying on implicit text-to-image generation.
MS-CustomNet presents a reasonable architectural extension of CustomNet to multi-subject scenarios with explicit layout control, but the evaluation has significant methodological weaknesses. The paper excludes comparison with MS-Diffusion—the current state-of-the-art for layout-guided multi-subject personalization—based on the incorrect claim that it relies on "large-scale proprietary datasets." MS-Diffusion (Wang et al., arXiv:2406.07209) is publicly available and cites no proprietary training data. The evaluation metrics show MS-CustomNet actually underperforms SSR-Encoder on YOLO-Subj (0.68 vs 0.71), and the DINO-I score of 0.61 for multi-subject—while better than λ-ECLIPSE—is notably lower than single-subject baselines, making the abstract's claim of "superior capability" somewhat misleading when the method trades identity preservation for layout control.
The architectural design for explicit compositional control is sound. The category-aware projection network concatenates CLIP visual features $\mathbf{f}_{v,k}$ with category embeddings $\mathbf{e}_{c,k}$ via $\mathbf{f}_{s,k} = F_{\text{proj}}([\mathbf{f}_{v,k}; \mathbf{e}_{c,k}])$, which provides a principled way to handle multi-subject conditioning. The Dual Stage Training (DST) and Curriculum Learning on Subject Quantity (CLSQ) strategies are reasonable training stabilizers, with the ablation study in Fig. 5 showing that CLSQ recovers CLIP-B scores after layout guidance initially degrades them. The YOLO-L score of 0.94 does indicate stronger spatial localization than SSR-Encoder's 0.91, supporting the core claim of precise spatial control.
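The projection formula above can be illustrated with a minimal sketch. It assumes $F_{\text{proj}}$ is a single linear layer and uses toy dimensions; the review does not state the actual depth or widths of the network, so `make_linear`, `linear`, and `project_subjects` are hypothetical names chosen for illustration only.

```python
import random


def make_linear(in_dim, out_dim, seed=0):
    """Randomly initialised weights for one linear layer (an assumed stand-in for F_proj)."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def linear(x, weight, bias):
    """Compute W x + b for a single input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def project_subjects(visual_feats, category_embs, weight, bias):
    """Return one subject token f_{s,k} per subject k."""
    if len(visual_feats) != len(category_embs):
        raise ValueError("need exactly one category embedding per subject")
    tokens = []
    for f_v, e_c in zip(visual_feats, category_embs):
        joint = list(f_v) + list(e_c)  # the concatenation [f_{v,k}; e_{c,k}]
        tokens.append(linear(joint, weight, bias))
    return tokens


# Toy example: K = 2 subjects, 4-d visual features, 2-d category embeddings.
visual_feats = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
category_embs = [[1.0, 0.0], [0.0, 1.0]]
weight, bias = make_linear(in_dim=6, out_dim=3)
subject_tokens = project_subjects(visual_feats, category_embs, weight, bias)
print(len(subject_tokens), len(subject_tokens[0]))  # 2 3
```

Each subject gets its own token while sharing the same projection weights, which is what lets the conditioning scale with the number of subjects.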
The primary methodological flaw is the exclusion of MS-Diffusion from comparisons on a false premise. The claim that MS-Diffusion uses "proprietary datasets" is factually incorrect: the paper is publicly available on arXiv with a project page and cites no proprietary training data. This omission allows MS-CustomNet to avoid comparison against a method that achieves state-of-the-art results with similar layout guidance. Additionally, the abstract presents a DINO-I score of 0.61 as demonstrating "superior capability," but Table I shows this is substantially lower than single-subject CustomNet (0.77) and represents a trade-off rather than a pure improvement. The training regime of only 10 epochs on 14,537 COCO-derived images is extremely lightweight by diffusion-model standards, raising questions about generalization beyond COCO categories. Finally, on the YOLO-Subj metric MS-CustomNet underperforms SSR-Encoder (0.68 vs 0.71), indicating inferior subject category accuracy despite better localization.
The evidence supports layout control efficacy (YOLO-L 0.94) but not overall superiority. The comparison to SSR-Encoder and λ-ECLIPSE is fair but omits the most relevant competitor. The MSI dataset construction from COCO using area thresholds $\beta=0.015$ and subject counts $N(y_s) \geq \alpha$ is transparent, though the resulting set of roughly 14k images is small compared to typical diffusion training sets. The claim that MS-CustomNet offers "explicit, reproducible control" is validated by the layout map $M_L$ formulation, but the paper fails to demonstrate this on out-of-distribution subjects beyond standard COCO categories.
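To make the filtering criteria concrete, here is a rough sketch of how such a filter might look. It rests on several assumptions not confirmed by the review: that $\beta$ is a minimum fraction of the image area, that $N(y_s)$ counts the subjects in an image that pass that threshold, and that `alpha=2` is only a placeholder (the review gives no value for $\alpha$). The `Annotation` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    category: str
    area: float  # mask area in pixels, as in COCO-style annotations


def select_subjects(annotations, image_width, image_height, beta=0.015):
    """Keep annotations whose area is at least beta times the image area (assumed meaning of beta)."""
    image_area = image_width * image_height
    return [a for a in annotations if a.area / image_area >= beta]


def keep_image(annotations, image_width, image_height, beta=0.015, alpha=2):
    """Keep an image if at least alpha subjects pass the area filter.

    alpha=2 is a placeholder; the actual value is not given in the review.
    """
    return len(select_subjects(annotations, image_width, image_height, beta)) >= alpha


# Toy 100x100 image: two large subjects and one that is too small.
annotations = [
    Annotation("cake", 500.0),
    Annotation("bowl", 200.0),
    Annotation("spoon", 100.0),
]
print(keep_image(annotations, 100, 100))  # True
```

Because small changes to either threshold change which images survive, releasing the exact pipeline matters for reproducing the 14,537-image set.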
Reproducibility is mixed. The paper states that the implementation is based on a "publicly available checkpoint of CustomNet", with training on two V100 GPUs using AdamW, FP16 automatic mixed precision, and an effective batch size of 4. Hyperparameters are specified: 10 total epochs (7 at $\eta_1=1\times10^{-4}$, 3 at $\eta_2=5\times10^{-5}$) and curriculum parameters $K_{\text{min}}=2$ to $K_{\text{max}}=5$ with $\gamma=1.0$. However, the MSI dataset preparation code is not released, and the reliance on COCO annotations means the exact filtering pipeline must be replicated precisely. No code repository link is provided in the text. The MSIBench benchmark construction using GPT for background prompts is described but not released.
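The reported schedule can be written down in a few lines, which may help anyone attempting a reproduction. The learning-rate step follows the stated numbers directly. The curriculum function is only a guess: the review does not give the formula that maps training progress to a subject count, so `max_subjects` assumes a simple power-law ramp from $K_{\text{min}}$ to $K_{\text{max}}$ with exponent $\gamma$ over all epochs, and both function names are hypothetical.

```python
import math


def learning_rate(epoch, first_phase_epochs=7, total_epochs=10, eta1=1e-4, eta2=5e-5):
    """Step schedule from the stated numbers: 7 epochs at eta1, then 3 at eta2 (0-indexed epochs)."""
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch out of range")
    return eta1 if epoch < first_phase_epochs else eta2


def max_subjects(epoch, total_epochs=10, k_min=2, k_max=5, gamma=1.0):
    """Hypothetical curriculum: ramp the subject cap from k_min to k_max.

    The ramp shape is an assumption; gamma=1.0 makes it linear in training progress.
    """
    if total_epochs <= 1:
        return k_max
    progress = epoch / (total_epochs - 1)
    k = k_min + math.floor((k_max - k_min) * progress ** gamma)
    return min(k, k_max)


for epoch in range(10):
    print(epoch, learning_rate(epoch), max_subjects(epoch))
```

Whether the curriculum spans all 10 epochs or only part of training is not stated, which is another detail a released codebase would settle.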
Diffusion-based text-to-image generation has advanced significantly, yet customizing scenes with multiple distinct subjects while maintaining fine-grained control over their interactions remains challenging. Existing methods often struggle to provide explicit user-defined control over the compositional structure and precise spatial relationships between subjects. To address this, we introduce MS-CustomNet, a novel framework for multi-subject customization. MS-CustomNet allows zero-shot integration of multiple user-provided objects and, crucially, empowers users to explicitly define these hierarchical arrangements and spatial placements within the generated image. Our approach ensures individual subject identity preservation while learning and enacting these user-specified inter-subject compositions. We also present the MSI dataset, derived from COCO, to facilitate training on such complex multi-subject compositions. MS-CustomNet offers enhanced, fine-grained control over multi-subject image generation. Our method achieves a DINO-I score of 0.61 for identity preservation and a YOLO-L score of 0.94 for positional control in multi-subject customization tasks, demonstrating its superior capability in generating high-fidelity images with precise, user-directed multi-subject compositions and spatial control.