DMMRL: Disentangled Multi-Modal Representation Learning via Variational Autoencoders for Molecular Property Prediction

Long Xu, Junping Guo, Jianbo Zhao, Jianbo Lu, Yuzhong Peng · Mar 22, 2026 · cs.LG, cs.AI

AI Review

Plain-language introduction

DMMRL tackles molecular property prediction by addressing two key challenges: entangled representations that obscure structure-property relationships and naïve multi-modal fusion that ignores inter-modal dependencies. The method uses variational autoencoders to decompose graph, sequence, and geometry features into shared (structure-relevant) and private (modality-specific) latent subspaces, enforcing orthogonality between them. A gated attention mechanism then fuses only the shared representations for downstream prediction.
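To make the two mechanisms in this summary concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. It assumes the orthogonality constraint is the squared Frobenius norm of the batch cross-product between shared and private codes (a common choice in shared/private models; the paper's exact form is not quoted here), and that the gate produces one scalar score per modality, normalized with a softmax. The names `orthogonality_penalty`, `gated_fusion`, and `gate_scores` are hypothetical.

```python
import math


def orthogonality_penalty(shared, private):
    """Squared Frobenius norm of shared^T @ private over a batch.

    shared:  list of B vectors, each of length d_s
    private: list of B vectors, each of length d_p
    The value is zero when every shared dimension has zero dot product
    with every private dimension across the batch.
    """
    d_s, d_p = len(shared[0]), len(private[0])
    total = 0.0
    for i in range(d_s):
        for j in range(d_p):
            cross = sum(s[i] * p[j] for s, p in zip(shared, private))
            total += cross * cross
    return total


def gated_fusion(shared_by_modality, gate_scores):
    """Softmax-weighted sum of per-modality shared vectors.

    shared_by_modality: one shared vector per modality (graph, sequence, geometry)
    gate_scores: hypothetical scalar scores that a learned gate would output
    Private codes are deliberately not passed in: only shared codes are fused.
    """
    top = max(gate_scores)
    exps = [math.exp(g - top) for g in gate_scores]  # shift for numerical stability
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(shared_by_modality[0])
    fused = [
        sum(w * vec[k] for w, vec in zip(weights, shared_by_modality))
        for k in range(dim)
    ]
    return fused, weights
```

With equal gate scores the fusion reduces to a plain average of the three shared vectors. A learned gate would shift the weights toward whichever modality is more informative for a given molecule.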

Critical review

Verdict

Bottom line

DMMRL presents a principled approach to multi-modal molecular representation learning with modest but generally consistent empirical gains. The disentanglement framework is theoretically sound, and the ablation studies indicate that the proposed regularization terms (KL, MMD, alignment, orthogonality) each contribute incrementally to performance. However, the evaluation relies exclusively on random splits rather than the more challenging scaffold splits, and the claimed superiority over SGGRL, the closest baseline, depends on both methods using the same encoders, which the paper asserts but does not rigorously verify. The improvements are also often marginal (e.g., a BBBP improvement of 0.001 ROC-AUC over SGGRL).

What holds up

The ablation study provides convincing evidence that the variational bottleneck (BOT configuration) substantially outperforms a naive prediction pathway (LBL), particularly on ESOL, where RMSE drops from 0.614 to 0.563. The complete DMMRL configuration improves results further and notably reduces prediction variance across runs, suggesting that the disentanglement constraints enhance model stability. The gated attention fusion is a sensible architectural choice because it weights modalities adaptively rather than relying on static concatenation.

“The transition from LBL to BOT yielded particularly substantial gains across multiple datasets (3.5 percentage point ROC-AUC improvement for BBBP; 8.3% RMSE reduction for ESOL)”

DMMRL paper · Table III
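As a quick consistency check, the quoted 8.3% figure can be reproduced from the ESOL RMSE values cited in this review (0.614 for LBL, 0.563 for BOT). The helper name `relative_reduction` is illustrative. The BBBP figure cannot be checked the same way because the underlying ROC-AUC values are not quoted here.

```python
def relative_reduction(before, after):
    """Fractional drop from `before` to `after`.

    For an error metric such as RMSE, a positive result is an improvement.
    """
    return (before - after) / before


# ESOL RMSE for the LBL and BOT configurations, as cited in the review text.
esol_drop = relative_reduction(0.614, 0.563)
print(f"ESOL RMSE reduction: {esol_drop:.1%}")  # prints "ESOL RMSE reduction: 8.3%"
```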

Main concerns

First, the evaluation uses only random splitting (8:1:1 ratio), which is known to overestimate performance in drug discovery settings where models must generalize to novel chemical scaffolds. The authors acknowledge this limitation in the conclusion but do not address it experimentally. Second, the claim that DMMRL uses exactly the same encoders as SGGRL [24], a critical assertion for a fair comparison, is questionable: SGGRL uses GIN-based graph encoders whereas DMMRL uses CMPNN. Third, the ClinTox result in Table II (0.935±0.021) is lower than SGGRL's 0.956±0.016; the text's claim of best performance on five of seven datasets is technically true but leaves this underperformance unremarked. Finally, the interpretability claims remain largely theoretical: no qualitative analysis (e.g., visualizing which substructures activate specific shared latent dimensions) is provided to show that disentanglement actually yields mechanistic insight.

“ClinTox: SGGRL 0.956±0.016 vs DMMRL 0.935±0.021”

DMMRL paper · Table II

“datasets were partitioned using random splitting according to established MoleculeNet protocols, with an 8:1:1 ratio”

DMMRL paper · Section III-A
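For contrast with the random 8:1:1 protocol, the sketch below shows a common greedy recipe for a scaffold-grouped split: molecules that share a scaffold are kept in the same partition, so the test set contains only scaffolds unseen in training. This is an illustration of the general idea, not something taken from the paper. In practice the grouping key is usually a Bemis-Murcko scaffold computed with a cheminformatics toolkit. Here `scaffold_of` is a hypothetical caller-supplied function so that the example stays dependency-free.

```python
from collections import defaultdict


def scaffold_split(items, scaffold_of, frac_train=0.8, frac_valid=0.1):
    """Split item indices into train/valid/test without sharing scaffolds.

    items:       any sequence of molecule records
    scaffold_of: hypothetical function mapping a record to a hashable scaffold key
    Groups are assigned whole, largest first, so actual fractions are approximate.
    """
    groups = defaultdict(list)
    for idx, item in enumerate(items):
        groups[scaffold_of(item)].append(idx)

    # Largest scaffold groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(items)
    train_cap = frac_train * n
    valid_cap = (frac_train + frac_valid) * n

    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cap:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Toy usage: ten records whose scaffold key is the second tuple field.
items = [("m%d" % i, s) for i, s in enumerate("AAAAAABBCD")]
train, valid, test = scaffold_split(items, scaffold_of=lambda it: it[1])
```

Because rare scaffolds end up in validation and test, scores under this kind of split are typically lower than under random splitting. That gap is the generalization effect the random-split evaluation cannot reveal.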

Evidence and comparison

The experimental evidence supports the utility of the disentanglement mechanism, though the magnitude of improvement varies by dataset. The comparison to MvMRL and SGGRL is generally fair, though the paper curiously omits Uni-Mol, 3DInfomax, and other recent geometry-aware pre-training methods that might provide stronger baselines. The use of InfoNCE for alignment and MMD for private space regularization follows established practices from multi-view learning literature [13,30], grounding the method in prior work. However, the reconstruction loss $\mathcal{L}_{recon}$ encourages autoencoder-style reconstruction of encoder outputs $H_m$ rather than raw inputs, which differs from standard VAE practice and may limit the information bottleneck effect.

“$\mathcal{L}_{recon} = \frac{1}{M}\sum_{m=1}^{M}\|H_{m}-\hat{H}_{m}\|_{2}^{2}$”

DMMRL paper · Equation 18
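Equation 18 can be read directly as code. The sketch below computes the mean over modalities of the squared L2 distance between each encoder output $H_m$ and its reconstruction $\hat{H}_m$. It assumes one vector per modality, because the equation as quoted does not show how batches are averaged. The function name `recon_loss` is illustrative.

```python
def recon_loss(H, H_hat):
    """Mean over M modalities of ||H_m - H_hat_m||_2^2 (Equation 18 as quoted).

    H, H_hat: lists of M vectors; vectors for different modalities may have
    different lengths, but H[m] and H_hat[m] must match.
    """
    M = len(H)
    total = 0.0
    for h_m, r_m in zip(H, H_hat):
        total += sum((h - r) ** 2 for h, r in zip(h_m, r_m))
    return total / M


# Toy values for graph, sequence, and geometry encoder outputs (hypothetical).
H = [[1.0, 2.0], [0.0, 0.0], [3.0, 3.0]]
H_hat = [[1.0, 0.0], [0.0, 0.0], [2.0, 3.0]]
loss = recon_loss(H, H_hat)  # (4 + 0 + 1) / 3
```

Note that the targets are encoder outputs, not raw molecular inputs, which is the departure from standard VAE practice raised above.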

Reproducibility

The authors state that code and data are available at https://github.com/xulong0826/DMMRL, which supports reproducibility. However, the paper omits critical hyperparameter details such as the dimensions $d_s$ and $d_p$ of the shared and private latent spaces, noting only that they are "selected according to the specific task and dataset" with details in the code. The loss coefficients ($\beta$, $\lambda$, $\gamma$, $\delta$, $\eta$) are described as "learnable parameters with initial values set to 0.1," which is unusual, since such coefficients are typically fixed hyperparameters or follow a schedule. Without explicit values or a description of how the weights evolve during training, exact reproduction is hindered. Training required approximately 2 days on an RTX 4060 Ti, which is within reach for follow-up studies.

“All regularization weights, including $\beta$ for the KL divergence term, $\lambda$ for the MMD term, $\gamma$ for the alignment loss, $\delta$ for the orthogonality constraint, and $\eta$ for the reconstruction loss, are learnable parameters with initial values set to 0.1”

DMMRL paper · Section III-A
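One reason learnable loss weights call for explicit documentation can be shown with a toy calculation. Suppose the weights enter the objective as plain multipliers in a weighted sum, with no constraint or compensating term. The quoted text does not confirm this, and the repository may handle it differently. Under that assumption, the gradient of the total with respect to each weight is just that weight's own loss term, so for non-negative losses gradient descent can only push the weight down. The function name, learning rate, and loss values below are hypothetical.

```python
def sgd_step_on_weights(weights, losses, lr=0.01):
    """One gradient-descent step on the loss weights only.

    For total = sum_i w_i * L_i, the partial derivative d(total)/d(w_i) is L_i.
    """
    return [w - lr * loss for w, loss in zip(weights, losses)]


weights = [0.1, 0.1]  # initial value of 0.1, as reported in the paper
losses = [0.5, 2.0]   # hypothetical loss magnitudes, held fixed for illustration
for _ in range(5):
    weights = sgd_step_on_weights(weights, losses)
# The weight on the larger loss term shrinks fastest and is roughly zero
# after five steps; the other has dropped to about 0.075.
```

Whether DMMRL avoids this through a constrained parameterization, clamping, or an added regularizer on the weights is exactly the kind of detail the paper would need to state for exact reproduction.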

Abstract

Molecular property prediction constitutes a cornerstone of drug discovery and materials science, necessitating models capable of disentangling complex structure-property relationships across diverse molecular modalities. Existing approaches frequently exhibit entangled representations--conflating structural, chemical, and functional factors--thereby limiting interpretability and transferability. Furthermore, conventional methods inadequately exploit complementary information from graphs, sequences, and geometries, often relying on naive concatenation that neglects inter-modal dependencies. In this work, we propose DMMRL, which employs variational autoencoders to disentangle molecular representations into shared (structure-relevant) and private (modality-specific) latent spaces, enhancing both interpretability and predictive performance. The proposed variational disentanglement mechanism effectively isolates the most informative features for property prediction, while orthogonality and alignment regularizations promote statistical independence and cross-modal consistency. Additionally, a gated attention fusion module adaptively integrates shared representations, capturing complex inter-modal relationships. Experimental validation across seven benchmark datasets demonstrates DMMRL's superior performance relative to state-of-the-art approaches. The code and data underlying this article are freely available at https://github.com/xulong0826/DMMRL.
