Data Curation for Machine Learning Interatomic Potentials by Determinantal Point Processes

stat.AP cs.LG Joanna Zou, Youssef Marzouk · Mar 23, 2026

Plain-language introduction

Training machine learning interatomic potentials (MLIPs) requires costly quantum mechanical calculations to label atomic configurations. This paper proposes using determinantal point processes (DPPs) to select diverse, informative subsets of configurations, mitigating the computational bottleneck while maintaining model accuracy. Experiments on hafnium oxide systems demonstrate that DPP-based subselection achieves competitive or superior performance compared to existing methods like k-means clustering and MaxVol, offering a probabilistic framework that naturally handles variable training set sizes.

Critical review

Verdict

Bottom line

The paper presents a well-motivated application of DPPs to training-data subselection for atomistic potentials, supported by systematic benchmarking across multiple training set sizes. The core claims (that DPPs provide competitive accuracy, improved diversity, and flexible subset selection) are substantiated by the hafnium oxide experiments, though the scope is limited to a single chemical system.

“Our work demonstrates the competitiveness of the DPP-based approach with respect to existing approaches to variable-size data subselection in terms of accuracy of the trained MLIP, diversity of sampled configurations, and generalization to predictions on unseen data.”

Zou & Marzouk · Section 5

What holds up

The theoretical grounding connecting DPPs to D-optimality principles via the Fisher information matrix provides a principled basis for the method. The empirical results show that DPP sampling consistently produces lower RMSE and higher $R^2$ values than simple random sampling and k-means across varying training set sizes $N \in \{100, 200, 400, 800, 1600, 3200, 6400\}$, with reduced variance across 100 trials. Additionally, the diversity analysis demonstrates that DPP-selected subsets cover a wider range of reference energies and force amplitudes than distance-based clustering methods.

“DPP outperforms the other variable-size data subselection methods, achieving low RMSE and high $R^2$ for each training set size $N$ with less dispersion as represented by the interquartile range.”

Zou & Marzouk · Section 4, Accuracy vs. training set size

“In Figure 3, the subsets drawn with DPP and MaxVol cover a greater range of energies and force amplitudes, whereas the subset drawn with $k$-means has marginal distinction from that drawn by SRS.”

Zou & Marzouk · Section 4, Diversity of training set
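
The D-optimality connection noted above can be illustrated in the simplest case. What follows is a sketch in our own notation (descriptor matrix $\Phi$ with one row per configuration, noise variance $\sigma^2$), assuming an L-ensemble DPP with a linear kernel and a linear-Gaussian model with homoscedastic noise; it is not taken from the paper's derivation.

$$
P(\mathbf{Y} = S) \propto \det(L_S), \qquad L = \Phi \Phi^\top, \qquad \mathcal{I}(S) = \frac{1}{\sigma^2}\, \Phi_S^\top \Phi_S
$$

$$
|S| = d: \quad \det(L_S) = \det(\Phi_S \Phi_S^\top) = \det(\Phi_S)^2 = \det(\Phi_S^\top \Phi_S) = \sigma^{2d} \det \mathcal{I}(S)
$$

Under these assumptions, a subset with a larger Fisher-information determinant receives proportionally more probability mass, and the deterministic search for a large-determinant square submatrix is approximately what MaxVol targets when $N = d$.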

Main concerns

The evaluation is narrowly focused on hafnium oxide systems, leaving open questions about transferability to the heterogeneous or multimodal datasets mentioned as future work. The comparison with MaxVol is structurally constrained: MaxVol is limited to selecting exactly $d=1160$ configurations (the descriptor dimension), forcing the authors to fix the other methods at this same size despite their flexibility, which potentially handicaps the comparison. Notably, MaxVol achieves the highest $R^2$ value in the fixed-size comparison, suggesting that when the D-optimality criterion can be directly optimized for the descriptor space, it remains competitive. The reliance on a simple linear kernel (normalized cosine similarity) without exploration of alternatives leaves room for optimization.

“MaxVol achieves an $R^2$ value closest to 1.”

Zou & Marzouk · Section 4, Accuracy vs. training set size

“Although SRS, $k$-means, and DPP are flexible to select variable set sizes, we fix the set size sampled by each of these methods to $N=d=1160$ for the remainder of the studies, in order to maintain one-to-one comparison with the MaxVol algorithm.”

Zou & Marzouk · Section 4, Experimental setup
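
To make the kernel concern above concrete, here is a minimal sketch of a normalized cosine-similarity kernel and the unnormalized DPP weight $\det(L_S)$ it assigns to a subset. It is an illustration under our own assumptions, not the paper's Julia implementation: the function names (`cosine_kernel`, `det`, `subset_score`) and the toy 2-D feature vectors are hypothetical, and only the Python standard library is used.

```python
import math

def cosine_kernel(features):
    """Gram matrix of L2-normalized feature vectors (cosine similarity)."""
    unit = []
    for v in features:
        norm = math.sqrt(sum(x * x for x in v))
        unit.append([x / norm for x in v])
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]

def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # numerically singular
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result  # a row swap flips the sign
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result

def subset_score(kernel, subset):
    """Unnormalized L-ensemble DPP weight det(L_S) for an index subset."""
    return det([[kernel[i][j] for j in subset] for i in subset])

# Toy 2-D "descriptors", made up for illustration (not data from the paper).
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
L = cosine_kernel(features)
redundant = subset_score(L, [0, 1])  # nearly parallel vectors
diverse = subset_score(L, [0, 2])    # orthogonal vectors
```

In this toy case the orthogonal pair receives a much larger weight than the nearly parallel pair, which is the diversity-promoting behavior the review attributes to the DPP. A nonlinear kernel would change how similarity is measured but not this determinant-based scoring.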

Evidence and comparison

The evidence supports the claim that DPPs outperform simple random sampling and k-means clustering in terms of both prediction accuracy and output-space diversity. However, the paper does not report statistical significance tests (e.g., p-values) for the accuracy differences, relying instead on interquartile ranges over 100 trials. The comparison to related work is generally fair but highlights a methodological gap: while DPPs offer probabilistic flexibility, the deterministic MaxVol algorithm achieves superior accuracy when restricted to the fixed size $N=d$, suggesting that the choice between methods involves trade-offs between flexibility and optimal determinant maximization.
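
One way the missing significance analysis could be approached is a permutation test on per-trial RMSE values. The sketch below is only an illustration of that idea: the function `permutation_test` is hypothetical, the RMSE lists are synthetic placeholders rather than results from the paper, and a difference-of-means statistic is one assumption among several reasonable choices.

```python
import random
import statistics

def permutation_test(a, b, n_resamples=2000, seed=0):
    """Two-sided permutation test on the absolute difference of means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # random relabeling of the pooled trials
        diff = abs(statistics.fmean(pooled[:len(a)])
                   - statistics.fmean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)

# Synthetic placeholder RMSE values for 100 trials per method.
# These are NOT numbers from the paper.
rng = random.Random(1)
rmse_method_a = [rng.gauss(1.00, 0.05) for _ in range(100)]
rmse_method_b = [rng.gauss(1.10, 0.05) for _ in range(100)]
p_value = permutation_test(rmse_method_a, rmse_method_b)
```

A permutation test makes no normality assumption about the error distribution, which suits trial-level RMSE values whose distribution is not reported. Reporting such a p-value alongside the interquartile ranges would address the gap noted above.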

Reproducibility

The authors provide sufficient implementation details for reproduction, specifying Julia packages including Determinantal.jl for DPP sampling, Maxvol.jl, and ACE.jl for descriptors. Appendix A.2 details the DFT calculation parameters (Quantum ESPRESSO with PBE functional, ONCV pseudopotential, 90 Ry cutoff). However, exact hyperparameters for the ACE potential training—such as regularization strength, solver tolerance, and random seeds for the 100 trials—are not explicitly documented, which could hinder exact numerical reproduction of the error statistics.

“Experiments were conducted in Julia using the packages PotentialLearning.jl and InteratomicPotentials.jl... Determinantal.jl... Maxvol.jl... ACE.jl... and Clustering.jl for the $k$-means algorithm.”

Zou & Marzouk · Appendix A.5

Abstract

The development of machine learning interatomic potentials faces a critical computational bottleneck with the generation and labeling of useful training datasets. We present a novel application of determinantal point processes (DPPs) to the task of selecting informative subsets of atomic configurations to label with reference energies and forces from costly quantum mechanical methods. Through experiments with hafnium oxide data, we show that DPPs are competitive with existing approaches to constructing compact but diverse training sets by utilizing kernels of molecular descriptors, leading to improved accuracy and robustness in machine learning representations of molecular systems. Our work identifies promising directions to employ DPPs for unsupervised training data curation with heterogeneous or multimodal data, or in online active learning schemes for iterative data augmentation during molecular dynamics simulation.
