Data Curation for Machine Learning Interatomic Potentials by Determinantal Point Processes
Training machine learning interatomic potentials (MLIPs) requires costly quantum mechanical calculations to label atomic configurations. This paper proposes using determinantal point processes (DPPs) to select diverse, informative subsets of configurations, mitigating the computational bottleneck while maintaining model accuracy. Experiments on hafnium oxide systems demonstrate that DPP-based subselection achieves competitive or superior performance compared to existing methods like k-means clustering and MaxVol, offering a probabilistic framework that naturally handles variable training set sizes.
The paper presents a well-motivated application of DPPs to training-data selection for atomistic potentials, supported by systematic benchmarking across multiple training set sizes. The core claims (that DPPs provide competitive accuracy, improved diversity, and flexible subset selection) are substantiated by the hafnium oxide experiments, though the scope is limited to a single chemical system.
The theoretical grounding connecting DPPs to D-optimality principles via the Fisher information matrix provides a principled basis for the method. The empirical results show that DPP sampling consistently produces lower RMSE and higher $R^2$ values than simple random sampling and k-means across varying training set sizes $N \in \{100, 200, 400, 800, 1600, 3200, 6400\}$, with reduced variance across 100 trials. Additionally, the diversity analysis demonstrates that DPP-selected subsets cover a wider range of reference energies and force amplitudes than distance-based clustering methods.
The evaluation is restricted to hafnium oxide systems, leaving open whether the method transfers to the heterogeneous or multimodal datasets mentioned as future work. The comparison with MaxVol is structurally constrained: MaxVol is limited to selecting exactly $d=1160$ configurations (the descriptor dimension), forcing the authors to fix the other methods at this same size despite their flexibility, so the head-to-head comparison covers only a single training set size. Notably, MaxVol achieves the highest $R^2$ value in the fixed-size comparison, suggesting that when the D-optimality criterion can be directly optimized for the descriptor space, it remains competitive. The reliance on a simple linear kernel (normalized cosine similarity), with no exploration of alternatives, leaves room for optimization.
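To make the kernel and determinant criterion discussed here concrete, the following is a minimal sketch in plain Python. It builds a normalized cosine-similarity kernel from descriptor vectors and then picks a subset by greedy determinant maximization. This greedy routine is a deterministic stand-in chosen for brevity; it is not the probabilistic DPP sampling used in the paper, and it is not the MaxVol algorithm. The function names `cosine_kernel` and `greedy_max_det` and the toy descriptors are illustrative assumptions, not taken from the paper or from Determinantal.jl.

```python
import math


def cosine_kernel(X):
    """Return L with L[i][j] = <x_i, x_j> / (|x_i| |x_j|) for the rows x_i of X."""
    norms = [math.sqrt(sum(v * v for v in row)) for row in X]
    unit = [[v / n for v in row] for row, n in zip(X, norms)]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


def greedy_max_det(L, k, eps=1e-12):
    """Greedily pick up to k indices so that det(L[S, S]) grows as much as possible.

    Illustrative stand-in for DPP sampling (hypothetical helper): each step adds
    the item with the largest remaining conditional variance, tracked through an
    incremental Cholesky factorization.
    """
    n = len(L)
    d2 = [L[i][i] for i in range(n)]  # multiplicative determinant gain of adding item i
    c = [[] for _ in range(n)]        # partial Cholesky rows, one per item
    selected = []
    while len(selected) < min(k, n):
        j = max((i for i in range(n) if i not in selected), key=lambda i: d2[i])
        if d2[j] < eps:
            break  # remaining items are numerically redundant with the selection
        selected.append(j)
        dj = math.sqrt(d2[j])
        for i in range(n):
            if i in selected:
                continue
            e = (L[j][i] - sum(a * b for a, b in zip(c[j], c[i]))) / dj
            c[i].append(e)
            d2[i] -= e * e
    return selected


# Toy descriptors (made up): rows 0 and 1 are nearly parallel.
descriptors = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.7, 0.7]]
kernel = cosine_kernel(descriptors)
subset = greedy_max_det(kernel, k=2)
print(subset)
```

Because the determinant of a kernel submatrix shrinks when two selected rows are nearly parallel, the greedy rule avoids picking both near-duplicate descriptors, which is the diversity property the review attributes to determinant-based selection.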
The evidence supports the claim that DPPs outperform simple random sampling and k-means clustering in terms of both prediction accuracy and output-space diversity. However, the paper does not report statistical significance tests (e.g., p-values) for the accuracy differences, relying instead on interquartile ranges over 100 trials. The comparison to related work is generally fair but highlights a methodological gap: while DPPs offer probabilistic flexibility, the deterministic MaxVol algorithm achieves superior accuracy when restricted to the fixed size $N=d$, suggesting that the choice between methods involves trade-offs between flexibility and optimal determinant maximization.
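As an illustration of the kind of significance test the paper omits, the sketch below computes the interquartile range of per-trial errors and a two-sided permutation p-value for the difference in medians between two methods. The RMSE values are synthetic numbers generated for this example only; they are not results from the paper. The helper names `iqr` and `permutation_test` are hypothetical, and the permutation test is one reasonable choice among several, not a procedure the authors describe.

```python
import random
import statistics


def iqr(values):
    """Interquartile range (Q3 - Q1) of a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def permutation_test(a, b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the difference in medians of a and b."""
    rng = random.Random(seed)
    observed = abs(statistics.median(a) - statistics.median(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign trial results to the two methods at random
        diff = abs(
            statistics.median(pooled[: len(a)]) - statistics.median(pooled[len(a):])
        )
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


# Synthetic per-trial RMSE values for two hypothetical methods (100 trials each).
data_rng = random.Random(1)
rmse_a = [data_rng.gauss(1.00, 0.05) for _ in range(100)]
rmse_b = [data_rng.gauss(1.10, 0.05) for _ in range(100)]

p_value = permutation_test(rmse_a, rmse_b)
print(f"IQR A: {iqr(rmse_a):.3f}, IQR B: {iqr(rmse_b):.3f}, p-value: {p_value:.4f}")
```

A test like this complements the reported interquartile ranges: overlapping ranges show the spread of each method, while the p-value addresses whether the gap between methods could plausibly arise from trial-to-trial randomness alone.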
The authors provide sufficient implementation details for reproduction, specifying Julia packages including Determinantal.jl for DPP sampling, Maxvol.jl, and ACE.jl for descriptors. Appendix A.2 details the DFT calculation parameters (Quantum ESPRESSO with PBE functional, ONCV pseudopotential, 90 Ry cutoff). However, exact hyperparameters for the ACE potential training—such as regularization strength, solver tolerance, and random seeds for the 100 trials—are not explicitly documented, which could hinder exact numerical reproduction of the error statistics.
The development of machine learning interatomic potentials faces a critical computational bottleneck with the generation and labeling of useful training datasets. We present a novel application of determinantal point processes (DPPs) to the task of selecting informative subsets of atomic configurations to label with reference energies and forces from costly quantum mechanical methods. Through experiments with hafnium oxide data, we show that DPPs are competitive with existing approaches to constructing compact but diverse training sets by utilizing kernels of molecular descriptors, leading to improved accuracy and robustness in machine learning representations of molecular systems. Our work identifies promising directions to employ DPPs for unsupervised training data curation with heterogeneous or multimodal data, or in online active learning schemes for iterative data augmentation during molecular dynamics simulation.