Structural bias in three-dimensional autoregressive generative machine learning of organic molecules
Abstract
A diverse range of generative machine learning models for the design of novel molecules and materials has been proposed in recent years. Models that are able to generate three-dimensional structures are particularly suitable for quantum chemistry workflows, enabling direct property prediction. The performance of generative models is typically assessed based on their ability to produce high rates of novel, valid, and unique molecules. However, equally important is the ability of generative models to learn the prevalence of functional groups and certain chemical moieties in the underlying training data, that is, to faithfully reproduce the chemical space spanned by the training data. Here we investigate the ability of the autoregressive generative machine learning model G-SchNet to reproduce the chemical space and property distributions of training datasets composed of large, functional organic molecules. We assess the elemental composition, the size and bond-length distributions, as well as the functional group and chemical space distributions of training and generated molecules. By principal component analysis of the chemical space, we find that the model leads to a biased generation that is largely unaffected by the choice of hyperparameters or the training dataset distribution, producing molecules that are, on average, more unsaturated and contain more heteroatoms. Purely aliphatic molecules are mostly absent in the generation. We further investigate generation with functional group constraints and generation based on composite datasets, both of which can help partially remedy the model generation bias. Decision tree models can recognize the generation bias in the models and discriminate between training and generated data, revealing key chemical differences between the two sets. The chemical differences we find affect the distributions of electronic properties such as the HOMO-LUMO gap, which is a common target for functional molecule design.
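As a rough illustration of the kind of composition-level comparison described above, the following minimal Python sketch computes the mean degree of unsaturation and the mean heteroatom fraction for two small sets of molecules. It is not the analysis pipeline used in this work. The function names and the example molecules are assumptions chosen for illustration, and the unsaturation formula assumes standard valences (tetravalent C, trivalent N, divalent O and S, monovalent H and halogens).

```python
# Illustrative sketch only: the molecule sets below are hypothetical examples,
# not data from this work.

def degrees_of_unsaturation(counts):
    """Rings plus pi bonds from an element-count dict: (2C + 2 + N - H - X) / 2.

    Assumes standard valences; divalent O and S do not enter the formula.
    """
    c = counts.get("C", 0)
    h = counts.get("H", 0)
    n = counts.get("N", 0)
    x = sum(counts.get(el, 0) for el in ("F", "Cl", "Br", "I"))
    return (2 * c + 2 + n - h - x) / 2


def heteroatom_fraction(counts):
    """Fraction of heavy (non-hydrogen) atoms that are not carbon."""
    heavy = sum(v for el, v in counts.items() if el != "H")
    hetero = sum(v for el, v in counts.items() if el not in ("C", "H"))
    return hetero / heavy if heavy else 0.0


def summarize(molecules):
    """Average both descriptors over a non-empty list of element-count dicts."""
    n = len(molecules)
    return {
        "mean_unsaturation": sum(degrees_of_unsaturation(m) for m in molecules) / n,
        "mean_heteroatom_fraction": sum(heteroatom_fraction(m) for m in molecules) / n,
    }


# Hypothetical stand-ins for a training set and a generated set.
training_like = [{"C": 6, "H": 14}, {"C": 6, "H": 6}]                   # hexane, benzene
generated_like = [{"C": 5, "H": 5, "N": 1}, {"C": 4, "H": 4, "O": 1}]   # pyridine, furan

print(summarize(training_like))
print(summarize(generated_like))
```

A shift of both averages to higher values in the generated set would correspond to the kind of bias reported here, namely more unsaturated molecules with more heteroatoms. A real comparison would use the full training and generated datasets rather than a handful of formulas.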
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
"Structural bias in three-dimensional autoregressive generative machine learning of organic molecules", Technical Disclosure Commons, (April 29, 2025)
https://www.tdcommons.org/dpubs_series/8046