Research Question
Can we incorporate chemistry knowledge for conditional molecule generation in diffusion models?
Motivation
- training conditional models require a large amount of labeled data, which is time-consuming to acquire for chemistry (e.g., using DFT calculations)
- (unconditional) diffusion models can be made conditional during inference time via the gradient from posterior p(y|x), where x, y being the molecule and predicted property, respectively.
- molecule properties are best predicted by chemistry software (e.g., xTB) that is non-differentiable (e.g., can't be backpropagated with neural networks)
- gradients from a non-differentiable oracle can be estimated using zeroth-order optimization methods
Conclusion
- ChemGuide significantly reduces atomic forces, enhancing the validity of generated molecules when used for stability optimization
- ChemGuide is compatible with both explicit and implicit guidance in diffusion models, enabling joint optimization of molecular properties and stability
- ChemGuide generalizes effectively to molecular optimization tasks beyond stability optimization.
-
environment
- GeoLDM: please follow the instructions [here] to set up the environment for GeoLDM
- xTB guidance: please follow the instructions [here] to set up xTB. Note that there are two ways of setting up xTB: xTB-python and xTB, please don't use xTB-python, use
Precompiled Binaries from GitHub
instead. To properly install xTB from binaries, please download xTB from Github ([here]). In this paper, we use version 6.6.1. - metric calculation:
- in this work we use metric from GeoLDM [here] and autodE [here]. Note that autodE calls xTB to calculate metrics, so please have xTB already installed before using autodE. To speed up autodE, we modified their code to decrease I/O times, please use
autode
directory in our repo. - to calculate the metric, please use the function in
calculate_metric.py
and adapt it to your project.
- in this work we use metric from GeoLDM [here] and autodE [here]. Note that autodE calls xTB to calculate metrics, so please have xTB already installed before using autodE. To speed up autodE, we modified their code to decrease I/O times, please use
-
data
please follow the instructions of GeoLDM to set up the datasets (i.e., QM9 and GEOM) used in the experiments
-
checkpoint
- unconditional model: we use the model checkpoints released by GeoLDM
- conditional model: we follow the instructions of GeoLDM to train conditional versions of GeoLDM and release the checkpoints at [here]. These directories should be put under
outputs
(e.g.,outputs/exp_cond_alpha
for GeoLDM conditioned on propertyalpha
) - regressor: we follow the instructions of GeoLDM to train property regressors / classifiers and release the checkpoints at [here]. These directories should be put under
regressors
(e.g.,regressors/exp_class_alpha
for regressor of propertyalpha
)
Explanation:
model_path
: specifies the backbone model to use (e.g., unconditional GeoLDM trained on QM9)guidance_step
: specifies the number of steps will be added with guidance until the last diffusion step (e.g., if total diffusion step is 1000 andguidance_step=400
, meaning the last 400 steps will be added with guidance)clf_scale
: the strength of the guidancepackage
: the chemistry software used for guidance, currently onlyc_line
/c_line_xxx
(e.g.,c_line_prop
for ChemGuide on molecular property besides force) is supported, in short of command line, which calls xTB using command linesnum_samples
: how many molecules to generate in one runevery_n_step
: specifies the frequency of the guidance, e.g.,every_n_step=2
means one guidance step will be added every two diffusion steps for the lastguidance_step
of the generation processseed
: sets the random seed of the run, default is None (i.e., no specific random seed is used)run_id
: creates a specific id for the current run, which will be used to save the generated molecules undereval/xxxx/seed{seed}/generated_mols/{run_id}
property
: specifies molecular properties other than force, can be{'alpha', 'mu', 'homo', 'lumo', 'gap', 'Cv'}
recurrent_times
: specifies how many times property guidance will be applied within one guidance stepbilevel_opti
: whether to use bilevel optimization for forces and other molecular propertiesbilevel_method
: which property guidance methods (noisy / clean) to use under the bilevel optimization frameworkclf_scale_force
&clf_scale_prop
: the guidance strength for force & property under the bilevel optimization framework--use_evolution
: whether to use evolution algorithm (to select molecules with desired properties)
For testing purpose (i.e., fast generation), we set the guidance_step=4
(in the paper it is 400), num_samples=3
, and for property guidance, we showcase property='alpha'
. One can specify a particular seed using --seed=xxx
and change --run_id='xxx'
for different runs
-
unconditional GeoLDM + force guidance (variables:
model_path
,guidance_step
,clf_scale
,num_samples
, suggested:every_n_step=1
)python eval_sample_xtb.py --model_path='outputs/qm9_latent2' --guidance_step=4 --clf_scale=0.01 --package='c_line' --num_samples=3 --every_n_step=1 --run_id='run0'
-
unconditional GeoLDM + property guidance
- noisy guidance (variables:
guidance_step
,clf_scale
,num_samples
,property
)python eval_sample_xtb.py --model_path='outputs/qm9_latent2' --guidance_step=4 --clf_scale=0.01 --package='c_line' --num_samples=3 --every_n_step=1 --run_id='run0' --use_neural_guidance --property='alpha'
- clean guidance (
clf_scale_force
should always be 0.0, variables:guidance_step
,clf_scale_prop
,num_samples
,property
)python eval_sample_xtb.py --model_path='outputs/qm9_latent2' --guidance_step=4 --package='c_line' --num_samples=3 --every_n_step=1 --run_id='run0' --bilevel_opti --property='alpha' --recurrent_times=1 --clf_scale_force=0.0 --clf_scale_prop=0.01
- noisy guidance (variables:
-
conditional GeoLDM + force guidance (variables:
model_path
,guidance_step
,clf_scale
,num_samples
,property
, whereproperty
should matchmodel_path
)python eval_sample_xtb.py --model_path='outputs/exp_cond_alpha' --guidance_step=4 --clf_scale=0.01 --package='c_line' --num_samples=3 --every_n_step=1 --run_id='run0' --use_cond_geo --property='alpha'
-
bilevel guidance (suggested:
every_n_step=1
,recurrent_times=1
&clf_scale_force=0.0001
)- noisy guidance (property) + ChemGuide (force) (variables:
guidance_step
,num_samples
,property
,clf_scale_prop
)python eval_sample_xtb.py --model_path='outputs/qm9_latent2' --guidance_step=4 --package='c_line' --num_samples=3 --every_n_step=1 --run_id='run0' --bilevel_opti --property='alpha' --recurrent_times=1 --clf_scale_force=0.0001 --clf_scale_prop=0.01 --bilevel_method='regular_RG'
- clean guidance (property) + ChemGuide (force) (variables:
guidance_step
,num_samples
,property
,clf_scale_prop
)python eval_sample_xtb.py --model_path='outputs/qm9_latent2' --guidance_step=4 --package='c_line' --num_samples=3 --every_n_step=1 --run_id='run0' --bilevel_opti --property='alpha' --recurrent_times=1 --clf_scale_force=0.0001 --clf_scale_prop=0.01 --bilevel_method='uni_guide'
- noisy guidance (property) + ChemGuide (force) (variables:
-
ChemGuide for property (variables:
model_path
,guidance_step
,clf_scale
,num_samples
,property
, suggested:every_n_step=1
)python eval_sample_xtb.py --model_path='outputs/qm9_latent2' --guidance_step=4 --clf_scale=0.01 --package='c_line_prop' --num_samples=3 --every_n_step=1 --run_id='run0' --property='alpha'
-
evolution algorithm (variables:
model_path
,num_samples
,num_beams
,check_variants_interval
, suggested:guidance_step=400
,clf_scale=0.0001
,every_n_step=1
)python eval_sample_xtb.py --model_path='outputs/qm9_latent2' --guidance_step=400 --clf_scale=0.0001 --package='c_line_evo' --num_samples=3 --every_n_step=1 --run_id='run0' --use_evolution --num_beams=3 --check_variants_interval=20
If you find our work interesting / useful, please consider citing our paper.
@inproceedings{shen_chemistry-inspired_2025,
title = {Chemistry-Inspired Diffusion with Non-Differentiable Guidance},
url = {https://openreview.net/forum?id=4dAgG8ma3B},
eventtitle = {The Thirteenth International Conference on Learning Representations},
author = {Shen, Yuchen and Zhang, Chenhao and Fu, Sijie and Zhou, Chenghui and Washburn, Newell and Poczos, Barnabas},
date = {2025-01-22},
langid = {english},
}