Parallel processing doesn't work when generating training datasets using get_3d_lmdb.py #176

congffu · 2023-11-09T07:48:30Z

Hi authors,

Thanks for the great work. I ran into an issue when following the Uni-Mol+ readme to generate the training dataset, and I hope to get some advice on it.

When I run python ../get_3d_lmdb.py train, tqdm estimates about 250 hours to finish generating the dataset, even though our machine has 112 CPU cores. I then tested with 10 molecules and found that sequential and parallel processing run at about the same speed. I narrowed the problem down to the function rdkit_3d_gen (shown below), which prevents any speedup from multiprocessing.

def rdkit_3d_gen(smile, seed):
    mol = read_smiles(smile)
    AllChem.EmbedMolecule(mol, randomSeed=seed, maxAttempts=1000)
    mol = rdkit_mmff(mol)
    pos = mol.GetConformer().GetPositions()
    return mol

If I comment out the line AllChem.EmbedMolecule(mol, randomSeed=seed, maxAttempts=1000), parallel processing gives the expected speedup.
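For reference, the sequential-versus-parallel comparison can be timed with a small harness like the sketch below. It is only an illustration under assumptions: cpu_bound_stand_in is a hypothetical pure-Python placeholder for rdkit_3d_gen, the helper names time_serial and time_parallel are made up for this example, and the use of multiprocessing.Pool is an assumption about how the script parallelizes its work.

```python
import time
from multiprocessing import Pool


def cpu_bound_stand_in(n):
    # Hypothetical placeholder for rdkit_3d_gen(smile, seed):
    # plain CPU-bound work that should scale across processes.
    total = 0
    for i in range(n):
        total += i * i
    return total


def time_serial(func, items):
    # Run func over items one by one and return (results, elapsed seconds).
    start = time.perf_counter()
    results = [func(x) for x in items]
    return results, time.perf_counter() - start


def time_parallel(func, items, processes):
    # Run func over items in a process pool and return (results, elapsed seconds).
    start = time.perf_counter()
    with Pool(processes=processes) as pool:
        results = pool.map(func, items, chunksize=1)
    return results, time.perf_counter() - start


if __name__ == "__main__":
    items = [200_000] * 10  # ten equal work items, like the 10-molecule test
    serial_results, serial_s = time_serial(cpu_bound_stand_in, items)
    parallel_results, parallel_s = time_parallel(cpu_bound_stand_in, items, processes=4)
    assert serial_results == parallel_results
    print(f"serial:   {serial_s:.3f} s")
    print(f"parallel: {parallel_s:.3f} s")
```

Swapping the placeholder for the real per-molecule function, with and without the EmbedMolecule call, would show whether that call is what removes the speedup.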

I would appreciate any suggestions on how to fix this issue. Looking forward to hearing from you.

Thank you
