Correting patoms #210

Bae-SungHan · 2024-03-25T04:55:20Z

Thanks for sharing your awesome work.

I have a question about the pre-processing stage of protein data for Uni-Mol input, specifically how protein atoms are tokenized.
As described in your paper and in the dictionary file you provided, the allowed atom types for proteins are ['C', 'H', 'N', 'O', 'S'], and these five symbols are tokenized and passed to the embedding layer.
However, when I looked into the example processed data in './example_data/pocket/train.lmdb', the protein atom symbols are saved as annotated in the original PDB file, with their suffixes, such as 'CG1', 'HG21' and 'OE1'.
All of these unnormalized symbols are mapped to the UNK token when passed through the TokenizedDataset, and are then fed to the token embedding layer as input.
I think this is inappropriate because the atom type is lost: for example, 'CG1' denotes one of the carbon atoms in the residue, so it should be recognized as 'C', but in the case I described it is recognized as 'UNK'.
I think there should be an additional processing step to correct this kind of protein atom symbol.
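
To illustrate the kind of correction step I mean, here is a minimal sketch. It is not taken from the Uni-Mol code base: the function name `normalize_atom_name`, the `"[UNK]"` string, and the first-letter heuristic are my own assumptions. The heuristic only makes sense when every atom belongs to one of the five allowed elements; it would mislabel, for example, a calcium ion named 'CA' as carbon.

```python
# Hypothetical helper, not part of Uni-Mol: map PDB atom names to element symbols.
ALLOWED_ELEMENTS = ("C", "H", "N", "O", "S")  # atom types listed in the issue
UNK_TOKEN = "[UNK]"  # assumed spelling of the unknown token


def normalize_atom_name(name, allowed=ALLOWED_ELEMENTS, unk=UNK_TOKEN):
    """Return the element symbol for a PDB atom name such as 'CG1' or 'HG21'.

    Assumption: the element is the first letter of the name once surrounding
    whitespace and leading digits (as in older names like '1HB') are dropped.
    """
    stripped = name.strip().lstrip("0123456789")
    if not stripped:
        return unk
    element = stripped[0].upper()
    return element if element in allowed else unk


atoms = ["N", "CA", "CG1", "HG21", "OE1", "SD", "1HB"]
normalized = [normalize_atom_name(a) for a in atoms]
print(normalized)  # ['N', 'C', 'C', 'H', 'O', 'S', 'H']
```

With a step like this applied before tokenization, 'CG1' would reach the embedding layer as 'C' instead of 'UNK'.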

If you have already implemented this step and I simply couldn't find it, could you point me to it?
Or, if there is no such correction step, could you explain why?

Thank you

ZhouGengmo · 2024-03-25T18:12:51Z

Thank you for your interest in our work.
Before the data is fed into the model, we apply the corresponding processing. You can refer to this.

Bae-SungHan changed the title ~~Preprocessing Uni-Mol protein data~~ Preprocessing Uni-Mol protein atoms Mar 25, 2024

Bae-SungHan changed the title ~~Preprocessing Uni-Mol protein atoms~~ Correting patoms Mar 25, 2024
