Bae-SungHan changed the title from "Preprocessing Uni-Mol protein data" to "Preprocessing Uni-Mol protein atoms" on Mar 25, 2024
Bae-SungHan changed the title from "Preprocessing Uni-Mol protein atoms" to "Correting patoms" on Mar 25, 2024
Thanks for sharing your awesome work.
I have a question about the pre-processing stage of protein data for Uni-Mol input, specifically how protein atoms are tokenized.
As described in your paper and the dictionary file you provided, the allowed atom types for proteins are ['C', 'H', 'N', 'O', 'S'], and these five symbols are tokenized and passed to the embedding layer.
However, when I looked into the example processed data in './example_data/pocket/train.lmdb', I found that protein atom symbols are stored as annotated in the original PDB file, with their position suffixes, such as 'CG1', 'HG21', and 'OE1'.
All of these unnormalized symbols are mapped to the UNK token when they pass through the TokenizedDataset, so the token embedding layer receives UNK as input for them.
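To illustrate the behaviour I mean, here is a toy sketch of a vocabulary lookup with an UNK fallback. The vocabulary, the special-token names, and the `tokenize` helper are hypothetical and written only for this example; this is not Uni-Mol's actual dictionary or TokenizedDataset code.

```python
# Toy vocabulary: the special-token names and the index order are assumptions
# made for illustration, not the contents of Uni-Mol's dictionary file.
vocab = ["[PAD]", "[CLS]", "[SEP]", "[UNK]", "C", "N", "O", "S", "H"]
token_to_index = {tok: i for i, tok in enumerate(vocab)}
unk_index = token_to_index["[UNK]"]


def tokenize(symbols):
    """Look up each symbol, falling back to the UNK index when it is missing."""
    return [token_to_index.get(s, unk_index) for s in symbols]


raw_pdb_names = ["CG1", "HG21", "OE1"]
# None of the raw PDB names is in the vocabulary, so each one maps to unk_index.
print(tokenize(raw_pdb_names))
```

With a lookup of this kind, the carbon, hydrogen, and oxygen atoms above become indistinguishable to the embedding layer.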
I think this is inappropriate because the atom type is lost. For example, 'CG1' denotes one of the carbon atoms in the residue, so it should be recognized as 'C', but in the case I described it is recognized as 'UNK'.
I think there should be an additional processing step to correct these protein atom symbols.
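The kind of correction step I have in mind could look like the minimal sketch below. The function name `normalize_atom_name` and the `[UNK]` string are my own assumptions, not part of the Uni-Mol code base. It assumes the atoms come from standard amino-acid residues, where the element is the first letter of the PDB atom name after any leading digits. It would mislabel other cases, such as selenium ('SE') in selenomethionine or two-letter metal ions.

```python
import re

# Atom types listed above; assumed here to match the pocket dictionary.
ALLOWED_ELEMENTS = {"C", "H", "N", "O", "S"}


def normalize_atom_name(name: str, unk: str = "[UNK]") -> str:
    """Map a PDB atom name such as 'CG1' to its element symbol."""
    # Drop whitespace and any leading digits (some PDB files write '1HG2').
    stripped = re.sub(r"^\d+", "", name.strip())
    if not stripped:
        return unk
    # For standard protein atoms, the first remaining letter is the element.
    element = stripped[0].upper()
    return element if element in ALLOWED_ELEMENTS else unk


atoms = ["N", "CA", "CG1", "HG21", "OE1", "SD", "1HG2"]
print([normalize_atom_name(a) for a in atoms])
# -> ['N', 'C', 'C', 'H', 'O', 'S', 'H']
```

Reading the element column of the original PDB record, when it is available, would be more reliable than parsing the atom name.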
If you have already implemented this step and I just couldn't find it, could you point me to it?
If there is no such correction step, could you explain why?
Thank you.