Correting patoms #210

Open
Bae-SungHan opened this issue Mar 25, 2024 · 1 comment

Comments

@Bae-SungHan

Thanks for sharing your awesome work.

I have a question about the pre-processing stage of protein data for Uni-Mol input, specifically how protein atoms are tokenized.
As described in your paper and in the dictionary file you provided, the allowed atom types for proteins are ['C', 'H', 'N', 'O', 'S'], and these five symbols are tokenized and passed to the embedding layer.
However, when I looked at the example processed data in './example_data/pocket/train.lmdb', the protein atom symbols are stored as annotated in the original PDB file, with their suffixes, such as 'CG1', 'HG21' and 'OE1'.
All of these unnormalized symbols are mapped to the UNK token when they pass through the TokenizedDataset, and they reach the token embedding layer as UNK.
I think this is inappropriate because the atom type is lost. For example, 'CG1' denotes another carbon atom in the residue, so it should be recognized as 'C', but in the case I described it is recognized as 'UNK'.
I think there should be an additional processing step to correct this kind of protein atom symbol.
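
To make the step I have in mind concrete, here is a minimal sketch. It is only my assumption of what such a correction could look like, not code from the Uni-Mol repository: the function name `normalize_atom_name`, the `ALLOWED_ELEMENTS` set and the `"[UNK]"` token string are all hypothetical. It strips any leading digits from the PDB atom name and keeps the first letter as the element symbol.

```python
# Hypothetical helper, not part of Uni-Mol: map PDB atom names to element symbols.
ALLOWED_ELEMENTS = {"C", "H", "N", "O", "S"}  # the five protein atom types named above
UNK = "[UNK]"  # assumed spelling of the unknown token


def normalize_atom_name(name: str) -> str:
    """Reduce a PDB atom name such as 'CG1' or 'HG21' to its element symbol."""
    # Some PDB files prefix hydrogen names with a digit (e.g. '1HB'), so drop those first.
    stripped = name.strip().lstrip("0123456789")
    if not stripped:
        return UNK
    element = stripped[0].upper()
    # Anything outside the allowed set stays unknown instead of being guessed.
    return element if element in ALLOWED_ELEMENTS else UNK


atoms = ["CG1", "HG21", "OE1", "N", "SD", "1HB"]
normalized = [normalize_atom_name(a) for a in atoms]
print(normalized)
```

This first-letter rule only works under the assumption that the pocket contains just these five elements. A selenium atom named 'SE', for instance, would be mapped to 'S', so non-standard residues and metal ions would need separate handling.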

If you have already implemented this step and I simply couldn't find it, could you point me to it?
Or, if there is no such correction step, could you explain why?

Thank you

@Bae-SungHan Bae-SungHan changed the title Preprocessing Uni-Mol protein data Preprocessing Uni-Mol protein atoms Mar 25, 2024
@Bae-SungHan Bae-SungHan changed the title Preprocessing Uni-Mol protein atoms Correting patoms Mar 25, 2024
@ZhouGengmo
Contributor

Thank you for your interest in our work.
We apply the corresponding processing before the data is fed into the model. You can refer to this.
