Preprocessed data are in the wrong path 512/wikipedia_pretrain. #27

Open
kaiidams opened this issue Sep 10, 2019 · 2 comments
Comments

@kaiidams

BERT_pretrain.ipynb instructs users to download https://bertonazuremlwestus2.blob.core.windows.net/public/bert_data.tar.gz for the preprocessed data. The tar file places the data in 512/wikipedia_pretrain, but the path should be 512/wiki_pretrain.
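Until the archive is fixed, one possible workaround is to rename the directory after extracting the tar file. The following is only a minimal sketch: the helper name `fix_pretrain_dir` and the idea of a local extraction root are assumptions for illustration and are not part of the notebook or the repository.

```python
from pathlib import Path


def fix_pretrain_dir(data_root, wrong="wikipedia_pretrain", right="wiki_pretrain"):
    """Rename <data_root>/512/<wrong> to <data_root>/512/<right> if needed.

    Hypothetical helper: data_root is wherever bert_data.tar.gz was extracted.
    """
    base = Path(data_root) / "512"
    src, dst = base / wrong, base / right
    if dst.exists():
        # The expected directory is already there; nothing to do.
        return dst
    if not src.exists():
        raise FileNotFoundError(f"neither {src} nor {dst} exists")
    src.rename(dst)
    return dst


# Example call (the path is a placeholder, not taken from the notebook):
# fix_pretrain_dir("/path/to/extracted/bert_data")
```

The function is safe to call more than once, since it returns early when 512/wiki_pretrain already exists.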

@kaiidams
Author

The serialized data files wikipedia_segmented_part_NN.bin refer to WikiNBookCorpusPretrainingDataCreator, which has been deleted in the latest code. Adding the following definition avoids the issue.

# Stub that restores the deleted class name so the serialized data can be loaded.
class WikiNBookCorpusPretrainingDataCreator(PretrainingDataCreator):
    pass
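To illustrate why such a stub helps, here is a small self-contained sketch. It assumes the .bin files are Python pickles, which this thread does not state explicitly. The `PretrainingDataCreator` below is a simplified stand-in, not the repository's real class, and the `instances` attribute is made up for the demo.

```python
import pickle


class PretrainingDataCreator:
    """Simplified stand-in for the repository's base class (assumption)."""

    def __init__(self, instances=None):
        self.instances = instances or []


class WikiNBookCorpusPretrainingDataCreator(PretrainingDataCreator):
    pass


# Simulate data that was serialized while the subclass still existed.
blob = pickle.dumps(WikiNBookCorpusPretrainingDataCreator(["a", "b"]))

# Simulate the class being deleted from the code base.
del WikiNBookCorpusPretrainingDataCreator

try:
    pickle.loads(blob)
    load_failed = False
except AttributeError:
    # pickle stores the class by name and cannot find it any more.
    load_failed = True


# Re-adding an empty subclass under the old name makes the data loadable again.
class WikiNBookCorpusPretrainingDataCreator(PretrainingDataCreator):
    pass


restored = pickle.loads(blob)
```

A pickle records only the module and class name of each object, so any class reachable under the old name is enough for loading, as long as it accepts the stored attributes.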

@skaarthik
Contributor

@kaiidams thanks for reporting this issue. We will update the tar file soon. In the meantime, download and use the data referenced in https://github.com/microsoft/AzureML-BERT/blob/master/docs/artifacts.md#preprocessed-data and you will not need the deleted file for loading the data.
