
Bert Data for Pretraining: No such file or directory: 'bert_data/validation_512_only' #29

Open
nigaregr opened this issue Sep 18, 2019 · 5 comments

Comments

@nigaregr

nigaregr commented Sep 18, 2019

Hi, I have pretraining running, but it fails after the first epoch with the following error:
File "/AzureML-BERT/pretrain/PyTorch/dataset.py", line 100, in __init__
path = get_random_partition(self.dir_path, index)
File "/AzureML-BERT/pretrain/PyTorch/dataset.py", line 33, in get_random_partition
for x in os.listdir(data_directory)]
FileNotFoundError: [Errno 2] No such file or directory: 'bert_data/validation_512_only'

I created the Wikipedia pretraining data using the create_pretraining script, but I do not see a validation_512_only directory being generated.
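For context, line 33 of dataset.py lists the contents of the validation directory, and os.listdir raises FileNotFoundError when the path does not exist, which matches the error above. A simplified sketch of what that helper likely does (this is a reconstruction for illustration, not the repo's exact code):

```python
import os

def get_random_partition(data_directory, index):
    # os.listdir raises FileNotFoundError if data_directory is missing,
    # which is the failure seen after the first epoch.
    partitions = [os.path.join(data_directory, x)
                  for x in os.listdir(data_directory)]
    partitions = sorted(partitions)
    # pick a partition deterministically from the index
    return partitions[index % len(partitions)]
```

So the training code expects bert_data/validation_512_only to exist and contain at least one partition file before validation starts.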

@nigaregr nigaregr changed the title Bert Data for Pretraining Bert Data for Pretraining: No such file or directory: 'bert_data/validation_512_only' Sep 18, 2019
@kishorepv

I think you should create a bert_data/validation_512_only subfolder and put the validation data (i.e., the .bin files generated by create_pretraining) in it.

@skaarthik
Contributor

Thanks @nigaregr for reporting this.
@jingyanwangms can you update the tar file mentioned in https://github.com/microsoft/AzureML-BERT/blob/master/docs/artifacts.md#preprocessed-data with the newly generated wikipedia dataset and the validation folder?

@usuyama

usuyama commented Jan 6, 2020

For now, I created a bert_data/validation_512_only folder and moved wikipedia_segmented_part_98.bin into it; the training pipeline seems to work fine.

It would still be great to have the updated files, @jingyanwangms.
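The workaround described above can be sketched as shell commands. The shard name is the one mentioned in this thread; the directory layout (.bin files sitting directly under bert_data/) is an assumption, so adjust the paths to your setup:

```shell
# Create the validation directory the training code expects.
mkdir -p bert_data/validation_512_only

# Move one pretraining shard there to serve as the validation set
# (guarded so the command is a no-op if the shard is elsewhere).
if [ -f bert_data/wikipedia_segmented_part_98.bin ]; then
  mv bert_data/wikipedia_segmented_part_98.bin bert_data/validation_512_only/
fi
```

Note that any shard held out this way is no longer seen during training, so it acts as a simple held-out validation split.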

@Howal

Howal commented Apr 13, 2020

Hi @skaarthik, have you decided whether to update the zipped dataset or the data-prep instructions?
Also, if I do what @usuyama suggested, will there be any impact on performance, such as a drop in accuracy?
Thanks!

@skaarthik
Contributor

Hi @Howal, what @usuyama did is a reasonable workaround in the absence of some other validation set.
