RoBERTa Corpus #181

Open
stephencurry-web opened this issue Apr 1, 2024 · 1 comment

Comments

@stephencurry-web

The RoBERTa corpus is a combination of multiple sources. Did you perform any form of filtering?
The BookCorpus dataset alone has 74M rows, but I saw that your RoBERTa folder is named 20M. May I ask what rules you used to filter the final data? I hope to receive a detailed description, or, if possible, a public release of your RoBERTa training data. Thank you for your help.

@t1101675
Contributor

t1101675 commented Apr 2, 2024

We didn't perform any data filtering for the corpus. We constructed the data as follows:

  1. Combine the sources.
  2. Shuffle the documents.
  3. Tokenize them into chunks of 512 tokens.
  4. Take the first 20M chunks for training (in practice, we stopped tokenization once the tokenized data contained 20M chunks).
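The steps above can be sketched roughly as follows. This is only an illustration of the described procedure, not the authors' actual code: the `tokenize` callable, the random seed, and the tiny `target_chunks` value are stand-in assumptions (the real pipeline targeted 20M chunks of 512 tokens).

```python
import random

CHUNK_SIZE = 512      # tokens per chunk, as stated above
TARGET_CHUNKS = 4     # illustrative; the real target was 20M

def build_chunks(documents, tokenize, chunk_size=CHUNK_SIZE,
                 target_chunks=TARGET_CHUNKS, seed=0):
    """Shuffle documents, tokenize them into fixed-size chunks,
    and stop early once the target number of chunks is reached."""
    docs = list(documents)                      # step 1: combined sources
    random.Random(seed).shuffle(docs)           # step 2: shuffle

    buffer, chunks = [], []
    for doc in docs:                            # step 3: tokenize
        buffer.extend(tokenize(doc))
        while len(buffer) >= chunk_size:
            chunks.append(buffer[:chunk_size])
            buffer = buffer[chunk_size:]
            if len(chunks) >= target_chunks:    # step 4: stop early
                return chunks
    return chunks

# Toy usage: a whitespace "tokenizer" over synthetic documents.
docs = [f"token{i} " * 300 for i in range(20)]
chunks = build_chunks(docs, str.split)
print(len(chunks), len(chunks[0]))  # 4 512
```

Note that leftover tokens in `buffer` at a document boundary are carried into the next chunk rather than discarded, so chunks can span document boundaries, matching a fixed-length chunking scheme.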
