RoBERTa Corpus #181

Open
stephencurry-web opened this issue Apr 1, 2024 · 1 comment

Comments

@stephencurry-web

The RoBERTa corpus is a combination of multiple sources. Did you perform any form of filtering?
The BookCorpus dataset alone has 74M rows, but I saw that your RoBERTa folder is named 20M. May I ask what rules you used to filter the final data? I hope to receive a detailed description, or, if possible, a public release of your RoBERTa training data. Thank you for your help.

@t1101675
Contributor

t1101675 commented Apr 2, 2024

We didn't perform any data filtering for the corpus. We constructed the data as follows:

  1. Combine the sources.
  2. Shuffle the documents.
  3. Tokenize them into chunks of 512 tokens.
  4. Take the first 20M chunks for training (in practice, we stopped tokenization once the tokenized data contained 20M chunks).
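The steps above can be sketched roughly as follows. This is only an illustration of the described procedure, not the authors' actual code: the `tokenize` callable, the random seed, and the tiny `target_chunks` value are stand-in assumptions (the real pipeline targeted 20M chunks of 512 tokens).

```python
import random

CHUNK_SIZE = 512      # tokens per chunk, as stated above
TARGET_CHUNKS = 4     # illustrative; the real target was 20M

def build_chunks(documents, tokenize, chunk_size=CHUNK_SIZE,
                 target_chunks=TARGET_CHUNKS, seed=0):
    """Shuffle documents, tokenize them into fixed-size chunks,
    and stop early once the target number of chunks is reached."""
    docs = list(documents)                      # step 1: combined sources
    random.Random(seed).shuffle(docs)           # step 2: shuffle

    buffer, chunks = [], []
    for doc in docs:                            # step 3: tokenize
        buffer.extend(tokenize(doc))
        while len(buffer) >= chunk_size:
            chunks.append(buffer[:chunk_size])
            buffer = buffer[chunk_size:]
            if len(chunks) >= target_chunks:    # step 4: stop early
                return chunks
    return chunks

# Toy usage: a whitespace "tokenizer" over synthetic documents.
docs = [f"token{i} " * 300 for i in range(20)]
chunks = build_chunks(docs, str.split)
print(len(chunks), len(chunks[0]))  # 4 512
```

Note that leftover tokens in `buffer` at a document boundary are carried into the next chunk rather than discarded, so chunks can span document boundaries, matching a fixed-length chunking scheme.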
