AmazonQA Data Size #65

Open
jasonwu0731 opened this issue Mar 24, 2020 · 2 comments

Comments

@jasonwu0731

Hi,

I downloaded the Amazon data (38 files) and ran create_data.py with:

python amazon_qa/create_data.py --file_pattern AmazonQA/* --output_dir AmazonQA/processed/ --runner DirectRunner --temp_location AmazonQA/processed/temp --staging_location AmazonQA/processed/staging --dataset_format JSON

This produces 100 train*.json and 100 test*.json files under the AmazonQA/processed/ folder. After reading all the data, I count 158,974 samples in the training set and 16,763 in the test set.
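For anyone wanting to reproduce these counts, here is a minimal sketch of how the shards could be tallied. It assumes each output shard holds one JSON example per line and that shards are named like train-00000-of-00100.json; the helper name count_examples is hypothetical and not part of the repo.

```python
import glob
import os


def count_examples(output_dir, split):
    """Count non-empty lines across all shards named '<split>-*.json'.

    Assumption: the pipeline writes one JSON example per line.
    """
    total = 0
    pattern = os.path.join(output_dir, f"{split}-*.json")
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as f:
            total += sum(1 for line in f if line.strip())
    return total


if __name__ == "__main__":
    # The directory below is the --output_dir from the command above.
    for split in ("train", "test"):
        print(split, count_examples("AmazonQA/processed/", split))
```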

How many samples did you use in the paper: 3M or 158.9K? I am confused because my count differs from the number listed in the repo.

P.S. I noticed that create_data.py applies some filtering functions.

For reference, the repo lists these statistics for the conversational dataset:

Input files: 38
Number of QA dictionaries: 1,569,513
Number of tuples: 4,035,625
Number of de-duplicated tuples: 3,689,912
Train set size: 3,316,905
Test set size: 373,007

Thank you in advance for your kind reply.

@matthen
Contributor

matthen commented Mar 24, 2020

Hi Jason,
the training set size should be 3.3M. Could you check that there are indeed 38 input files? I get: TOTAL: 38 objects, 1935927109 bytes (1.8 GiB)

I just re-ran the pipeline (with the Google Cloud DataflowRunner and JSON output) and can confirm these numbers. A quick check: wc -l data/test-00099-of-00100.json gives 3729.
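The quick check can be turned into a small sanity test. The sketch below is only an illustration: it assumes the 100 shards are roughly equal in size (the pipeline may not guarantee this), and the function name and the 5% tolerance are hypothetical choices rather than anything defined in the repo. The dataset sizes are the ones listed in the issue.

```python
# Sizes as listed in the repo statistics quoted in this issue.
TRAIN_SIZE = 3_316_905
TEST_SIZE = 373_007
DEDUPED_TUPLES = 3_689_912
NUM_SHARDS = 100

# The train and test sizes should add up to the de-duplicated total.
assert TRAIN_SIZE + TEST_SIZE == DEDUPED_TUPLES


def shard_looks_complete(observed_lines, split_size, num_shards, tolerance=0.05):
    """Return True if one shard's line count is within `tolerance`
    (relative) of the average expected per shard.

    Assumption: shards are approximately evenly sized.
    """
    expected = split_size / num_shards
    return abs(observed_lines - expected) <= tolerance * expected


if __name__ == "__main__":
    # 3729 is the `wc -l` result reported above for the last test shard.
    print(shard_looks_complete(3729, TEST_SIZE, NUM_SHARDS))
```

A shard whose line count falls far below the expected average would suggest that the run did not process the full input.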

@jasonwu0731
Author

jasonwu0731 commented Mar 26, 2020

That's strange. I do have 38 files totaling around 1.8G. Could the issue be caused by using --runner DirectRunner?

When I ran wc AmazonQA/processed/test-00099-of-00100.json I got 167 6503 39507 AmazonQA/processed/test-00099-of-00100.json, that is 167 lines versus your 3729. I also found that my AmazonQA/processed/ folder is only 41M.

Thanks for helping.
