AmazonQA Data Size #65

jasonwu0731 · 2020-03-24T09:45:51Z

Hi,

I downloaded the Amazon data (38 files) and ran create_data.py with:

python amazon_qa/create_data.py --file_pattern AmazonQA/* --output_dir AmazonQA/processed/ --runner DirectRunner --temp_location AmazonQA/processed/temp --staging_location AmazonQA/processed/staging --dataset_format JSON

This produces 100 train*.json and 100 test*.json files under the AmazonQA/processed/ folder. After reading all the data, I count 158,974 samples in the training set and 16,763 in the test set.

How many samples did you use in the paper: 3M or 158.9K? I am confused because my count differs from the number listed in the repo.

P.S. I noticed that create_data.py applies some filtering functions.

Below are the statistics listed in the repo for the conversational dataset:

Input files: 38
Number of QA dictionaries: 1,569,513
Number of tuples: 4,035,625
Number of de-duplicated tuples: 3,689,912
Train set size: 3,316,905
Test set size: 373,007

Thank you in advance for your kind reply.


matthen · 2020-03-24T10:50:47Z

Hi Jason,
The training set size should be 3.3M. Maybe check that there are indeed 38 input files? For reference: TOTAL: 38 objects, 1935927109 bytes (1.8 GiB)

I just re-ran the pipeline (with the Google Cloud DataflowRunner and JSON output) and can confirm these numbers. A quick check is wc -l data/test-00099-of-00100.json, which gives 3729.
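
The same check can be extended to every shard. The following is a minimal sketch, assuming the JSON output holds one example per line (which is what the wc -l check above implies); the helper names count_lines and total_examples are hypothetical and not part of the repo, and the AmazonQA/processed/ path is the output directory from the command in the first comment.

```python
import glob


def count_lines(pattern):
    """Return {path: line count} for every file matching the glob pattern."""
    counts = {}
    for path in sorted(glob.glob(pattern)):
        # Read in binary mode so encoding problems cannot abort the count.
        with open(path, "rb") as f:
            counts[path] = sum(1 for _ in f)
    return counts


def total_examples(pattern):
    """Sum the line counts over all matching shards."""
    return sum(count_lines(pattern).values())


if __name__ == "__main__":
    # Assumed layout: <split>-XXXXX-of-00100.json under the output directory.
    for split in ("train", "test"):
        pattern = f"AmazonQA/processed/{split}-*.json"
        print(split, total_examples(pattern))
```

If the pipeline ran as expected, the totals should be close to the train and test set sizes listed in the repo. Note that, unlike wc -l, this also counts a final line that lacks a trailing newline.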

jasonwu0731 · 2020-03-26T07:19:55Z

That's strange. I do have 38 files totaling around 1.8G. So is the problem caused by using --runner DirectRunner?

When I ran wc AmazonQA/processed/test-00099-of-00100.json, I got 167 6503 39507 AmazonQA/processed/test-00099-of-00100.json, i.e. 167 lines. Also, my AmazonQA/processed/ folder is only 41M.

Thanks for helping.
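
As a rough sanity check, the counts quoted in this thread can be compared directly. The sketch below uses only numbers that appear above (the DirectRunner results versus the sizes listed in the repo and the 3729-line shard); the fraction helper is a hypothetical name introduced for illustration.

```python
# Counts quoted in this thread: (observed with DirectRunner, expected).
counts = {
    "train examples": (158_974, 3_316_905),
    "test examples": (16_763, 373_007),
    "lines in test-00099-of-00100.json": (167, 3_729),
}


def fraction(observed, expected):
    """Share of the expected count that was actually produced."""
    return observed / expected


for name, (obs, exp) in counts.items():
    print(f"{name}: {obs} / {exp} = {fraction(obs, exp):.1%}")
```

All three ratios come out at roughly 4.5% to 4.8%, which is consistent with every shard being present but uniformly smaller than expected. This does not by itself identify the cause.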
