Machine Comprehension Train on MSMARCO with S-NET Extraction Modification

zlsh80826/MSMARCO

MSMARCO with S-NET Extraction (Extraction-net)

Requirements

The following libraries are required for training and evaluation.

General

  • python3.6
  • cuda-9.0 (required by CNTK)
  • openmpi-1.10 (required by CNTK)
  • gcc >= 6 (required by CNTK)

Python

  • Please refer to requirements.txt

Evaluate with pretrained model

This repo provides a pretrained model and a pre-processed validation dataset for testing performance.

Download the pretrained model and the pre-processed data, put them in MSMARCO/data and the MSMARCO root directory respectively, then decompress them so that the files end up in the locations shown below.

The code structure should look like this:

MSMARCO
├── data
│   ├── elmo_embedding.bin
│   ├── test.tsv
│   ├── vocabs.pkl
│   ├── data.tar.gz
│   └── ... others
├── model
│   ├── pm.model
│   ├── pm.model.ckp
│   └── pm.model_out.json
└── ... others
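
Before running the evaluation, it can help to confirm that the files named in the tree above are in place. The following is a minimal sketch and not part of this repository: the helper name `missing_files` and the `EXPECTED` list are illustrative, and the list assumes that the three data files and `model/pm.model` are the inputs the evaluation needs.

```python
from pathlib import Path

# Files taken from the directory tree above. Treating exactly these four
# as required inputs is an assumption, not something the repo documents.
EXPECTED = [
    "data/elmo_embedding.bin",
    "data/test.tsv",
    "data/vocabs.pkl",
    "model/pm.model",
]


def missing_files(root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]
```

Calling `missing_files(".")` from the MSMARCO root directory returns an empty list when every listed file is present, and otherwise returns the paths that still need to be downloaded or decompressed.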

After decompressing, run:

cd Evaluation
sh eval.sh

You should then get the generated answers and the ROUGE-L score.
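
For readers unfamiliar with the metric, ROUGE-L scores a generated answer by the longest common subsequence (LCS) it shares with the reference answer. The sketch below is illustrative only: it assumes lowercased whitespace tokenization and a single reference, and the default `beta=1.2` is an assumed F-measure weight. The evaluation script in this repository may normalize text and weight precision and recall differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.2):
    """LCS-based F-measure between two strings (beta is an assumed weight)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```

An exact match scores 1.0 and answers with no shared tokens score 0.0; with `beta=1.0` the function reduces to the usual F1 of LCS precision and recall.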

Usage

Preprocess

MSMARCO V1

Download the MSMARCO v1 dataset and the GloVe embeddings.

cd data
python3.6 download.py v1

Convert the raw data to TSV format

python3.6 convert_msmarco.py v1 --threads=`nproc` 

Convert the TSV files to CTF (CNTK Text Format, the CNTK input format) and build the vocabulary dictionary

python3.6 tsv2ctf.py
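
To make the TSV-to-CTF step more concrete: in CTF, each line starts with a sequence id, followed by one `|name index:value` entry per input stream, and consecutive lines with the same id form one sequence. The sketch below shows the idea for one query/context pair. It is not the code in `tsv2ctf.py`: the stream names `qw` and `cw`, the `<UNK>` entry, and both helper functions are hypothetical.

```python
from collections import Counter
from itertools import zip_longest


def build_vocab(token_lists, min_count=1):
    """Map tokens to integer ids; id 0 is reserved for unknown tokens."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {"<UNK>": 0}
    # Most frequent tokens first, ties broken alphabetically.
    for tok, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab


def to_ctf_lines(seq_id, query, context, vocab):
    """Emit one CTF line per position, with sparse one-hot word ids."""
    lines = []
    for q, c in zip_longest(query, context):
        fields = [str(seq_id)]
        if q is not None:
            fields.append("|qw {}:1".format(vocab.get(q, 0)))
        if c is not None:
            fields.append("|cw {}:1".format(vocab.get(c, 0)))
        lines.append(" ".join(fields))
    return lines


query = ["what", "is", "cntk"]
context = ["cntk", "is", "a", "toolkit"]
vocab = build_vocab([query, context])
ctf = to_ctf_lines(0, query, context, vocab)
```

Here the shorter stream simply stops contributing entries once it runs out of tokens, so the last line carries only the `cw` stream.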

Generate the ELMo embeddings

sh elmo.sh

MSMARCO V2

Download the MSMARCO v2 dataset and the GloVe embeddings.

cd data
python3.6 download.py v2

Convert the raw data to TSV format

python3.6 convert_msmarco.py v2 --threads=`nproc`

Convert the TSV files to CTF (CNTK Text Format, the CNTK input format) and build the vocabulary dictionary

python3.6 tsv2ctf.py

Generate the ELMo embeddings

sh elmo.sh

Train (Same for V1 and V2)

cd ../script
mkdir log
sh run.sh

Evaluate on the development dataset

MSMARCO V1

cd Evaluation
sh eval.sh v1

MSMARCO V2

cd Evaluation
sh eval.sh v2

Performance

Paper

Model                           rouge-l   bleu_1
S-Net (Extraction)              41.45     44.08
S-Net (Extraction, Ensemble)    42.92     44.97

This implementation

Setting                         rouge-l   bleu_1
MSMARCO v1 w/o elmo             38.43     39.14
MSMARCO v1 w/ elmo              39.42     39.47
MSMARCO v2 w/ elmo              43.66     44.44
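
The differences between these rows can be worked out directly from the figures above. The short sketch below does only that arithmetic; the dictionary keys and the `delta` helper are illustrative, and it makes no assumption about whether the paper and this implementation were evaluated under identical conditions.

```python
# (rouge-l, bleu_1) pairs copied from the tables above.
paper = {
    "single": (41.45, 44.08),
    "ensemble": (42.92, 44.97),
}
ours = {
    "v1_no_elmo": (38.43, 39.14),
    "v1_elmo": (39.42, 39.47),
    "v2_elmo": (43.66, 44.44),
}


def delta(a, b):
    """Element-wise difference a - b, rounded to two decimals."""
    return tuple(round(x - y, 2) for x, y in zip(a, b))


# Gain from adding ELMo on v1, and remaining gap to the paper's single model.
elmo_gain_v1 = delta(ours["v1_elmo"], ours["v1_no_elmo"])
gap_to_paper_v1 = delta(paper["single"], ours["v1_elmo"])
```

On v1, ELMo adds 0.99 rouge-l and 0.33 bleu_1, and the v1 model with ELMo is 2.03 rouge-l and 4.61 bleu_1 below the paper's single-model figures.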

TODO

  • Multi-threads preprocessing
  • Elmo-Embedding
  • Evaluation script
  • MSMARCO v2 support
  • Reasonable metrics
