`Table Extraction from PDF scientific papers`

This library implements the paper "Graph Neural Networks and Representation Embedding for Table Extraction in PDF Documents", accepted at ICPR 2022.

Init and train

To reproduce the training described in the paper, follow these steps:

Download the repo

git clone https://github.com/AILab-UniFI/GNN-TableExtraction

Install the repo

cd GNN-TableExtraction && pip install -e .

Set the project root path: in src/utils/paths.py, set the root variable to /path/to/GNN-TableExtraction.

Download the data and prepare it

python src/data/datasets_download.py

Build graph features and save them

python src/features/features_build.py

Then train the model

python src/models/model_train.py
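
For the root path step above, here is a minimal sketch of what a paths module can look like. Only the file src/utils/paths.py and the idea of a root variable come from this README. The names ROOT and project_paths are hypothetical, and the actual file in the repository may be laid out differently.

```python
from pathlib import Path

# Assumption: placeholder value from the README; replace it with your local clone.
ROOT = Path("/path/to/GNN-TableExtraction")


def project_paths(root: Path = ROOT) -> dict:
    """Derive the folders shown in the project tree from a single root."""
    data = root / "data"
    return {
        "root": root,
        "data": data,
        "raw": data / "raw",
        "processed": data / "processed",
        "models": root / "models",
        "configs": root / "configs",
    }
```

Deriving every folder from one root means only a single line needs editing when the repository is moved or cloned on another machine.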

run_multiple_train.sh contains some example training commands; the complete documentation of the options is inside the .yaml files in configs.
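
The download, feature-building and training scripts listed above can also be chained from Python. This is a minimal sketch, assuming the scripts are run from the repository root and need no extra arguments. The names STEPS and run_pipeline are hypothetical and not part of the repository.

```python
import subprocess
import sys

# Script paths as given in the steps above, in the order they must run.
STEPS = [
    "src/data/datasets_download.py",
    "src/features/features_build.py",
    "src/models/model_train.py",
]


def run_pipeline(repo_root=".", dry_run=False):
    """Run each step in order and stop at the first failure.

    With dry_run=True the commands are only returned, not executed.
    """
    commands = [[sys.executable, script] for script in STEPS]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if a step exits non-zero.
            subprocess.run(cmd, cwd=repo_root, check=True)
    return commands
```

Calling run_pipeline(dry_run=True) first shows the commands without starting a download or a training run.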

Project Organization

├── LICENSE
├── README.md          <- The top-level README for developers using this project.
├── data               <- Generated by script and downloading data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── setup.py           <- makes project pip installable (pip install -e .) so src can be imported

Project based on the cookiecutter data science project template. #cookiecutterdatascience

DATASETS

PubLayNet

PubLayNet is composed of PDFs and annotations, which are available from the following sources:

PDFs of the document pages in PubLayNet are accessible here.
Annotations (labels) are available here; the download is a file labels.tar containing train.json and val.json.

Save the PDFs in /DATA/pdfs/ and the annotations in DATA/annotations/
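
Once labels.tar is in place, the annotations can be unpacked and inspected with the standard library. This is a minimal sketch, assuming the JSON files follow the COCO layout (top-level categories and annotations lists) and that tables are stored under a category named "table". The function names extract_labels and tables_per_image are hypothetical and not part of the repository.

```python
import json
import tarfile
from collections import Counter
from pathlib import Path


def extract_labels(tar_path, out_dir):
    """Unpack labels.tar and return the paths of the JSON files inside it.

    Only use this on archives from a trusted source.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as tar:
        names = [m.name for m in tar.getmembers() if m.name.endswith(".json")]
        tar.extractall(out_dir)
    return [out_dir / name for name in names]


def tables_per_image(coco, category_name="table"):
    """Count annotations of one category for every image id.

    Assumption: coco is a dict in COCO layout, as loaded with json.load.
    """
    wanted = {c["id"] for c in coco["categories"] if c["name"] == category_name}
    return Counter(
        ann["image_id"]
        for ann in coco["annotations"]
        if ann["category_id"] in wanted
    )
```

For example, extract_labels("DATA/annotations/labels.tar", "DATA/annotations") would unpack the two JSON files next to the archive. Each file can then be loaded with json.load and passed to tables_per_image.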

PubTables-1M

Getting the Data

PubTables-1M is available for download from Microsoft Research Open Data.

It comes in 5 tar.gz files:

[no] PubTables-1M-Image_Page_Detection_PASCAL_VOC.tar.gz
[no] PubTables-1M-Image_Page_Words_JSON.tar.gz
[maybe] PubTables-1M-Image_Table_Structure_PASCAL_VOC.tar.gz
[maybe] PubTables-1M-Image_Table_Words_JSON.tar.gz
[yes] PubTables-1M-PDF_Annotations_JSON.tar.gz

To download from the command line:

Visit the dataset home page with a web browser and click Download in the top left corner.
This will create a link to download the dataset from Azure with a unique access token for you that looks like https://msropendataset01.blob.core.windows.net/pubtables1m?[SAS_TOKEN_HERE].
You can then use the command line tool azcopy to download all of the files with the following command:

azcopy copy "https://msropendataset01.blob.core.windows.net/pubtables1m?[SAS_TOKEN_HERE]" "/path/to/your/download/folder/" --recursive
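
The same download can be launched from a Python script. This is a minimal sketch that only assembles and runs the azcopy command shown above. The function names are hypothetical, the SAS token still has to be obtained from the dataset page, and azcopy must already be installed and on the PATH.

```python
import shutil
import subprocess

# Container URL taken from the command above; the token is supplied by the caller.
BASE_URL = "https://msropendataset01.blob.core.windows.net/pubtables1m"


def azcopy_command(sas_token, dest):
    """Build the argument list for the azcopy call shown above."""
    # The token is appended after '?' as in the link from the dataset page.
    return ["azcopy", "copy", f"{BASE_URL}?{sas_token}", str(dest), "--recursive"]


def download_pubtables(sas_token, dest):
    """Run azcopy, failing early with a clear message if it is not installed."""
    if shutil.which("azcopy") is None:
        raise RuntimeError("azcopy was not found on PATH")
    subprocess.run(azcopy_command(sas_token, dest), check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems with the special characters that SAS tokens usually contain.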

Then unzip each of the archives from the command line using:

tar -xzvf yourfile.tar.gz
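
When several archives were downloaded, the extraction can be looped in Python instead of calling tar once per file. This is a minimal sketch using the standard tarfile module. The function name extract_all and its only parameter are hypothetical. Unlike the tar command above, it unpacks each archive into its own subdirectory, and it should only be run on archives from a trusted source.

```python
import tarfile
from pathlib import Path


def extract_all(folder, only=None):
    """Extract every .tar.gz in folder into a subdirectory named after the archive.

    only: optional collection of archive file names to restrict the extraction.
    Returns the list of directories that were written to.
    """
    folder = Path(folder)
    extracted = []
    for archive in sorted(folder.glob("*.tar.gz")):
        if only is not None and archive.name not in only:
            continue
        # Strip the ".tar.gz" suffix to get the target directory name.
        target = folder / archive.name[: -len(".tar.gz")]
        target.mkdir(exist_ok=True)
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target)
        extracted.append(target)
    return extracted
```

Passing only={"PubTables-1M-PDF_Annotations_JSON.tar.gz"} would unpack just the archive marked [yes] in the list above.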

Cite this project

If you want to use our code in your project(s), please cite us:

@INPROCEEDINGS{9956590,  
author={Gemelli, Andrea and Vivoli, Emanuele and Marinai, Simone},  
booktitle={2022 26th International Conference on Pattern Recognition (ICPR)},   
title={Graph Neural Networks and Representation Embedding for Table Extraction in PDF Documents},   
year={2022},  
volume={},  
number={},  
pages={1719-1726},  
doi={10.1109/ICPR56361.2022.9956590}}
