Code for ICPR2022 paper: "Graph Neural Networks and Representation Embedding for table extraction in PDF Documents"

AILab-UniFI/GNN-TableExtraction

Table Extraction from PDF scientific papers

This library is the implementation of the paper Graph Neural Networks and Representation Embedding for Table Extraction in PDF Documents, accepted at ICPR2022.

Init and train

To reproduce the paper's code and training, follow these steps:

  • Download the repo
git clone GNN-TableExtraction
  • Install the repo
cd GNN-TableExtraction && pip install -e .
  • Set the project root path: set the root variable in src/utils/paths.py to /path/to/GNN-TableExtraction.
  • Download and prepare the data
python src/data/datasets_download.py
  • Build graph features and save them
python src/features/features_build.py
  • Then train the model
python src/models/model_train.py
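
The three script steps above can also be chained from Python so that a failure in one step stops the run. This is a minimal sketch, not part of the repository: the helper names `build_commands` and `run_pipeline` are hypothetical, and it assumes the scripts take no required arguments, exactly as they are invoked above.

```python
import subprocess
import sys
from pathlib import Path

# Script paths as listed in the steps above, relative to the repository root.
PIPELINE = [
    "src/data/datasets_download.py",
    "src/features/features_build.py",
    "src/models/model_train.py",
]


def build_commands(python=sys.executable, scripts=PIPELINE):
    """Return one argv list per pipeline step, in execution order."""
    return [[python, script] for script in scripts]


def run_pipeline(repo_root, dry_run=False):
    """Run each step from repo_root; stop at the first failing step.

    With dry_run=True the commands are only printed.
    """
    for cmd in build_commands():
        print("running:", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(cmd, cwd=Path(repo_root), check=True)
```

Calling `run_pipeline("/path/to/GNN-TableExtraction", dry_run=True)` prints the commands without executing them, which is a cheap way to check the order before starting a long download.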

run_multiple_train.sh contains some example training commands; complete documentation is in the .yaml files in configs.

Project Organization


├── LICENSE
├── README.md          <- The top-level README for developers using this project.
├── data               <- Generated by script and downloading data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── setup.py           <- makes project pip installable (pip install -e .) so src can be imported

Project based on the cookiecutter data science project template. #cookiecutterdatascience

DATASETS

PubLayNet

PubLayNet is composed of PDFs and annotations. The files are accessible as follows:

  • PDFs of the document pages in PubLayNet are accessible here.
  • Annotations (labels) are available here; the download is a file labels.tar containing train.json and val.json.

Save the PDFs in /DATA/pdfs/ and the annotations in DATA/annotations/
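
Unpacking the annotation archive can be sketched as follows. This is an illustrative helper, not code from the repository: the function name `extract_annotations` is hypothetical, and since the internal layout of labels.tar is not described here, the sketch searches the destination recursively for train.json and val.json instead of assuming they sit at the top level.

```python
import tarfile
from pathlib import Path


def extract_annotations(archive, dest="DATA/annotations"):
    """Unpack the labels archive into dest and locate train.json and val.json.

    Returns a dict mapping each file name to the path where it was found.
    Only extract archives obtained from a source you trust.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive) as tar:  # default mode auto-detects compression
        tar.extractall(dest)

    found = {}
    for name in ("train.json", "val.json"):
        # The files may be nested in a sub-folder, so search recursively.
        matches = sorted(dest.rglob(name))
        if not matches:
            raise FileNotFoundError(f"{name} not found under {dest}")
        found[name] = matches[0]
    return found
```

A call such as `extract_annotations("labels.tar")` raises `FileNotFoundError` if either JSON file is missing, so an incomplete download is caught before the feature-building step.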

PubTables-1M

Getting the Data

PubTables-1M is available for download from Microsoft Research Open Data.

It comes in 5 tar.gz files:

  • [no] PubTables-1M-Image_Page_Detection_PASCAL_VOC.tar.gz
  • [no] PubTables-1M-Image_Page_Words_JSON.tar.gz
  • [maybe] PubTables-1M-Image_Table_Structure_PASCAL_VOC.tar.gz
  • [maybe] PubTables-1M-Image_Table_Words_JSON.tar.gz
  • [yes] PubTables-1M-PDF_Annotations_JSON.tar.gz

To download from the command line:

  • Visit the dataset home page with a web browser and click Download in the top left corner.
  • This will create a link to download the dataset from Azure with a unique access token for you that looks like https://msropendataset01.blob.core.windows.net/pubtables1m?[SAS_TOKEN_HERE].
  • You can then use the command line tool azcopy to download all of the files with the following command:
azcopy copy "https://msropendataset01.blob.core.windows.net/pubtables1m?[SAS_TOKEN_HERE]" "/path/to/your/download/folder/" --recursive

Then unzip each of the archives from the command line using:

tar -xzvf yourfile.tar.gz
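
With several archives in the download folder, the extraction can be looped from Python instead of running tar once per file. The sketch below is an assumption-laden convenience, not part of the repository: the name `extract_all` is hypothetical, and unlike the tar command above, which unpacks into the current directory, it places each archive's contents in its own folder named after the archive.

```python
import tarfile
from pathlib import Path

SUFFIX = ".tar.gz"


def extract_all(download_dir):
    """Extract every .tar.gz file in download_dir into its own sub-folder.

    Returns the list of folders that were created or reused.
    Only extract archives obtained from a source you trust.
    """
    extracted = []
    for archive in sorted(Path(download_dir).glob("*" + SUFFIX)):
        # e.g. "name.tar.gz" is unpacked into a sibling folder "name".
        target = archive.with_name(archive.name[: -len(SUFFIX)])
        target.mkdir(exist_ok=True)
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target)
        extracted.append(target)
    return extracted
```

Keeping one folder per archive avoids mixing files from different archives; drop the `target` folder and extract into `download_dir` directly if the flat layout produced by tar is preferred.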

Cite this project

If you want to use our code in your project(s), please cite us:

@INPROCEEDINGS{9956590,  
author={Gemelli, Andrea and Vivoli, Emanuele and Marinai, Simone},  
booktitle={2022 26th International Conference on Pattern Recognition (ICPR)},   
title={Graph Neural Networks and Representation Embedding for Table Extraction in PDF Documents},   
year={2022},  
volume={},  
number={},  
pages={1719-1726},  
doi={10.1109/ICPR56361.2022.9956590}}