Multimodal Transformer for Comics Text-Cloze

Authors: Emanuele Vivoli*, Joan Lafuente Baeza*, Ernest Valveny Llobet and Dimosthenis Karatzas
Contact: evivoli@cvc.uab.cat, joan.lafuente@autonoma.cat
The paper is available here for more information.

Example

The figure below shows two instances of the text-cloze task and an example of the dialogue generation task.

Project Description

The goal of this project is to perform the text-cloze task in the multimodal context of comics, using the dataset provided in COMICS. We also explore the possibility of dialogue generation.

Getting Started

Clone this repo (for help see this tutorial).

git clone https://github.com/joanlafuente/ComicVT5.git
cd ComicVT5

Download and prepare the dataset.

a. Download our preprocessed version of the original dataset as well as a version with textract OCR from here and place it in datasets/COMICS. If you would like to use the original dataset, you can download it from the COMICS repository.

Note that there are two files for the test set. The original authors filtered some dialogues from this set based on their tokenizer. To be able to compare the results of our model with the original ones, we maintain this filter in the test set. The "full" version, on the other hand, is not filtered.

b. Extract the visual features using one of the methods explained here, within this repo.

Install the dependencies.

# Create a conda environment with the required dependencies (optional but recommended)
# Note: the file is spelled "enviroment.yml" in the repository
conda env create -f enviroment.yml

[Optional] Download the pretrained weights of the best-performing models on the easy and hard tasks, as well as CRN-Scratch, from here. If you use the VL-T5 model, download its pretrained weights from their repository here.

Configuration

Every model, dataset, and trainer is configured in a YAML configuration file located in the configs folder. To add a new model, dataset, or trainer, create a new configuration file in the configs folder and add the corresponding model or dataset script in src.

Dataset configuration

For each dataset, you need a configuration file in the configs/datasets folder. The file must contain the "name" parameter, which must match the name of the dataset script in src/datasets that will be used to load the dataset.
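As a rough illustration of this naming rule, the sketch below takes an already-parsed dataset configuration (for example, the dictionary obtained by loading the YAML file with a YAML parser) and derives the script path it refers to. The helper `resolve_dataset_script` and the dataset name `my_dataset` are hypothetical, and the `.py` extension is an assumption; the repository's own loading code may work differently.

```python
from pathlib import Path


def resolve_dataset_script(config: dict, src_root: str = "src") -> Path:
    """Return the dataset script path implied by a parsed dataset config.

    Assumption: a config whose name is "X" refers to src/datasets/X.py.
    This helper is illustrative and not part of the repository.
    """
    if "name" not in config:
        raise KeyError('dataset config must contain the "name" parameter')
    return Path(src_root) / "datasets" / f"{config['name']}.py"


# Hypothetical config, as it might look after parsing the YAML file.
example_config = {"name": "my_dataset"}
print(resolve_dataset_script(example_config).as_posix())
```

A check like this fails early when the "name" parameter is missing, instead of failing later when the dataset script cannot be found.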

Model configuration

For each model, you need a configuration file in the configs/models folder. The name of the file must be the same as the name of the model script in src/models that will be used to load the model. The file must contain the following parameters:

classname: <class name of the model>

tokenizer: <name of the tokenizer (we use the AutoTokenizer class from HuggingFace)>
# or
feature_extractor: <name of the feature extractor>
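The parameters above could be checked with a few lines of Python before a model is loaded. This is only a sketch: `check_model_config` is a hypothetical helper, the values `MyModel` and `my-tokenizer` are placeholders, and reading the "or" as "at least one of tokenizer or feature_extractor" is an assumption about how the configuration is interpreted.

```python
REQUIRED_KEY = "classname"
PREPROCESSOR_KEYS = ("tokenizer", "feature_extractor")


def check_model_config(config: dict) -> list:
    """Return a list of problems found in a parsed model config (empty if none)."""
    problems = []
    if REQUIRED_KEY not in config:
        problems.append('missing "classname"')
    # Assumption: at least one of the two preprocessor keys must be present.
    if not any(key in config for key in PREPROCESSOR_KEYS):
        problems.append('missing "tokenizer" or "feature_extractor"')
    return problems


# Placeholder values, not names taken from the repository.
print(check_model_config({"classname": "MyModel", "tokenizer": "my-tokenizer"}))
print(check_model_config({"tokenizer": "my-tokenizer"}))
```

Returning a list of problems rather than raising on the first one lets a caller report every missing parameter at once.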

Trainer configuration

For the trainer, you need a configuration file in the configs/trainers folder. The file must contain the following parameters:

epochs: <number of epochs>
runs_path: <path to the runs folder>
report_path: <path to the report folder>

optimizer:
    type: <type of optimizer>
    # ... parameters of the optimizer

The runs folder is where the training logs will be saved. The report folder is where the evaluation reports will be saved.
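To make the role of these parameters concrete, the sketch below validates a parsed trainer configuration and creates the runs and report folders if they do not exist yet. It is a minimal illustration under assumptions: `prepare_trainer_config` is a hypothetical helper, the values (`10`, `adam`, `lr`) are placeholders, and whether the repository creates these folders itself is not stated here.

```python
import tempfile
from pathlib import Path

REQUIRED_TRAINER_KEYS = ("epochs", "runs_path", "report_path", "optimizer")


def prepare_trainer_config(config: dict) -> dict:
    """Validate a parsed trainer config and create its output folders."""
    missing = [key for key in REQUIRED_TRAINER_KEYS if key not in config]
    if missing:
        raise KeyError(f"trainer config is missing: {', '.join(missing)}")
    if "type" not in config["optimizer"]:
        raise KeyError('trainer config is missing "optimizer.type"')
    # Training logs go to runs_path, evaluation reports to report_path.
    for key in ("runs_path", "report_path"):
        Path(config[key]).mkdir(parents=True, exist_ok=True)
    return config


# Use a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    example = {
        "epochs": 10,  # placeholder value
        "runs_path": f"{tmp}/runs",
        "report_path": f"{tmp}/reports",
        "optimizer": {"type": "adam", "lr": 1e-4},  # placeholder values
    }
    prepare_trainer_config(example)
    print(sorted(p.name for p in Path(tmp).iterdir()))
```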

Code structure

# Store configuration
./configs
    datasets/
    models/
    trainers/

# Create your own models or datasets
./src
    models/                    <= Add new models here.
        base_model.py
    datasets/                  <= Add new datasets here.
        base_dataset.py

# Run the model
./main.py
# A version of "main.py" to evaluate dialogue generation models
./main_inf.py

# A notebook to generate examples of the text-cloze models
./plot_samples.ipynb 
# A notebook to generate examples of dialogue generation models
./plot_samples_gen.ipynb

Training and evaluation

To train the model, run the following command:

python main.py
  --mode "train"
  --model               Model to run
  --dataset_config      Dataset config to use
  --trainer_config      Trainer params to use
  --dataset_dir         Dataset directory path
  --load_checkpoint     Path to model checkpoint
  --batch_size          Batch size
  --seed                Seed to use

To evaluate the model, change the --mode to "eval".
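When launching several runs, it can help to assemble the command programmatically. The sketch below builds the argument list from the flags listed above and prints it as a shell command. `build_command` is a hypothetical helper, all values are placeholders, and it is an assumption that each flag takes a single value; the sketch accepts only the two modes documented in this section and does not execute anything.

```python
import shlex

ALLOWED_OPTIONS = {
    "model", "dataset_config", "trainer_config", "dataset_dir",
    "load_checkpoint", "batch_size", "seed",
}


def build_command(mode: str, **options) -> list:
    """Build the argument list for main.py from the documented flags."""
    if mode not in ("train", "eval"):
        raise ValueError('mode must be "train" or "eval"')
    unknown = set(options) - ALLOWED_OPTIONS
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    command = ["python", "main.py", "--mode", mode]
    for name, value in options.items():
        command += [f"--{name}", str(value)]
    return command


# All values below are placeholders, not names taken from the repository.
cmd = build_command(
    "train",
    model="my_model",
    dataset_config="my_dataset",
    trainer_config="my_trainer",
    dataset_dir="datasets/COMICS",
    batch_size=16,
    seed=42,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root; switching `"train"` to `"eval"` produces the evaluation command.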

Reference

@misc{vivoli2024multimodal,
      author = {Emanuele Vivoli and Joan Lafuente Baeza and Ernest Valveny Llobet and Dimosthenis Karatzas},
       title = {Multimodal Transformer for Comics Text-Cloze}, 
        year = {2024},
         url = {https://arxiv.org/pdf/2403.03719}
}

Acknowledgments

We thank Sergi Masip Cabeza for providing the starting point of the code.
