Jatikarok and BanglaPRCorpus

Advancing Bangla Punctuation Restoration by a Monolingual Transformer-Based Method and a Large-Scale Corpus
[Accepted at EMNLP 2023 Workshop BLP, Paper — Link will be updated]

Jatikarok in a Nutshell

BanglaPRCorpus Statistics

The Bangla punctuation restoration corpus, BanglaPRCorpus, consists of 1.48 million source-target pairs. In each pair, the source sentence has punctuation missing, and the target sentence is the corrected version with the missing punctuation restored. Source sentences are produced by systematically removing punctuation marks, between 1 and 10 per sentence. Sentence lengths in the corpus vary: the shortest sentence has 2 words, the longest has 127 words, and the average length is 12.9 words.
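The pair construction described above can be sketched roughly as follows. This is only an illustration, not the script used to build BanglaPRCorpus: the punctuation set, the random sampling strategy, and the names `PUNCTUATION`, `make_pair`, and `length_stats` are all assumptions made for this example.

```python
import random

# Assumed punctuation set for illustration; the set used for the corpus may differ.
PUNCTUATION = set("।,?!;:")


def make_pair(target, rng, max_removed=10):
    """Return (source, target), where source is target with 1..max_removed
    punctuation marks removed at randomly chosen positions."""
    positions = [i for i, ch in enumerate(target) if ch in PUNCTUATION]
    if not positions:
        # Nothing to remove: the pair is the sentence itself.
        return target, target
    k = rng.randint(1, min(max_removed, len(positions)))
    drop = set(rng.sample(positions, k))
    source = "".join(ch for i, ch in enumerate(target) if i not in drop)
    return source, target


def length_stats(sentences):
    """Return (min, max, mean) sentence length in whitespace-separated words."""
    lengths = [len(s.split()) for s in sentences]
    return min(lengths), max(lengths), sum(lengths) / len(lengths)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same pair
    source, target = make_pair("তুমি কি আসবে? হ্যাঁ, আমি আসব।", rng)
    print(source)
    print(target)
```

Passing an explicit `random.Random` instance keeps the removal reproducible, which helps when regenerating a corpus of this kind.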

Get Started

Clone the GitHub repository of the paper.

git clone https://github.com/mehedihasanbijoy/Jatikarok-and-BanglaPRCorpus.git

Alternatively, you can manually download and extract the GitHub repository of Jatikarok-and-BanglaPRCorpus.

Environment Setup

Install the required packages.

conda env create -f requirements.yml

Afterward, activate the virtual environment and navigate to the repository directory.

conda activate jatikarok
cd Jatikarok-and-BanglaPRCorpus

Download the BanglaPRCorpus

gdown "https://drive.google.com/drive/folders/1V1OrkJ4okSgw5swmhrbXAZFqkDB8g7QX?usp=share_link" -O ./BanglaPRCorpus/BanglaPRCorpus/ --folder

Alternatively, download the folder manually from the Google Drive link and place the extracted files in ./BanglaPRCorpus/BanglaPRCorpus/
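After downloading, a quick sanity check can confirm that the corpus file is in place. The sketch below assumes the file is the corpus.csv used in the training command and that its first row is a header; the helper name `summarize_csv` is hypothetical, and no column names are assumed.

```python
import csv
from pathlib import Path


def summarize_csv(path):
    """Return (header, row_count) for a CSV file, reading it row by row."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])  # assumes the first row is a header
        n_rows = sum(1 for _ in reader)
    return header, n_rows


if __name__ == "__main__":
    corpus = Path("./BanglaPRCorpus/BanglaPRCorpus/corpus.csv")
    if corpus.exists():
        header, n_rows = summarize_csv(corpus)
        print("columns:", header)
        print("rows:", n_rows)
    else:
        print(f"{corpus} not found; check the download step")
```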

Regenerate the BanglaPRCorpus

Go to the ./BanglaPRCorpus directory and follow the instructions there.

Train/Validate/Evaluate the Model

The experiments in this paper involve benchmarking three methods (Jatikarok, BanglaT5, and T5 Small) on three corpora (BanglaPRCorpus, ProthomAloBalanced, and BanglaOPUS).

To train, validate, and evaluate Jatikarok on BanglaPRCorpus

python main.py --CORPUS_PATH "./BanglaPRCorpus/BanglaPRCorpus/corpus.csv" --KNOWLEDGE_PATH "./KnowledgeToBeTransferred/gecJatikarok.pth" --CHECKPOINT_PATH "./ModelCheckpoints/prJatikarok.pth" --MODEL_NAME "jatikarok" --BATCH_SIZE 16 --N_EPOCHS 50
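For repeated runs, the long command above can be easier to manage when its flags are kept in one place. The following is a minimal sketch that rebuilds the same command from a dictionary. The flag names and values are copied from the command above; the helper `build_command` and the `CONFIG` dictionary are hypothetical and not part of the repository.

```python
import sys

# Flag names and values copied from the training command above.
CONFIG = {
    "CORPUS_PATH": "./BanglaPRCorpus/BanglaPRCorpus/corpus.csv",
    "KNOWLEDGE_PATH": "./KnowledgeToBeTransferred/gecJatikarok.pth",
    "CHECKPOINT_PATH": "./ModelCheckpoints/prJatikarok.pth",
    "MODEL_NAME": "jatikarok",
    "BATCH_SIZE": 16,
    "N_EPOCHS": 50,
}


def build_command(config, script="main.py"):
    """Build an argument list: the current Python interpreter, the script,
    then a --FLAG VALUE pair for each entry in config."""
    cmd = [sys.executable, script]
    for flag, value in config.items():
        cmd += [f"--{flag}", str(value)]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so this sketch has no side effects.
    # To launch training, pass the list to subprocess.run(..., check=True).
    print(" ".join(build_command(CONFIG)))
```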

Benchmarking Bangla Punctuation Restoration Task
