
Levenshtein OCR

The official PyTorch implementation of LevOCR (ECCV 2022).

LevOCR can perform both text sequence generation and text sequence refinement using the cross-modal fused features produced by its Vision-Language Transformer (VLT). Refinement is carried out via two basic character-level operations, deletion and insertion, which are learned with imitation learning and allow parallel decoding, dynamic length changes, and good interpretability. As a result, the inference phase of LevOCR is interpretable and transparent, which could be crucial for diagnosing and improving text recognition models in the future.
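As a rough illustration of this refinement loop, here is a minimal, runnable sketch. It is not the repository code: the deletion and insertion policies below are trivial placeholders, whereas in LevOCR they are learned heads operating on the fused vision-language features.

```python
# Conceptual sketch of LevOCR-style iterative refinement (not the repository code).
# The real deletion/insertion policies are learned with imitation learning and
# act on fused vision-language features; here they are placeholder functions.

def predict_deletions(tokens, fused_features):
    # Placeholder: keep every character (the real head scores each position).
    return [True] * len(tokens)

def predict_insertions(tokens, fused_features):
    # Placeholder: insert nothing (the real head first predicts how many
    # placeholders to insert in each slot, then fills them, all in parallel).
    return tokens

def refine(tokens, fused_features=None, max_iter=2):
    """Iteratively edit a character sequence via deletion, then insertion."""
    for _ in range(max_iter):
        keep = predict_deletions(tokens, fused_features)
        tokens = [t for t, k in zip(tokens, keep) if k]       # length may shrink
        tokens = predict_insertions(tokens, fused_features)   # length may grow
    return tokens

print(refine(list("hello")))
```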

Paper

LevOCR Model

Install requirements

  • PyTorch version >= 1.8.0
  • Python version >= 3.6
pip3 install -r requirements.txt
  • For training new models, you also need to install fairseq (parts of fairseq are used during training); a quick environment check is sketched after the install commands below
git clone https://github.com/pytorch/fairseq
cd fairseq
git checkout 0.12.2-release
pip install --editable ./
python setup.py build_ext --inplace
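
After installation, a sanity check along these lines (a sketch, not part of the repository) can confirm that the interpreter, PyTorch, and fairseq versions meet the requirements above:

```python
# Sanity-check the environment against the stated requirements.
import sys
import torch
import fairseq

assert sys.version_info >= (3, 6), "Python >= 3.6 required"
major, minor = (int(x) for x in torch.__version__.split(".")[:2])
assert (major, minor) >= (1, 8), "PyTorch >= 1.8.0 required"

print("python :", sys.version.split()[0])
print("torch  :", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("fairseq:", fairseq.__version__)
```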

Dataset

Download the LMDB datasets from Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition (ABINet).

data
├── evaluation
│   ├── CUTE80
│   ├── IC13_857
│   ├── IC15_1811
│   ├── IIIT5k_3000
│   ├── SVT
│   └── SVTP
├── training
│   ├── MJ
│   │   ├── MJ_test
│   │   ├── MJ_train
│   │   └── MJ_valid
│   ├── ST
│   └── train_language.txt

Currently, both the training and evaluation datasets are provided in LMDB format.
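
These LMDB files follow the key scheme of the deep-text-recognition-benchmark / ABINet datasets (`num-samples`, `image-%09d`, `label-%09d`, 1-based indices). Assuming that scheme, a split can be inspected with a short script like the one below (the helper name `peek` is just for illustration):

```python
# Peek into one LMDB split, assuming the deep-text-recognition-benchmark /
# ABINet key scheme: 'num-samples', 'image-%09d', 'label-%09d' (1-based).
import io
import lmdb
from PIL import Image

def peek(lmdb_path, index=1):
    env = lmdb.open(lmdb_path, readonly=True, lock=False, readahead=False, meminit=False)
    with env.begin(write=False) as txn:
        num_samples = int(txn.get(b"num-samples").decode())
        label = txn.get(("label-%09d" % index).encode()).decode("utf-8")
        image = Image.open(io.BytesIO(txn.get(("image-%09d" % index).encode())))
    print("samples:", num_samples, "| label:", label, "| image size:", image.size)

peek("data/evaluation/CUTE80")
```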

Pretrained Models

Available model weights:

| Language | Vision | LevOCR |
| --- | --- | --- |
| Pretrain-language-model | Pretrain-vision-model | LevOCR-model |

Benchmarks (top-1 accuracy, %)

The performance of the reproduced pretrained models is summarized as follows:

| Model | Iteration | IC13 | SVT | IIIT | IC15 | SVTP | CUTE | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LevOCR-VP | - | 95.8 | 92.4 | 95.4 | 84.5 | 84.6 | 88.8 | 91.2 |
| LevOCR | #1 | 96.7 | 94.2 | 96.5 | 86.1 | 88.6 | 90.6 | 92.8 |
| LevOCR | #2 | 96.7 | 94.4 | 96.6 | 86.5 | 88.8 | 90.6 | 92.9 |
| LevOCR | #3 | 96.7 | 94.4 | 96.6 | 86.5 | 88.8 | 90.6 | 92.9 |
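
The AVG column appears to be the sample-weighted mean over the six evaluation splits rather than a plain average. Using the split sizes from the directory names above (IC13 857, IC15 1811, IIIT 3000) together with the commonly used sizes for the remaining sets (SVT 647, SVTP 645, CUTE80 288, which are not stated in this README), the LevOCR-VP row checks out:

```python
# Reproduce the AVG column as a sample-weighted mean over the six benchmarks.
# IC13/IC15/IIIT sizes come from the directory names above; SVT/SVTP/CUTE80
# sizes are the commonly used test-set sizes (an assumption, not stated here).
sizes = {"IC13": 857, "SVT": 647, "IIIT": 3000, "IC15": 1811, "SVTP": 645, "CUTE": 288}
levocr_vp = {"IC13": 95.8, "SVT": 92.4, "IIIT": 95.4, "IC15": 84.5, "SVTP": 84.6, "CUTE": 88.8}

avg = sum(levocr_vp[k] * sizes[k] for k in sizes) / sum(sizes.values())
print(round(avg, 1))  # 91.2 -- matches the AVG reported for LevOCR-VP
```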

Run demo with pretrained model

  1. Download pretrained model
  2. Add image files to test into demo_imgs/
  3. Run demo_imgs.py
python3 demo_imgs.py  --imgH 32 --imgW 128  --max_iter 2 --batch_size 16 --model_dir <path_to/model.pth> --rgb --th 0.5 --demo_imgs demo_imgs 

Train

  1. Pre-train language model
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 --master_port 29501  train_language_dist.py --train_data data/training/train_language.txt \
--valInterval 5000 --lr 0.3 --saved_path <path/to/save/dir> --exp_name levocr_pretrain_language --batch_size 512 --num_iter 2400000 
  2. Train LevOCR
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 --master_port 29501 train_final_dist.py --train_data data/training \ 
--valid_data data/evaluation --select_data MJ-ST --batch_ratio 0.5-0.5  --valInterval 5000 --lr 0.3 --rgb  \
--saved_path <path/to/save/dir> --exp_name levocr_32_128 --batch_size 32 --manualSeed 21223 --seed 223 --num_iter 2400000 \
--vis_model <path/to/pretrain-vision-model.pth> --levt_model <path/to/pretrain-language-model.pth>

Test

Find the path to the best_accuracy.pth checkpoint file (usually under the saved_path directory).

python3 eval.py  --eval_data data/evaluation --data_filtering_off --fast_acc --imgH 32 --imgW 128 --batch_size 128 --rgb --th 0.5 --max_iter 2 --model_dir <path_to/best_accuracy.pth>

Iterative Process

The detailed iterative refinement process of LevOCR with different initial sequences on the six public benchmarks is shown below.

Process

Acknowledgements

This implementation is based on the following repositories: fairseq, CLOVA AI Deep Text Recognition Benchmark, and ABINet.

Citation

If you find this work useful, please cite:

@inproceedings{ECCV2022LevOCR,
  title={Levenshtein OCR},
  author={Cheng Da and Peng Wang and Cong Yao},
  booktitle={ECCV},
  year={2022}
}

License

LevOCR is released under the terms of the Apache License, Version 2.0.

LevOCR is an algorithm for scene text recognition. The code and models herein, created by the authors from Alibaba, may only be used for research purposes.
Copyright (C) 1999-2022 Alibaba Group Holding Ltd. 

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.