LLaVA OCR Script (ocr_Llava.py)
This script utilizes the LLaVA library to process images using a pre-trained model. It automates the process of extracting text from images and interacts with the LLaVA terminal interface. The script assumes a specific workflow for processing a list of images.
Before running the script, ensure the following dependencies are installed:
# Clone the LLaVA repository
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
# Create and activate a virtual environment
conda create -n llava python=3.10 -y
conda activate llava
# Install required packages
pip install --upgrade pip
pip install -e .
pip install protobuf
pip install --upgrade transformers
#If using llava-v1.6-vicuna-7b, additional steps may be required:
# Navigate to the cloned LLaVA repository
cd LLaVA
# Update the repository
git pull
# Reinstall dependencies
pip install -e .
pip uninstall psutil
pip install psutil
-
Place the script in the 'LLaVA\llava\serve' folder.
-
Update the
image_dir
variable in the script to the path containing the images to be processed. -
Run the following command in the terminal:
python -m llava.serve.tcli --model-path liuhaotian/llava-v1.6-vicuna-7b --load-4bit
Note: If errors occur, consider changing time.sleep(35) to time.sleep(40).
-
The script collects a list of images from the specified directory and sorts them.
-
It starts the LLaVA terminal command using
subprocess.Popen
. -
For each image in the list:
a. It waits for the 'Image path:' prompt, then sends the image path to the LLaVA terminal.
b. It waits for the 'USER:' prompt and sends the command 'Extract text in the image.'
c. It waits for 35 seconds to allow processing.
d. It prints and saves the 'ASSISTANT:' output.
e. It waits for 2 seconds before moving to the next image.
-
The script logs the progress in the terminal and saves it to the 'output_llava.txt' file.
-
The process is terminated after processing all images, and the total time is displayed.
Note: The script assumes the LLaVA terminal interface follows specific prompts ('Image path:', 'USER:', 'ASSISTANT:'). Adjustments may be needed based on updates to LLaVA.
The prompt used to create the llava script can be found in the jupyter notebook Prompt_GPT4_Llava.ipynb
Image Text Extraction with Pytesseract (ocr_pytesseract.py)
The script is designed to process a folder of images, extract text from each image, rotate the image at 90-degree intervals up to 270 degrees, and extract text from the rotated images. The results are then compiled into a Microsoft Word document, with images, names, rotations, and extracted texts.
Ensure the following packages are installed before running the script:
opencv-python
: Image processing librarypytesseract
: OCR (Optical Character Recognition) toolPillow
: Image processing library for opening, manipulating, and saving many different image file formatstqdm
: Progress bar for iterationpython-docx
: Library for creating and updating Word documents- All the requirements can be found in the file requirements.txt and the sudo commands that should be installed can be found in this file
- Additional system dependencies and commands are provided in the comments at the beginning of the script.
- These include updating the system, installing necessary libraries, setting up Tesseract OCR, and installing required Python packages.
- Takes an image and a rotation angle in degrees.
- Applies rotation to the image using OpenCV's
warpAffine
function. - Returns the rotated image.
- Takes an image and a scale factor.
- Resizes the image while maintaining the original aspect ratio.
- Returns the resized image.
- Uses pytesseract to extract text from an image.
- Returns the extracted text.
- Takes a Word document (
doc
), image path, image name, extracted text, rotated image, and rotation angle. - Sanitizes text to remove problematic characters.
- Adds a paragraph to the document with image name, rotation angle, and sanitized text.
- Resizes the rotated image, converts it to a format suitable for Word (
PIL
toBytesIO
), and adds it to the document. - Adds a line break between images.
- Accepts a folder path as a command-line argument.
- Verifies the existence of the folder.
- Gets a list of image files in the folder.
- Initializes a Word document.
- Iterates through each image, rotating it at 90-degree intervals and extracting text.
- Saves the Word document with processed information on the desktop.
- The script can be executed from the command line with the folder path as an argument.
- Example:
python script.py /path/to/image/folder
- The script may display an error message related to XML compatibility. It handles this issue by modifying the
add_image_to_doc
function. - The script saves the Word document on the desktop with the folder name and "_word.docx" appended.
- Images are not cropped to fit the original dimensions but are adapted to their new size when added to the Word document.
- The script provides a progress bar using
tqdm
. - The execution time is printed at the end of the script.