Fine Tuning Tesseract on Windows

Install Tesseract OCR and tools

Download the installer from this repo and make sure `tesseract` is in your PATH.
Install Python 3.x.
Install Git.
Install winget (Windows Package Manager), then run `winget install ezwinports.make` and `winget install wget`.
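
A quick way to confirm that the tools listed above are reachable is a small Python check. This is a minimal sketch rather than part of the original setup: the tool list and the helper name `find_missing_tools` are assumptions chosen for illustration, and the check only looks for each executable on PATH, not its version.

```python
import shutil

# Executables this guide expects on PATH (an assumed list based on the steps above).
REQUIRED_TOOLS = ["tesseract", "python", "git", "make", "wget"]


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that `which` cannot locate on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

Run it in the same shell you plan to use for training (Git Bash), since PATH can differ between shells.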

Clone repo `tesstrain`

git clone https://github.com/tesseract-ocr/tesstrain.git

Then clone the trained model repo:

git clone https://github.com/tesseract-ocr/tessdata_best.git

Also clone langdata, because we will need the wordlist, punctuation and number files:

git clone https://github.com/tesseract-ocr/langdata

Instead of cloning the whole tessdata_best repo, you can download only the trained model you want to fine-tune and place it in a folder named tessdata_best.

cd tesstrain

Create virtual environment

Create the virtual environment:

python -m venv .venv

Activate virtual environment

.venv\scripts\activate

Install requirements

python -m pip install -r requirements.txt

Create Folders

Create data folder

Open Git Bash and cd into the tesstrain directory.

Note: the following step is NOT needed and can be skipped:

make tesseract-langdata

~~It creates a langdata folder inside the data folder, containing a number of unicharset files.~~

Ground Truth Data

Tesseract 5 is trained on single lines of text, so for each line we need to provide an image of the line (PNG or TIFF) and a text file with the content of that image. You can find a ZIP file, ocrd-testset.zip, with some ground truth data that we can use for fine-tuning.

Unzip the file into a folder inside the data folder, named after the model you are going to create followed by -ground-truth.

For example: lft-ground-truth
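
Since every line image needs a matching text file, it can be worth checking the folder for unpaired images before training. The sketch below assumes the `.gt.txt` naming convention used by tesstrain (for example `line1.tif` paired with `line1.gt.txt`), which this guide does not spell out. The helper name `unpaired_images` is hypothetical, and only plain `.png` and `.tif` extensions are handled.

```python
from pathlib import Path

# Image extensions mentioned in this guide.
IMAGE_SUFFIXES = (".png", ".tif")


def unpaired_images(filenames):
    """Return image file names that have no matching '<stem>.gt.txt' in the same list."""
    names = set(filenames)
    missing = []
    for name in sorted(names):
        for suffix in IMAGE_SUFFIXES:
            if name.lower().endswith(suffix):
                stem = name[: -len(suffix)]
                # Assumed convention: transcription is '<stem>.gt.txt'.
                if stem + ".gt.txt" not in names:
                    missing.append(name)
                break
    return missing


if __name__ == "__main__":
    # Folder name follows the example used in this guide.
    folder = Path("data/lft-ground-truth")
    if folder.is_dir():
        for name in unpaired_images(p.name for p in folder.iterdir()):
            print("No transcription found for", name)
```

Run it from the tesstrain directory. It prints nothing when every image has a transcription.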

Fine tuning

Use this command to start the training process:

make training MODEL_NAME=lft START_MODEL=eng TESSDATA=../tessdata_best LANGDATA_DIR=../langdata LEARNING_RATE=0.001 RATIO_TRAIN=0.80 MAX_ITERATIONS=5000

The fine-tuning process will start, and a folder named after your model will be created inside the data folder.

In the ground truth folder lft-ground-truth, a box file will be created for each image.
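
If the training command fails early, a common cause is a wrong relative path. The following sketch checks the inputs named in that command before it is launched. It assumes the start model is stored as `<START_MODEL>.traineddata` inside the `TESSDATA` folder, and it only checks that the `LANGDATA_DIR` folder exists, not its contents. The helper names `start_model_path` and `missing_inputs` are hypothetical.

```python
from pathlib import Path


def start_model_path(tessdata_dir, start_model):
    """Build the expected path of the start model, e.g. ../tessdata_best/eng.traineddata."""
    return Path(tessdata_dir) / f"{start_model}.traineddata"


def missing_inputs(tessdata_dir, start_model, langdata_dir):
    """Return the expected input paths that do not exist on disk."""
    candidates = [
        start_model_path(tessdata_dir, start_model),
        Path(langdata_dir),
    ]
    return [path for path in candidates if not path.exists()]


if __name__ == "__main__":
    # Values taken from the training command in this guide; run from the tesstrain directory.
    for path in missing_inputs("../tessdata_best", "eng", "../langdata"):
        print("Missing:", path)
```

No output means both paths were found relative to the current directory.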

At the end of the training process you can build the traineddata files and plot the training results:

make traineddata MODEL_NAME=lft

make plot MODEL_NAME=lft
