- Download the setup from this repo and make sure tesseract is in your PATH.
- Install Python version 3.x
- Install Git
- Install winget (Windows Package Manager) and then run:
winget install ezwinports.make
winget install wget
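Before moving on, it can help to confirm that the tools installed above are actually reachable from the command line. The following is a minimal sketch, not part of tesstrain: the tool list is taken from the steps above, and the helper name `find_missing_tools` is hypothetical.

```python
import shutil

# Tools this guide relies on (list assumed from the installation steps above).
REQUIRED_TOOLS = ["tesseract", "python", "git", "make", "wget"]


def find_missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

If a tool is reported as missing right after installation, opening a new terminal usually picks up the updated PATH.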
Clone the tesstrain repo:
git clone https://github.com/tesseract-ocr/tesstrain.git
Clone the trained model repo:
git clone https://github.com/tesseract-ocr/tessdata_best.git
Also clone the langdata repo, because we will need the wordlist, punctuation and number files:
git clone https://github.com/tesseract-ocr/langdata
Instead of cloning the whole tessdata_best repo, you can download only the trained model you want to fine-tune into a folder named tessdata_best.
cd tesstrain
- Create a virtual environment:
python -m venv .venv
- Activate the virtual environment:
.venv\scripts\activate
- Install the requirements:
python -m pip install -r requirements.txt
- Create the data folder.
Open Git Bash, cd into the tesstrain directory and run this:
>>> DO NOT NEED THIS
make tesseract-langdata
This creates a langdata folder inside the data folder, containing a number of unicharset files.
Tesseract 5 trains on single lines of text, so we need to provide an image of each line (PNG or TIF) and a text file with the content of that image.
You can find a ZIP file, ocrd-testset.zip, with some ground truth data we can use for fine-tuning.
- Unzip the file into a folder inside the data folder. Name the folder after the model you are going to create, followed by -ground-truth, e.g. lft-ground-truth
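Since training needs both an image and a transcription for every line, a quick check for unpaired files can save a failed run. The sketch below assumes the tesstrain naming convention in which the transcription for line.tif or line.png is called line.gt.txt, and it assumes plain .png/.tif image names. The helper `find_unpaired` and the folder path are illustrative, not part of tesstrain.

```python
from pathlib import Path

IMAGE_SUFFIXES = (".png", ".tif")  # image types mentioned in this guide
TEXT_SUFFIX = ".gt.txt"            # assumed transcription suffix (tesstrain convention)


def find_unpaired(file_names):
    """Return (images without a transcription, transcriptions without an image)."""
    images, texts = set(), set()
    for name in file_names:
        lower = name.lower()
        if lower.endswith(TEXT_SUFFIX):
            texts.add(name[: -len(TEXT_SUFFIX)])
        elif lower.endswith(IMAGE_SUFFIXES):
            images.add(Path(name).stem)
    return sorted(images - texts), sorted(texts - images)


if __name__ == "__main__":
    folder = Path("data/lft-ground-truth")  # example folder name from this guide
    if folder.is_dir():
        no_text, no_image = find_unpaired(p.name for p in folder.iterdir())
        print("Images without a .gt.txt file:", no_text)
        print(".gt.txt files without an image:", no_image)
    else:
        print(f"{folder} not found; run this from the tesstrain directory.")
```

Both lists should be empty before training starts.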
Use this command to start the training process:
make training MODEL_NAME=lft START_MODEL=eng TESSDATA=../tessdata_best LANGDATA_DIR=../langdata LEARNING_RATE=0.001 RATIO_TRAIN=0.80 MAX_ITERATIONS=5000
The fine-tuning process will start, and a folder named after your model will be created inside the data folder.
In the ground truth folder, lft-ground-truth, a box file will be created for each image.
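To get a feel for what RATIO_TRAIN=0.80 in the training command means for your data, the following sketch estimates how many ground truth lines end up in the training set and how many are held back for evaluation. The rounding used here (truncation) is an assumption, so the exact counts tesstrain produces may differ by one. The function name `split_counts` is hypothetical.

```python
def split_counts(total_lines, ratio_train):
    """Estimate (training lines, evaluation lines) for a given RATIO_TRAIN.

    Assumption: the training share is truncated to a whole number of lines.
    """
    if not 0.0 < ratio_train < 1.0:
        raise ValueError("ratio_train must be between 0 and 1")
    train = int(total_lines * ratio_train)
    return train, total_lines - train


if __name__ == "__main__":
    train, evaluation = split_counts(100, 0.80)
    print(f"training lines: {train}, evaluation lines: {evaluation}")
```

With very small ground truth sets the evaluation share can shrink to only a handful of lines, which makes the reported error rates noisy.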
At the end of the training process you can build the traineddata file and plot the training results:
make traineddata MODEL_NAME=lft
make plot MODEL_NAME=lft