I have 395 PNG files depicting numbers with commas. The images are 130x54 
pixels, with black text on a white background. Here is an example of an 
image showing the number 638,997:
[image: 638,997.png]
I would like to use Tesseract to perform reliable OCR on these images and 
others like them. Out of the box, Tesseract correctly extracts text for 344 
of these images, and fails in some manner on 51 of them. I am using the 
following command line for each image:

> tesseract --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' {filename}.png out

I run that command on each image, substituting {filename} as needed. Each 
invocation of that command produces the following output:

Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
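
The per-image run and the success count described above can be sketched in 
Python as follows. This is only an illustration, not the exact script I 
used: the function names are made up, it assumes the tesseract binary is on 
PATH, it writes the recognized text to stdout instead of an "out" file, and 
it assumes a result counts as correct only when it exactly matches the file 
name without the .png extension.

```python
import subprocess
from pathlib import Path

# Same character whitelist as in the command line above.
WHITELIST = ",0123456789"


def build_command(image, lang=None):
    """Return the tesseract argv for one line image; text goes to stdout."""
    cmd = [
        "tesseract", str(image), "stdout",
        "--psm", "7", "--oem", "1",
        "-c", f"tessedit_char_whitelist={WHITELIST}",
    ]
    if lang:
        # Optional model name, e.g. a fine-tuned traineddata file.
        cmd += ["-l", lang]
    return cmd


def recognize(image, lang=None):
    """Run tesseract on one image and return the stripped text."""
    result = subprocess.run(
        build_command(image, lang), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def accuracy(pairs):
    """Count exact matches in an iterable of (expected, actual) pairs."""
    pairs = list(pairs)
    correct = sum(1 for expected, actual in pairs if expected == actual)
    return correct, len(pairs)


def evaluate(image_dir, lang=None):
    """Score every PNG in image_dir against the number in its file name."""
    images = sorted(Path(image_dir).glob("*.png"))
    return accuracy((p.stem, recognize(p, lang)) for p in images)
```

Calling evaluate() on the image directory returns a (correct, total) pair, 
which is where counts like the ones below come from.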

344/395 is an 87% success rate, but I want to try for better. So, I am 
attempting to "fine-tune" Tesseract by running through the instructions for 
tesstrain at https://github.com/tesseract-ocr/tesstrain. Each of my PNG 
files has a file name that indicates its ground truth, and I have a little 
script that generates ground-truth TXT files from the PNG file names. I 
have chosen "swtor" as the model name. I can then run this command from the 
tesstrain root directory:

$ make training MODEL_NAME=swtor START_MODEL=eng PSM=7

This command runs, prints lots of info, and eventually produces the 
following output, just before it ends:

Finished! Error rate = 2.739
lstmtraining \
--stop_training \
--continue_from data/swtor/checkpoints/swtor_checkpoint \
--traineddata data/swtor/swtor.traineddata \
--model_output data/swtor.traineddata
Loaded file data/swtor/checkpoints/swtor_checkpoint, unpacking...

I can then take the resulting swtor.traineddata file, copy it to my 
tessdata directory, and then re-run my experiment from earlier, with a 
command line that looks like this:

> tesseract -l swtor --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' {filename}.png out

With the new swtor model, Tesseract correctly extracts text for 64 of these 
images, and fails in some manner on 331 of them.
64/395 is a 16% success rate, down from 87% for the eng model.
So, the swtor model I trained does far worse, which I find surprising. I 
suspect I am doing something wrong in the training step, but I do not know 
what to try next to troubleshoot it. I'm posting here in the hope of getting 
help from someone knowledgeable about the training process.

I can post the contents of the "data" directory in my tesstrain repo root 
directory if that is helpful for anyone (I'd have to remove the 
checkpoints).
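
For anyone trying to reproduce this, the ground-truth generation step could 
look roughly like the sketch below. It is not my actual script: the function 
name is hypothetical, and it assumes the tesstrain convention of a 
transcription file named like the image but with the extension replaced by 
.gt.txt, stored next to the PNG (for this model, presumably under 
data/swtor-ground-truth).

```python
from pathlib import Path


def write_ground_truth(image_dir):
    """Write one <stem>.gt.txt per PNG, taking the text from the file name."""
    written = []
    for image in sorted(Path(image_dir).glob("*.png")):
        # "638,997.png" -> "638,997.gt.txt" containing the single line "638,997".
        gt_path = image.with_name(image.stem + ".gt.txt")
        gt_path.write_text(image.stem + "\n", encoding="utf-8")
        written.append(gt_path)
    return written
```

Each transcription is a single line of text, so an image named 638,997.png 
ends up paired with a 638,997.gt.txt file holding just that number.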
