I have 395 PNG files depicting numbers with commas. The images are 130x54 pixels and show black text on a white background. Here is an example of an image showing the number 638,997: [image: 638,997.png]

I would like to use Tesseract to perform reliable OCR on these images and others like them. Out of the box, Tesseract correctly extracts the text for 344 of these images and fails in some manner on the other 51. I am using the following command line for each image:

> tesseract --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' {filename}.png out

I run that command on each image, substituting {filename} as needed. Each invocation produces the following output:

Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.

344/395 is an 87% success rate, but I want to try for better. So, I am attempting to "fine-tune" Tesseract by following the instructions for tesstrain at https://github.com/tesseract-ocr/tesstrain.

Each of my PNG files has a file name that indicates the ground truth, and I have a little script that generates ground-truth TXT files from the PNG file names. I have chosen "swtor" as the model name. I can then run this command from the tesstrain root directory:

$ make training MODEL_NAME=swtor START_MODEL=eng PSM=7

This command runs, prints lots of info, and eventually produces the following output just before it ends:

Finished! Error rate = 2.739
lstmtraining \
--stop_training \
--continue_from data/swtor/checkpoints/swtor_checkpoint \
--traineddata data/swtor/swtor.traineddata \
--model_output data/swtor.traineddata
Loaded file data/swtor/checkpoints/swtor_checkpoint, unpacking...

I can then take the resulting swtor.traineddata file, copy it to my tessdata directory, and re-run my earlier experiment with a command line that looks like this:

> tesseract -l swtor --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' {filename}.png out

With the new swtor model, Tesseract correctly extracts the text for 64 of these images and fails in some manner on 331 of them. 64/395 is a 16% success rate, down from 87% for the eng model.

So the swtor model I trained does far worse than the stock model, which I did not expect. I think I might be doing something wrong, but I do not know what next steps to take to troubleshoot this. I'm posting here in the hope of getting help from someone knowledgeable about the training process.

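In case it helps anyone reproduce the counts above, here is a minimal Python sketch of the workflow described: deriving ground truth from the file names, writing the ground-truth files, and counting exact matches. It is not my exact script. The function names are hypothetical, the `.gt.txt` suffix is an assumption based on the convention described in the tesstrain README, and the Tesseract call assumes the `tesseract` binary is on the PATH and that `stdout` is accepted as the output base.

```python
from __future__ import annotations

import subprocess
from pathlib import Path

WHITELIST = ",0123456789"


def truth_from_name(png: Path) -> str:
    """Ground truth is the file name without its extension, e.g. '638,997'."""
    return png.stem


def write_ground_truth(image_dir: Path) -> int:
    """Write <name>.gt.txt next to every <name>.png; return how many were written.

    Assumption: tesstrain looks for files ending in .gt.txt beside each image.
    """
    count = 0
    for png in sorted(image_dir.glob("*.png")):
        gt_path = png.with_suffix(".gt.txt")
        gt_path.write_text(truth_from_name(png) + "\n", encoding="utf-8")
        count += 1
    return count


def run_tesseract(png: Path, lang: str = "eng") -> str:
    """Run Tesseract on one image with the same options as in the post."""
    cmd = [
        "tesseract", str(png), "stdout",
        "-l", lang, "--psm", "7", "--oem", "1",
        "-c", f"tessedit_char_whitelist={WHITELIST}",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def accuracy(pairs: list[tuple[str, str]]) -> tuple[int, int]:
    """Given (truth, ocr_output) pairs, return (exact matches, total)."""
    correct = sum(1 for truth, got in pairs if truth == got)
    return correct, len(pairs)


def evaluate(image_dir: Path, lang: str = "eng") -> tuple[int, int]:
    """Score every PNG in image_dir against the truth encoded in its name."""
    pairs = [
        (truth_from_name(png), run_tesseract(png, lang))
        for png in sorted(image_dir.glob("*.png"))
    ]
    return accuracy(pairs)
```

Calling `evaluate(Path("images"), "eng")` and then `evaluate(Path("images"), "swtor")` on the same directory (the directory name here is only a placeholder) would give the two counts to compare. Only exact string matches are counted as correct.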
I can post the contents of the "data" directory in my tesstrain repo root directory if that is helpful for anyone (I'd have to remove the checkpoints).