Please share your training data so that we can test. Thanks.
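For reference, the ground-truth generation step described in the quoted message could look roughly like the sketch below. It assumes that each PNG file name is exactly the expected text (as the example `638,997.png` suggests) and that the output directory follows the tesstrain convention `data/MODEL_NAME-ground-truth`, with each image paired to a `.gt.txt` file of the same base name. The function name `write_ground_truth` and the directory names are hypothetical, not taken from the original script.

```python
from pathlib import Path
import shutil

ALLOWED = set("0123456789,")  # assumption: labels contain only digits and commas


def write_ground_truth(src_dir, dst_dir):
    """Copy each PNG into dst_dir and write a matching <stem>.gt.txt next to it."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for png in sorted(src.glob("*.png")):
        text = png.stem  # assumption: "638,997.png" -> "638,997"
        if not text or not set(text) <= ALLOWED:
            raise ValueError(f"unexpected characters in file name: {png.name}")
        if src.resolve() != dst.resolve():
            shutil.copy2(png, dst / png.name)
        gt_path = dst / f"{png.stem}.gt.txt"
        # One line of plain text per image.
        gt_path.write_text(text + "\n", encoding="utf-8")
        written.append(gt_path)
    return written
```

Calling `write_ground_truth("images", "data/swtor-ground-truth")` before `make training` would then give one transcription per image. Rejecting unexpected characters up front is a cheap way to catch a mislabeled file before it reaches training.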
On Sat, Sep 19, 2020 at 11:01 AM Gradalajage <kes...@gmail.com> wrote:

> I have 395 PNG files depicting numbers with commas. The images are 130x54
> pixels and are black text on white background. Here is an example of an
> image showing the number 638,997:
>
> [image: 638,997.png]
>
> I would like to use Tesseract to perform reliable OCR on these images and
> others like them. Out-of-the-box, Tesseract correctly extracts text for 344
> of these images, and fails in some manner on 51 of them. I am using the
> following command line for each image:
>
>     tesseract --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' {filename}.png out
>
> I run that command on each image, substituting {filename} as needed. Each
> invocation of that command produces the following output:
>
>     Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
>     Warning: Invalid resolution 0 dpi. Using 70 instead.
>
> 344/395 is an 87% success rate, but I want to try for better. So, I am
> attempting to "fine-tune" Tesseract by running through the instructions for
> tesstrain at https://github.com/tesseract-ocr/tesstrain. Each of my PNG
> files have file names that indicate ground truth, and I have a little
> script that generates ground-truth TXT files from the PNG file names. I
> have chosen "swtor" as the model name. I can then run this command from the
> tesstrain root directory:
>
>     $ make training MODEL_NAME=swtor START_MODEL=eng PSM=7
>
> This command runs, prints lots of info, and eventually produces the
> following output, just before it ends:
>
>     Finished! Error rate = 2.739
>     lstmtraining \
>     --stop_training \
>     --continue_from data/swtor/checkpoints/swtor_checkpoint \
>     --traineddata data/swtor/swtor.traineddata \
>     --model_output data/swtor.traineddata
>     Loaded file data/swtor/checkpoints/swtor_checkpoint, unpacking...
>
> I can then take the resulting swtor.traineddata file, copy it to my
> tessdata directory, and then re-run my experiment from earlier, with a
> command line that looks like this:
>
>     tesseract -l swtor --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' {filename}.png out
>
> With the new swtor model, Tesseract correctly extracts text for 64 of
> these images, and fails in some manner on 331 of them.
> 64/395 is a 16% success rate, down from 87% for the eng model.
> So, the swtor model I trained does far worse, which I find surprising and
> unexpected. I think I might be doing something wrong but do not really know
> what next steps to take to continue troubleshooting this. I'm hoping to
> post here and get help from someone knowledgeable about the training
> process.
>
> I can post the contents of the "data" directory in my tesstrain repo root
> directory if that is helpful for anyone (I'd have to remove the
> checkpoints).
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/a1ed5d91-6b2a-40c4-8eca-88cf6e7ebdd0n%40googlegroups.com.

--
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWmOLvx1A_U6q5YGdZrZNBYCxpjHUMxnK0%3DX3Y_ZLrHug%40mail.gmail.com.
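
For anyone reproducing the 344/395 versus 64/395 comparison from the quoted message, here is a minimal sketch of the per-image check. It assumes the `tesseract` binary is on the PATH, that the file name (without `.png`) is the expected text, and that the named model is installed in the tessdata directory. The helper names `run_tesseract` and `evaluate` are hypothetical. The flags mirror the ones in the quoted command lines, with `stdout` used as the output base so no `out.txt` file is written.

```python
import subprocess
from pathlib import Path


def run_tesseract(png, lang="eng"):
    """Return the text Tesseract reads from one image (needs the tesseract CLI)."""
    cmd = [
        "tesseract", str(png), "stdout",
        "-l", lang,
        "--psm", "7",
        "--oem", "1",
        "-c", "tessedit_char_whitelist=,0123456789",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def evaluate(image_dir, recognize=run_tesseract):
    """Compare OCR output with the file-name ground truth.

    Returns (failures, total), where failures is a list of (file name, OCR text).
    """
    failures = []
    total = 0
    for png in sorted(Path(image_dir).glob("*.png")):
        total += 1
        got = recognize(png)
        if got != png.stem:  # assumption: the stem is the expected text
            failures.append((png.name, got))
    return failures, total
```

Running `evaluate("images")` and then `evaluate("images", lambda p: run_tesseract(p, "swtor"))` gives the two failure lists side by side. Looking at what the fine-tuned model actually outputs for the failing files is usually more informative than the success rate alone.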