Thanks for your input, but we can't train Tesseract for specific fonts. We use it on mail that comes from thousands of sources, so we have no control over which fonts are used.
We were able to improve results (from 8% success to 87%) by running Tesseract multiple times. One pass looked for letters, one for digits, one for punctuation. If we knew the format the word might take, we could improve accuracy that way. But we found no good solution for mixed letters and digits when we don't know the format.

On Wed, Aug 11, 2021 at 11:51 PM Ajinkya Bobade <ajinkyabobad...@gmail.com> wrote:

> Hello,
>
> To do this you will need to retrain Tesseract on top of the model that you
> currently use. The current model that you use is not trained on this
> specific font, so it approximates the digit, take few samples of the format
> that you need and retrain it on top of original weights. If you have more
> questions feel free to email me.
>
> Regards
> Ajinkya
> Creator of AI Scanner https://imagescanner-online.com/
>
> On Thursday, 22 July 2021 at 00:07:15 UTC+5:30 eho...@usdataworks.com
> wrote:
>
>> Update:
>>
>> I discovered the command line option:
>>
>> -c load_number_dawg=0
>>
>> That did not improve my results.
>>
>> On Wednesday, July 21, 2021 at 1:07:15 PM UTC-5 Eric Hodges wrote:
>>
>>> I need some help. I have a bunch of images of text like this:
>>>
>>> [image: sample_si.jpg]
>>> They are all 200 dpi, black and white images. In over 50% of the cases,
>>> Tesseract confuses the "SI" at the front for digits. Most of them are "51",
>>> but some are "81" or "31".
>>>
>>> I've tried tweaking all of the settings I can find, but none of them
>>> improve the results. I'm currently using a config file like this:
>>>
>>> tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
>>>
>>> Interesting fact: If I cut off the digits and only send the alphas to
>>> Tesseract, it recognizes them correctly. Is there something in Tesseract
>>> that makes it less likely to mix letters and numbers in a single word?
>>>
>>> Any suggestions?
>>>
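For anyone who wants to try the multi-pass idea described at the top of this message, here is a minimal sketch in Python. It is an illustration under assumptions, not the code we actually run: the names `PASSES`, `build_command`, `run_pass` and `merge_by_format` are hypothetical, the `--psm 8` setting assumes each image holds a single word, and the position-by-position merge is just one possible way to use a known format. A punctuation pass could be added the same way with its own whitelist. Whether `tessedit_char_whitelist` is honored may depend on your Tesseract version and engine mode.

```python
import subprocess

# One whitelist per pass, mirroring the letters and digits passes.
PASSES = {
    "letters": "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "digits": "0123456789",
}


def build_command(image_path, whitelist):
    """Build a tesseract CLI call whose output is restricted to `whitelist`."""
    return [
        "tesseract", image_path, "stdout",
        "--psm", "8",  # assumption: the image contains a single word
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


def run_pass(image_path, whitelist):
    """Run one restricted pass and return the recognized text."""
    result = subprocess.run(
        build_command(image_path, whitelist),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def merge_by_format(outputs, fmt):
    """Combine per-pass outputs using a known format such as "LLDDDDD".

    "L" takes the character at that position from the letters pass and
    "D" takes it from the digits pass (hypothetical format notation).
    Returns None if a needed pass did not produce len(fmt) characters.
    """
    source = {"L": "letters", "D": "digits"}
    if any(len(outputs[source[c]]) != len(fmt) for c in set(fmt)):
        return None
    return "".join(outputs[source[c]][i] for i, c in enumerate(fmt))
```

To use it, collect one output per pass with `outputs = {name: run_pass("word.png", wl) for name, wl in PASSES.items()}` and then call `merge_by_format(outputs, "LLDDDDD")`. This only helps when the format is known in advance; it does nothing for the unknown-format case.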
--
Eric Hodges
Sr. Product Engineer
ehod...@usdataworks.com
O: 281-504-8165
U.S. Dataworks <http://www.usdataworks.com/>