Re: [tesseract-ocr] Re: Tesseract mistakes letters for numbers

Eric Hodges Thu, 12 Aug 2021 06:25:53 -0700

Thanks for your input, but we can't train Tesseract for any fonts. We are
using it for mail that comes from thousands of sources. We have no control
over which fonts are used.


We were able to improve results (from 8% success to 87%) by running
Tesseract multiple times. One pass looked for letters, one for digits, one
for punctuation. If we knew the format the word might take, we could improve
accuracy that way. But we found no good solution for mixed letters and
digits when we don't know the format.
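
A minimal sketch of the multi-pass idea, for anyone who wants to try it.
Everything here is illustrative rather than our production code: the
punctuation set, the --psm 8 (single word) setting, the helper names, and
the "L"/"D" format string are all assumptions. The merge step only works
when every pass returns the same number of characters, which restricted
passes do not guarantee, so it returns None otherwise. Note also that
tessedit_char_whitelist is reported not to work with the LSTM engine in
Tesseract 4.0; it works with the legacy engine and again from 4.1.

```python
import subprocess

# One restricted character set per pass. The punctuation set is a guess.
PASSES = {
    "letters": "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "digits": "0123456789",
    "punctuation": ".,-/#&",
}


def build_command(image_path, whitelist, psm=8):
    """Return the tesseract CLI arguments for one restricted pass."""
    return [
        "tesseract", image_path, "stdout",
        "--psm", str(psm),
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


def run_passes(image_path):
    """Run tesseract once per character class and collect the raw text."""
    results = {}
    for name, whitelist in PASSES.items():
        proc = subprocess.run(
            build_command(image_path, whitelist),
            capture_output=True, text=True, check=True,
        )
        results[name] = proc.stdout.strip()
    return results


def merge_by_format(results, fmt):
    """Combine passes using a known format such as "LLDDDD".

    'L' takes the character from the letters pass, 'D' from the digits
    pass. Returns None when the passes do not line up with the format.
    """
    letters, digits = results["letters"], results["digits"]
    if not (len(letters) == len(digits) == len(fmt)):
        return None
    picked = []
    for i, kind in enumerate(fmt):
        if kind == "L":
            picked.append(letters[i])
        elif kind == "D":
            picked.append(digits[i])
        else:
            raise ValueError(f"unknown format code: {kind!r}")
    return "".join(picked)
```

With a known format, merge_by_format(run_passes("word.png"), "LLDDDD")
would take the first two characters from the letters pass and the rest
from the digits pass. It does nothing for the case where the format is
unknown, which is the part we could not solve.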

On Wed, Aug 11, 2021 at 11:51 PM Ajinkya Bobade <[email protected]>
wrote:

> Hello,
>
> To do this you will need to retrain Tesseract on top of the model that you
> currently use. The current model is not trained on this specific font, so
> it approximates the digit. Take a few samples of the format that you need
> and retrain on top of the original weights. If you have more questions,
> feel free to email me.
>
> Regards
> Ajinkya
> Creator of AI Scanner https://imagescanner-online.com/
>
> On Thursday, 22 July 2021 at 00:07:15 UTC+5:30 [email protected]
> wrote:
>
>> Update:
>>
>> I discovered the command line option:
>>
>>     -c load_number_dawg=0
>>
>> That did not improve my results.
>>
>> On Wednesday, July 21, 2021 at 1:07:15 PM UTC-5 Eric Hodges wrote:
>>
>>> I need some help. I have a bunch of images of text like this:
>>>
>>> [image: sample_si.jpg]
>>> They are all 200 dpi, black and white images. In over 50% of the cases,
>>> Tesseract confuses the "SI" at the front for digits. Most of them are "51",
>>> but some are "81" or "31".
>>>
>>> I've tried tweaking all of the settings I can find, but none of them
>>> improve the results. I'm currently using a config file like this:
>>>
>>> tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
>>>
>>> Interesting fact: If I cut off the digits and only send the alphas to
>>> Tesseract, it recognizes them correctly. Is there something in Tesseract
>>> that makes it less likely to mix letters and numbers in a single word?
>>>
>>> Any suggestions?
>>>


-- 

Eric Hodges
Sr. Product Engineer
[email protected]
O: 281-504-8165
U.S. Dataworks <http://www.usdataworks.com/>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAKusfpPb6ock%2Bnx%2B1bqaQs-FZ_iOUJMAbV_hjt5sHkFfOscnoA%40mail.gmail.com.
