Re: [tesseract-ocr] inaccuracy in plane text

Ger Hobbelt Sun, 24 Dec 2023 19:40:48 -0800

On Sat, 23 Dec 2023, 19:16 Zdenko Podobny, <zde...@gmail.com> wrote:

>  tesseract expects black text (lettering) on a white background: that's
> what it has been trained on and that's what will work best. Hence: try to
> convert anything to look like that before feeding it to Tesseract.
>
>
> This is not needed (in all cases ;-) ): tesseract inverts an image by
> itself
> <https://github.com/tesseract-ocr/tesseract/blob/ea0b245c43ee850f1e571d469b313b90d58d8b13/src/lstm/lstmrecognizer.cpp#L349-L378>
>  for
> LSTM and uses the OCR result with the best confidence. In practice it does
> not work 100% of the time. But if somebody cares about speed, the best way
> is to use a binarized image with a white background and black text, plus
> the parameter tessedit_do_invert=0 (or the new parameter
> <https://github.com/tesseract-ocr/tesseract/blob/ea0b245c43ee850f1e571d469b313b90d58d8b13/ChangeLog#L74-L75>
>  invert_threshold=0.0)
>


Oh yes, absolutely, but I've seen images where the lstm "recognized"
gobbledygook with a reported score /above/ 0.7, thus skipping that
"let's see what the inverted clip gives us" code chunk. While I'm usually
fond of some extra detail like invert_threshold, there are way too many
novices running into trouble who are probably better off not knowing about
this option 😉 so they will put more effort into getting their images to
look like white paper (background) with black print on it before they feed
them to tesseract and expect any kind of possibly decent result. Or so I
hope.😅
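
For what it's worth, a minimal sketch of that "make it black on white
first, then tell tesseract not to bother inverting" idea. Assumptions, none
of which come from this thread: the image is already loaded as rows of
0..255 grayscale values, the background covers more pixels than the text,
and the helper names (looks_dark_background, to_black_on_white,
tesseract_args) are made up for illustration. Only the tessedit_do_invert=0
setting is taken from the remark quoted above.

```python
from statistics import median


def looks_dark_background(gray_rows, midpoint=128):
    """Guess whether a grayscale image (rows of 0..255 ints) has a dark background.

    Assumption: background pixels outnumber text pixels, so the median
    brightness is representative of the background.
    """
    pixels = [p for row in gray_rows for p in row]
    return median(pixels) < midpoint


def to_black_on_white(gray_rows, midpoint=128):
    """Return a copy of the image, inverted if the background looks dark."""
    if looks_dark_background(gray_rows, midpoint):
        return [[255 - p for p in row] for row in gray_rows]
    return [list(row) for row in gray_rows]


def tesseract_args(image_path, lang="eng"):
    """Build a command line that disables tesseract's own inversion attempt."""
    return ["tesseract", image_path, "stdout", "-l", lang,
            "-c", "tessedit_do_invert=0"]
```

The argument list could be handed to subprocess.run once the cleaned-up
image has been written to disk; the median check is only a heuristic and
will guess wrong on images where text covers most of the area.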


> (Someone did in-depth research about this many years ago, published on
> this list including charts, but I can't find the link within 60 seconds.
> Lazy me, sorry)
>
>
> "Willus Dotkom" - the link is in the most ignored part of tesseract
> (the documentation) - see
> https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md#rescaling
> :-)
>

Right on, bingo!

😰And I didn't check that page for it, while I did run a mailing list
search. Whoops!🤦

Seriously though: thanks for mentioning that link again. Very useful info
that has been, many times over.
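
A rough sketch of how the 30 to 50 pixel rule of thumb (from my earlier
message, quoted further down) could be turned into a resize factor.
Assumptions: you have measured the text line height in pixels yourself,
the 40px target is simply the middle of that range, and both function
names are hypothetical.

```python
def scale_for_text_height(measured_px, target_px=40, low=30, high=50):
    """Return the resize factor that brings a text line to about target_px.

    Heights already inside [low, high] are left alone (factor 1.0).
    """
    if measured_px <= 0:
        raise ValueError("measured_px must be positive")
    if low <= measured_px <= high:
        return 1.0
    return target_px / measured_px


def scaled_size(width, height, factor):
    """Apply the factor to an image size, never going below 1 pixel."""
    return (max(1, round(width * factor)), max(1, round(height * factor)))
```

For example, a 600x240 image whose text lines measure 120px tall would be
shrunk to a third of its size, while an image with 10px text would be
enlarged four times before being fed to tesseract.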

Merry Christmas,

Ger




>
> Zdenko
>
>
> On Fri, 22 Dec 2023 at 19:51, Ger Hobbelt <g...@hobbelt.com> wrote:
>
>> Couple of things to check/test:
>>
>> - tesseract expects black text (lettering) on white background: that's
>> what it has been trained on and that's what will work best. Hence: try to
>> convert anything to look like that before feeding it to tesseract.
>>
>> - tesseract was trained, if I recall correctly, on 11pt text. That is
>> the figure you'll read in several places on the internet, and it is useless
>> info as-is because pt (points) are a printer/publisher unit of measure for
>> *paper* print, not computer images.
>>  However, this translates to 30-50px total character height, including
>> stem height for glyphs such as p,q,b and d, so the rule of thumb becomes:
>> try to make your text line fit in 30 to 50 pixels height, for possibly best
>> results. (Someone did in-depth research about this many years ago,
>> published on this list including charts, but I can't find the link within
>> 60 seconds. Lazy me, sorry)
>>
>> - tesseract uses dictionary-like behaviour to help guesstimate what it is
>> actually seeing (lstm can be argued to behave like a Markov chain, old
>> skool v3 OCR mode uses dictionaries) and that means tesseract very much
>> likes to see human language "words". Stuff like, if you just saw a q, and
>> your language is any Indo-European one, you can bet your bottom dollar the
>> next glyph will be 'u'. As in: "QUestion".
>>
>> Yours, however, is a semi-random letter matrix for a puzzle, so you may
>> want to look into ways to work around this tesseract behaviour because
>> you are feeding it stuff that's outside the original training domain
>> (books, publications, academic papers).
>> One approach to try is to go and cut the image up into individual
>> character images and feed each to tesseract individually; you MAY observe
>> better overall OCR results then.
>>
>> Second, since lstm is fundamentally like a Markov chain (rather: core has
>> Markov like behavioral aspects) and is NOT engineered for single glyph
>> recognition, you may also want to see how classic tesseract V3 OCR modes
>> are doing with your letter matrices as the older V3 engine is single-shape
>> based and thus *potentially* more suitable for use against semi-random,
>> independent, single character inputs like yours.
>>
>> My 2 cents. HTH
>>
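
A small sketch of the "cut it up and feed each character individually"
idea from the message quoted above. Big assumption: the letter matrix is
an evenly spaced grid whose row and column counts you already know, which
nothing in this thread confirms. The function names are hypothetical. The
boxes are (left, top, right, bottom) tuples, the order Pillow's
Image.crop takes. --psm 10 is tesseract's "treat the image as a single
character" mode and --oem 0 selects the legacy engine, which needs a
traineddata file that still contains the legacy model.

```python
def cell_boxes(width, height, rows, cols):
    """Return (left, top, right, bottom) boxes for an evenly spaced grid.

    Integer division keeps the boxes contiguous even when the image size
    is not an exact multiple of the grid size.
    """
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            top = r * height // rows
            right = (c + 1) * width // cols
            bottom = (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes


def single_char_args(cell_path, legacy=False):
    """Build a tesseract command line for one single-character cell image."""
    args = ["tesseract", cell_path, "stdout", "--psm", "10"]
    if legacy:
        # Legacy (v3 style) engine; assumes legacy-capable traineddata.
        args += ["--oem", "0"]
    return args
```

Running both variants (legacy=False and legacy=True) over the same cells
would be one way to compare the two engines on this kind of input.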
>>
>>
>> On Wed, 20 Dec 2023, 13:33 Mishal Shanavas, <mishalshanav...@gmail.com>
>> wrote:
>>
>>> I cannot extract text with reliable accuracy from a simple text image
>>>
>>> [image: crop.png]
>>>
>>>
>>>
>>> check it out
>>>
>>> https://colab.research.google.com/drive/11utvWD3s6DqqGZQEnk5cKIAj46ZLsF5y?usp=sharing
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-ocr+unsubscr...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/f86e2d35-4c35-4643-835f-109994e46632n%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/f86e2d35-4c35-4643-835f-109994e46632n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>

