Re: [tesseract-ocr] inaccuracy in plane text

Zdenko Podobny Mon, 25 Dec 2023 02:15:14 -0800

I put it in the documentation because I had the same problem as you (finding
it) :-)


Zdenko


On Mon, 25 Dec 2023 at 4:40, Ger Hobbelt <g...@hobbelt.com> wrote:

>
>
> On Sat, 23 Dec 2023, 19:16 Zdenko Podobny, <zde...@gmail.com> wrote:
>
>>  tesseract expects black text (lettering) on a white background: that's
>> what it has been trained on and that's what will work best. Hence: try to
>> convert anything to look like that before feeding it to Tesseract.
>>
>>
>> This is not needed (in all cases ;-) ): tesseract inverts an image by
>> itself
>> <https://github.com/tesseract-ocr/tesseract/blob/ea0b245c43ee850f1e571d469b313b90d58d8b13/src/lstm/lstmrecognizer.cpp#L349-L378>
>>  for
>> LSTM and uses the OCR result with the best confidence. In practice it does
>> not work 100% of the time. But if somebody cares about speed, the best way
>> is to use a binarized image with a white background and black text, plus
>> the parameter tessedit_do_invert=0 (or the new parameter
>> <https://github.com/tesseract-ocr/tesseract/blob/ea0b245c43ee850f1e571d469b313b90d58d8b13/ChangeLog#L74-L75>
>>  invert_threshold=0.0).
>>
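As a rough illustration of the speed-oriented setup described above, the sketch below assembles a tesseract command line in Python. The helper name `build_tesseract_cmd` and the file names are hypothetical; only the `tessedit_do_invert=0` variable comes from this thread, and newer builds may expect `invert_threshold=0.0` instead. It assumes the input is already binarized, with black text on a white background.

```python
import shlex


def build_tesseract_cmd(image_path, output_base, try_inverted=False):
    """Return the argv list for a tesseract CLI run (hypothetical helper)."""
    cmd = ["tesseract", image_path, output_base]
    if not try_inverted:
        # Ask tesseract not to retry recognition on an inverted copy.
        # Assumption: the build still accepts tessedit_do_invert; newer
        # builds may want "invert_threshold=0.0" here instead.
        cmd += ["-c", "tessedit_do_invert=0"]
    return cmd


if __name__ == "__main__":
    # "page_bw.png" and "page_bw" are placeholder names.
    print(shlex.join(build_tesseract_cmd("page_bw.png", "page_bw")))
```

The resulting list can be handed to `subprocess.run` on a machine where tesseract is installed.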
>
> Oh yes, absolutely, but I've seen images where the lstm "recognized"
> gobbledygook with a reported score /above/ 0.7 and thus skipped that
> "let's see what the inverted clip gives us" code chunk. While I'm usually
> fond of some extra detail like invert_threshold, there are way too many
> novices running into trouble who are probably better off not knowing about
> this option 😉 so they will put more effort into getting their images to
> look like white paper (background) with black print on it, before they feed
> it to tesseract and expect any kind of possibly decent result. Or so I
> hope.😅
>
>
>> (Someone did in-depth research about this many years ago, published on
>> this list including charts, but I can't find the link within 60 seconds.
>> Lazy me, sorry)
>>
>>
>> "Willus Dotkom" - the link is in the most ignored part of tesseract
>> (the documentation) - see
>> https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md#rescaling
>> :-)
>>
>
> Right on, bingo!
>
> 😰And I didn't check that page for it, while I did run a mailing list
> search. Whoops!🤦
>
> Seriously though: thanks for mentioning that link again. Very useful info
> that has been, many times over.
>
> Merry Christmas,
>
> Ger
>
>
>
>
>>
>> Zdenko
>>
>>
>> On Fri, 22 Dec 2023 at 19:51, Ger Hobbelt <g...@hobbelt.com> wrote:
>>
>>> Couple of things to check/test:
>>>
>>> - tesseract expects black text (lettering) on white background: that's
>>> what it has been trained on and that's what will work best. Hence: try to
>>> convert anything to look like that before feeding it to tesseract.
>>>
>>> - tesseract was trained on text that, if I recall correctly, is 11pt.
>>> That is what you'll read in several places on the internet, and it is
>>> useless info as-is because pt (points) are a printer/publisher unit of
>>> measure for *paper* print, not computer images.
>>>  However, this translates to 30-50px total character height, including
>>> stem height for glyphs such as p, q, b and d, so the rule of thumb becomes:
>>> try to make your text line fit in 30 to 50 pixels of height for the best
>>> possible results. (Someone did in-depth research about this many years ago,
>>> published on this list including charts, but I can't find the link within
>>> 60 seconds. Lazy me, sorry)
>>>
>>> - tesseract uses dictionary-like behaviour to help guesstimate what it is
>>> actually seeing (lstm can be argued to behave like a Markov chain, old
>>> skool v3 OCR mode uses dictionaries) and that means tesseract very much
>>> likes to see human language "words". Stuff like: if you just saw a q, and
>>> your language is any Indo-European one, you can bet your bottom dollar the
>>> next glyph will be 'u'. As in: "QUestion".
>>>
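The 30 to 50 pixel rule of thumb above can be turned into a quick calculation. The sketch below is only an illustration: the 300 DPI scan resolution, the 40 px target (the middle of the quoted range) and both function names are assumptions, not something stated in this thread. The only fixed fact used is that one point is 1/72 inch.

```python
def points_to_pixels(points, dpi):
    """Convert a type size in points to pixels (1 pt = 1/72 inch)."""
    return points * dpi / 72.0


def scale_for_target_height(measured_px, target_px=40.0):
    """Factor by which to resize an image so a text line is target_px tall."""
    if measured_px <= 0:
        raise ValueError("measured_px must be positive")
    return target_px / measured_px


if __name__ == "__main__":
    # 11 pt type scanned at an assumed 300 DPI is roughly 46 px tall.
    print(round(points_to_pixels(11, 300), 1))  # 45.8
    # A text line measured at 16 px would need to be enlarged 2.5 times.
    print(scale_for_target_height(16))  # 2.5
```

Measuring the line height in the actual image (from the top of the ascenders to the bottom of the descenders) and resizing by the returned factor is one way to land inside the recommended range.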
>>> Yours, however, is a semi-random letter matrix for a puzzle, so you may
>>> want to look into ways to work around this tesseract behaviour because
>>> you are feeding it stuff that's outside the original training domain
>>> (books, publications, academic papers).
>>> One approach to try is to cut the image up into individual
>>> character images and feed each to tesseract individually; you MAY observe
>>> better overall OCR results then.
>>>
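Cutting a letter matrix into per-character images could start from something like the sketch below. It assumes the puzzle is an evenly spaced grid with a known number of rows and columns, which may not hold for the image in question; the function name and the `margin` parameter are made up for illustration. The boxes use the (left, upper, right, lower) order that Pillow's `Image.crop` accepts.

```python
def grid_cell_boxes(width, height, rows, cols, margin=0):
    """Return one (left, upper, right, lower) box per grid cell, row by row.

    Assumes an evenly spaced grid that covers the whole image.
    """
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols + margin
            upper = r * height // rows + margin
            right = (c + 1) * width // cols - margin
            lower = (r + 1) * height // rows - margin
            boxes.append((left, upper, right, lower))
    return boxes


if __name__ == "__main__":
    # A 100x50 px image holding 2 rows of 4 letters.
    for box in grid_cell_boxes(100, 50, rows=2, cols=4):
        print(box)
```

A small positive `margin` trims grid lines or neighbouring glyphs from each cell before the crops are passed to tesseract one at a time.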
>>> Second, since lstm is fundamentally like a Markov chain (rather: its core
>>> has Markov-like behavioral aspects) and is NOT engineered for single glyph
>>> recognition, you may also want to see how the classic tesseract V3 OCR modes
>>> are doing with your letter matrices, as the older V3 engine is single-shape
>>> based and thus *potentially* more suitable for use against semi-random,
>>> independent, single character inputs like yours.
>>>
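For comparing the two engines on single-character crops, a command line along these lines may be a starting point. The helper name and file name are hypothetical. `--psm 10` (treat the image as a single character) and `--oem 0` (legacy engine only) are standard tesseract options, but the legacy engine only works with traineddata files that still contain the legacy model, so treat this as a sketch under that assumption.

```python
import shlex


def single_char_cmd(image_path, legacy=False, lang="eng"):
    """Return a tesseract argv list for one single-character image."""
    # "stdout" as the output base prints the recognized text to the terminal.
    cmd = ["tesseract", image_path, "stdout", "-l", lang, "--psm", "10"]
    if legacy:
        # Legacy (V3-style) engine; requires traineddata that includes
        # the legacy model, not an LSTM-only file.
        cmd += ["--oem", "0"]
    return cmd


if __name__ == "__main__":
    # "cell_00.png" is a placeholder file name.
    print(shlex.join(single_char_cmd("cell_00.png")))
    print(shlex.join(single_char_cmd("cell_00.png", legacy=True)))
```

Running both variants over the same set of crops and comparing the output against the known letters gives a direct check of which engine copes better with this kind of input.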
>>> My 2 cents. HTH
>>>
>>>
>>>
>>> On Wed, 20 Dec 2023, 13:33 Mishal Shanavas, <mishalshanav...@gmail.com>
>>> wrote:
>>>
>>>> I cannot extract text with reliable accuracy from a simple text image
>>>>
>>>> [image: crop.png]
>>>>
>>>>
>>>>
>>>> check it out
>>>>
>>>> https://colab.research.google.com/drive/11utvWD3s6DqqGZQEnk5cKIAj46ZLsF5y?usp=sharing
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to tesseract-ocr+unsubscr...@googlegroups.com.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/f86e2d35-4c35-4643-835f-109994e46632n%40googlegroups.com
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/f86e2d35-4c35-4643-835f-109994e46632n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>
