> tesseract expects black text (lettering) on a white background: that's
> what it has been trained on and that's what will work best. Hence: try to
> convert anything to look like that before feeding it to Tesseract.


This is not needed (in all cases ;-) ): for LSTM, Tesseract inverts an image
by itself
<https://github.com/tesseract-ocr/tesseract/blob/ea0b245c43ee850f1e571d469b313b90d58d8b13/src/lstm/lstmrecognizer.cpp#L349-L378>
and uses the OCR result with the best confidence. In practice this does not
work 100% of the time. But if somebody cares about speed, the best way is to
use a binarized image with a white background and black text, plus the
parameter tessedit_do_invert=0 (or the new parameter
<https://github.com/tesseract-ocr/tesseract/blob/ea0b245c43ee850f1e571d469b313b90d58d8b13/ChangeLog#L74-L75>
invert_threshold=0.0).
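
A minimal sketch of that preparation step in Python, assuming an 8-bit
grayscale image held as a plain list of rows (in practice it would come from
an image library) and assuming the background covers more area than the
text. The helper names and the fixed threshold of 128 are hypothetical; only
the -c tessedit_do_invert=0 setting comes from the text above.

```python
def binarize(gray, threshold=128):
    """Map 8-bit grayscale pixels to 0 (black) or 255 (white)."""
    return [[255 if px >= threshold else 0 for px in row] for row in gray]


def ensure_dark_on_light(binary):
    """Invert a binarized image when most of its pixels are black.

    Assumption: the background covers more area than the text, so a
    mostly-black image is light text on a dark background.
    """
    total = sum(len(row) for row in binary)
    white = sum(px == 255 for row in binary for px in row)
    if white * 2 < total:
        return [[255 - px for px in row] for row in binary]
    return binary


def tesseract_command(image_path, invert_param="tessedit_do_invert=0"):
    """Build a CLI call that switches off Tesseract's own inversion pass."""
    return ["tesseract", image_path, "stdout", "-c", invert_param]


# Light text (200) on a dark background (30): flipped to black on white.
sample = [[30, 30, 30, 30],
          [30, 200, 200, 30],
          [30, 30, 30, 30]]
prepared = ensure_dark_on_light(binarize(sample))
command = tesseract_command("prepared.png")
```

On releases that have the newer parameter, passing
invert_param="invert_threshold=0.0" instead should have the same effect.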

> (Someone did in-depth research about this many years ago, published on this
> list including charts, but I can't find the link within 60 seconds. Lazy
> me, sorry)


"Willus Dotkom" - the link is in the most ignored part of Tesseract
(the documentation) - see
https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md#rescaling
:-)
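
The 30 to 50 px rule of thumb from the quoted message can be turned into a
resize factor. This is only a sketch: the function name is hypothetical, the
measured line height is assumed to be known already, and aiming for the
middle of the range is an arbitrary choice.

```python
def scale_for_target_height(measured_px, low=30, high=50):
    """Return the resize factor that brings a text line into [low, high] px.

    Lines already inside the range are left alone (factor 1.0); anything
    else is scaled to the middle of the range.
    """
    if measured_px <= 0:
        raise ValueError("measured_px must be positive")
    if low <= measured_px <= high:
        return 1.0
    return ((low + high) / 2) / measured_px


factor_small = scale_for_target_height(16)   # 40 / 16, upscale
factor_ok = scale_for_target_height(42)      # already in range
factor_large = scale_for_target_height(120)  # 40 / 120, downscale
```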


Zdenko


On Fri, 22 Dec 2023 at 19:51, Ger Hobbelt <g...@hobbelt.com> wrote:

> Couple of things to check/test:
>
> - tesseract expects black text (lettering) on white background: that's
> what it has been trained on and that's what will work best. Hence: try to
> convert anything to look like that before feeding it to tesseract.
>
> - tesseract was trained, if I recall correctly, on 11pt text. That is
> what you'll read in several places on the internet, and it is useless info
> as-is because pt (points) are a printer/publisher unit of measure for
> *paper* print, not computer images.
>  However, this translates to 30-50px total character height, including
> stem height for glyphs such as p, q, b and d, so the rule of thumb becomes:
> try to make your text line fit in 30 to 50 pixels of height for the best
> possible results. (Someone did in-depth research about this many years ago,
> published on this list including charts, but I can't find the link within
> 60 seconds. Lazy me, sorry)
>
> - tesseract uses dictionary-like behaviour to help guesstimate what it is
> actually seeing (lstm can be argued to behave like a Markov chain, old
> skool v3 OCR mode uses dictionaries) and that means tesseract very much
> likes to see human language "words". For example, if you just saw a q and
> your language is any Indo-European one, you can bet your bottom dollar the
> next glyph will be 'u'. As in: "QUestion".
>
> Yours, however, is a semi-random letter matrix for a puzzle, so you may
> want to look into ways to circumvent this tesseract behaviour because
> you are feeding it stuff that's outside the original training domain
> (books, publications, academic papers).
> One approach to try is to go and cut the image up into individual
> character images and feed each to tesseract individually; you MAY observe
> better overall OCR results then.
>
> Second, since lstm is fundamentally like a Markov chain (rather: its core
> has Markov-like behavioral aspects) and is NOT engineered for single glyph
> recognition, you may also want to see how classic tesseract V3 OCR modes
> are doing with your letter matrices, as the older V3 engine is single-shape
> based and thus *potentially* more suitable for use against semi-random,
> independent, single character inputs like yours.
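
The two suggestions above (one crop per character, and a try with the legacy
engine) might be sketched like this in Python. It assumes the letter matrix
is an evenly spaced grid and that the crops are written to files by an image
library; the function names and file names are hypothetical. --psm 10 tells
tesseract to treat the image as a single character, and --oem 0 selects the
legacy engine.

```python
def grid_cells(width, height, rows, cols):
    """Yield (row, col, left, top, right, bottom) boxes for an even grid."""
    for r in range(rows):
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            top, bottom = r * height // rows, (r + 1) * height // rows
            yield r, c, left, top, right, bottom


def single_char_command(crop_path, legacy=False):
    """Build a tesseract call that treats the crop as one character."""
    cmd = ["tesseract", crop_path, "stdout", "--psm", "10"]
    if legacy:
        # --oem 0 selects the legacy engine; it needs traineddata files
        # that still include the legacy model.
        cmd += ["--oem", "0"]
    return cmd


# A 300x200 px image holding 2 rows of 3 letters each.
cells = list(grid_cells(width=300, height=200, rows=2, cols=3))
commands = [single_char_command(f"cell_{r}_{c}.png") for r, c, *_ in cells]
```

Each box can be handed to the crop function of whatever image library is in
use, and each command run as a subprocess to collect one character per cell.
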
>
> My 2 cents. HTH
>
>
>
> On Wed, 20 Dec 2023, 13:33 Mishal Shanavas, <mishalshanav...@gmail.com>
> wrote:
>
>> I cannot extract the text from a simple text image with reliable accuracy
>>
>> [image: crop.png]
>>
>>
>>
>> check it out
>>
>> https://colab.research.google.com/drive/11utvWD3s6DqqGZQEnk5cKIAj46ZLsF5y?usp=sharing
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-ocr+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/f86e2d35-4c35-4643-835f-109994e46632n%40googlegroups.com
>> <https://groups.google.com/d/msgid/tesseract-ocr/f86e2d35-4c35-4643-835f-109994e46632n%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
