A couple of things to check/test:

- Tesseract expects black text (lettering) on a white background: that's what it has been trained on and that's what will work best. Hence: try to convert your input to look like that before feeding it to Tesseract (a rough preprocessing sketch follows right below).
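A minimal preprocessing sketch, assuming OpenCV and pytesseract are installed; the file name "crop.png" and the thresholding choices are just placeholders, not anything taken from your notebook:

    import cv2
    import pytesseract

    # Load and grayscale the input (file name is a placeholder).
    img = cv2.imread("crop.png")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Otsu binarisation: produces a clean two-level image.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Tesseract prefers dark text on a light background; if most pixels
    # are dark, the image is probably white-on-black, so invert it.
    if binary.mean() < 127:
        binary = cv2.bitwise_not(binary)

    print(pytesseract.image_to_string(binary))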
- Tesseract was trained on text that, if I recall correctly, is 11pt. That figure is repeated in several places on the internet and is useless as-is, because points are a printer/publisher unit of measure for *paper* print, not computer images. In practice it translates to roughly 30-50 pixels of total character height, including the ascenders and descenders of glyphs such as p, q, b and d. So the rule of thumb becomes: scale your image so a text line fits in about 30 to 50 pixels of height, for possibly the best results (a resizing sketch is at the end of this mail). (Someone did in-depth research on this many years ago and published it on this list, charts included, but I can't find the link within 60 seconds. Lazy me, sorry.)

- Tesseract uses dictionary-like behaviour to help guesstimate what it is actually seeing (the LSTM engine can be argued to behave like a Markov chain; the old-school v3 OCR mode uses actual dictionaries), and that means Tesseract very much likes to see human-language "words". For example: if it has just seen a 'q' and your language is any Indo-European one, you can bet your bottom dollar the next glyph will be a 'u', as in "QUestion". Yours, however, is a semi-random letter matrix for a puzzle, so you may want to look into ways to circumvent this behaviour, because you are feeding Tesseract material from outside its original training domain (books, publications, academic papers). One approach to try is to cut the image up into individual character images and feed each one to Tesseract separately; you MAY observe better overall OCR results that way. Second, since the LSTM engine is fundamentally Markov-chain-like (or rather: its core has Markov-like behavioural aspects) and is NOT engineered for single-glyph recognition, you may also want to see how the classic Tesseract v3 OCR modes do with your letter matrices: the older v3 engine is single-shape based and thus *potentially* more suitable for semi-random, independent, single-character inputs like yours (see the per-character sketch at the end of this mail).

My 2 cents. HTH

On Wed, 20 Dec 2023, 13:33 Mishal Shanavas, <mishalshanav...@gmail.com> wrote:

> i can not extract text with reliable accuracy of a simple text
>
> [image: crop.png]
>
> check it out
>
> https://colab.research.google.com/drive/11utvWD3s6DqqGZQEnk5cKIAj46ZLsF5y?usp=sharing
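Regarding the 30-50 pixel rule of thumb above, a minimal resizing sketch, assuming OpenCV; the measured line height and the 40 px target are placeholder values you would adapt to your own image:

    import cv2

    # Placeholder values: measure the actual line height in your image
    # (e.g. from bounding boxes) and pick a target in the 30-50 px range.
    measured_line_height = 14
    target_line_height = 40

    img = cv2.imread("crop.png", cv2.IMREAD_GRAYSCALE)
    scale = target_line_height / measured_line_height

    # Upscale with smooth interpolation so glyph edges stay clean.
    resized = cv2.resize(img, None, fx=scale, fy=scale,
                         interpolation=cv2.INTER_CUBIC)
    cv2.imwrite("crop_resized.png", resized)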
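And for the per-character / legacy-engine idea, a sketch of how it might look with pytesseract. --psm 10 ("treat the image as a single character") and --oem 0 (classic engine) are standard Tesseract options, but --oem 0 only works if your traineddata file still includes the legacy model; the grid geometry below is purely hypothetical and would have to be replaced with the real cell positions of your puzzle matrix:

    import cv2
    import pytesseract

    # Hypothetical 4x4 grid of 50x50 px cells; replace with your real
    # matrix layout (e.g. detected via contours or fixed offsets).
    ROWS, COLS, CELL = 4, 4, 50

    img = cv2.imread("matrix.png", cv2.IMREAD_GRAYSCALE)

    # --psm 10: single character; --oem 0: classic (pre-LSTM) engine.
    # Turning off the word dictionaries stops Tesseract from "correcting"
    # random letters into dictionary words.
    config = ("--psm 10 --oem 0 "
              "-c load_system_dawg=0 -c load_freq_dawg=0")

    grid = []
    for r in range(ROWS):
        row = []
        for c in range(COLS):
            cell = img[r * CELL:(r + 1) * CELL, c * CELL:(c + 1) * CELL]
            row.append(pytesseract.image_to_string(cell, config=config).strip())
        grid.append(row)

    print(grid)

If --oem 0 errors out because your tessdata only ships the LSTM model, try the same loop with the default engine first and compare the results.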