> Tesseract expects black text (lettering) on a white background: that's what it has been trained on and that's what will work best. Hence: try to convert anything to look like that before feeding it to Tesseract.
This is not needed in all cases ;-) For LSTM, Tesseract inverts the image by itself <https://github.com/tesseract-ocr/tesseract/blob/ea0b245c43ee850f1e571d469b313b90d58d8b13/src/lstm/lstmrecognizer.cpp#L349-L378> and uses the OCR result with the best confidence. In practice this does not work 100% of the time. But if somebody cares about speed, the best way is to use a binarized image with a white background and black text, plus the parameter tessedit_do_invert=0 (or the new parameter <https://github.com/tesseract-ocr/tesseract/blob/ea0b245c43ee850f1e571d469b313b90d58d8b13/ChangeLog#L74-L75> invert_threshold=0.0).

> (Someone did in depth research about this many years ago, published on this list including charts, but i can't find the link within 60 seconds. Lazy me, sorry)

"Willus Dotkom" - the link is part of the most ignored part of Tesseract (the documentation) - see https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md#rescaling :-)

Zdenko

On Fri, 22 Dec 2023 at 19:51, Ger Hobbelt <g...@hobbelt.com> wrote:

> Couple of things to check/test:
>
> - tesseract expects black text (lettering) on white background: that's what it has been trained on and that's what will work best. Hence: try to convert anything to look like that before feeding it to tesseract.
>
> - tesseract was trained on text that is, if I recall correctly, 11pt. That is what you'll read in several places on the internet, and it is useless info as-is because pt (points) are a printer/publisher unit of measure for *paper* print, not computer images.
>   However, this translates to 30-50px total character height, including stem height for glyphs such as p, q, b and d, so the rule of thumb becomes: try to make your text line fit in 30 to 50 pixels height, for possibly best results. (Someone did in depth research about this many years ago, published on this list including charts, but i can't find the link within 60 seconds. Lazy me, sorry)
>
> - tesseract uses dictionary-like behaviour to help guesstimate what it is actually seeing (lstm can be argued to behave like a Markov chain, old skool v3 OCR mode uses dictionaries) and that means tesseract very much likes to see human language "words". Stuff like: if you just saw a q, and your language is any Indo-European one, you can bet your bottom dollar the next glyph will be 'u'. As in: "QUestion".
>
> Yours, however, is a semi-random letter matrix for a puzzle, so you may want to look into ways to circumvent this tesseract behaviour, because you are feeding it stuff that's outside the original training domain (books, publications, academic papers).
> One approach to try is to cut the image up into individual character images and feed each to tesseract individually; you MAY observe better overall OCR results then.
>
> Second, since lstm is fundamentally like a Markov chain (rather: its core has Markov-like behavioral aspects) and is NOT engineered for single glyph recognition, you may also want to see how the classic tesseract V3 OCR modes are doing with your letter matrices, as the older V3 engine is single-shape based and thus *potentially* more suitable for use against semi-random, independent, single character inputs like yours.
>
> My 2 cents. HTH
>
> On Wed, 20 Dec 2023, 13:33 Mishal Shanavas, <mishalshanav...@gmail.com> wrote:
>
>> I cannot extract text with reliable accuracy from a simple text
>>
>> [image: crop.png]
>>
>> check it out
>>
>> https://colab.research.google.com/drive/11utvWD3s6DqqGZQEnk5cKIAj46ZLsF5y?usp=sharing
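
For anyone who wants to try the checklist above, here is a minimal Python sketch of it. Nothing in it comes from the thread itself: the helper names (needs_inversion, invert, scale_for_text_height, tesseract_args) are hypothetical, the mid-gray cutoff of 128 and the 40 px target are assumptions chosen to sit inside the 30-50 px rule of thumb, and the image is modelled as a plain list of rows of 0-255 grayscale values so that no imaging library is needed. The command-line options used (--psm 10 for a single character, --oem 0 for the legacy engine, -c tessedit_do_invert=0) are regular Tesseract CLI options; note that --oem 0 only works with traineddata files that still contain the legacy model.

```python
from statistics import mean


def needs_inversion(gray_rows):
    """Guess whether the image is light text on a dark background.

    Assumption: the background covers most of the image, so a mean
    brightness below mid-gray (128) suggests a dark background.
    """
    pixels = [p for row in gray_rows for p in row]
    return mean(pixels) < 128


def invert(gray_rows):
    """Flip every 0-255 grayscale value so dark becomes light and vice versa."""
    return [[255 - p for p in row] for row in gray_rows]


def scale_for_text_height(line_height_px, target_px=40):
    """Return the resize factor that brings a text line to about target_px.

    Heights already inside the 30-50 px rule of thumb are left alone (1.0).
    """
    if line_height_px <= 0:
        raise ValueError("line height must be positive")
    if 30 <= line_height_px <= 50:
        return 1.0
    return target_px / line_height_px


def tesseract_args(image_path, single_char=False, legacy=False, skip_invert=False):
    """Build a tesseract command line that prints the result to stdout."""
    args = ["tesseract", image_path, "stdout"]
    if single_char:
        # Page segmentation mode 10: treat the image as a single character.
        args += ["--psm", "10"]
    if legacy:
        # OCR engine mode 0: legacy (V3-style) engine only.
        args += ["--oem", "0"]
    if skip_invert:
        # Input is already black on white, so skip the extra inverted pass.
        args += ["-c", "tessedit_do_invert=0"]
    return args
```

A per-cell run for a letter matrix could then pass tesseract_args("cell_03.png", single_char=True, legacy=True, skip_invert=True) to subprocess.run, where cell_03.png is a made-up file name for one cropped character. Whether the legacy engine actually beats LSTM on such input is something to measure, not assume.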
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8wXu9eLPzh7KWfRt1d0F2um7XExjyYa3L%3DO1W5HYN%2Bo3g%40mail.gmail.com.