I put it in the documentation because I had the same problem as you (finding it) :-)
Zdenko

On Mon, 25 Dec 2023 at 4:40, Ger Hobbelt <g...@hobbelt.com> wrote:

> On Sat, 23 Dec 2023, 19:16 Zdenko Podobny, <zde...@gmail.com> wrote:
>
>> tesseract expects black text (lettering) on a white background: that's
>> what it has been trained on and that's what will work best. Hence: try to
>> convert anything to look like that before feeding it to Tesseract.
>>
>>
>> This is not needed (in all cases ;-) ): tesseract inverts an image by
>> itself
>> <https://github.com/tesseract-ocr/tesseract/blob/ea0b245c43ee850f1e571d469b313b90d58d8b13/src/lstm/lstmrecognizer.cpp#L349-L378>
>> for LSTM and uses the OCR results with the best confidence. In practice
>> it does not work 100% of the time. But if somebody cares about speed, the
>> best way is to use a binarized image with a white background and black
>> text, plus the parameter tessedit_do_invert=0 (or the new parameter
>> <https://github.com/tesseract-ocr/tesseract/blob/ea0b245c43ee850f1e571d469b313b90d58d8b13/ChangeLog#L74-L75>
>> invert_threshold=0.0)
>>
>
> Oh yes, absolutely, but I've seen images where the lstm "recognized"
> gobbledygook with a reported score /above/ 0.7 and thus skipped that
> "let's see what the inverted clip gives us" code chunk. While I'm usually
> fond of some extra detail like invert_threshold, there are way too many
> novices running into trouble who are probably better off not knowing about
> this option 😉 so they will put more effort into getting their images to
> look like white paper (background) with black print on it, before they feed
> it to tesseract and expect any kind of possibly decent result. Or so I
> hope.😅
>
>
>> (Someone did in-depth research about this many years ago, published on
>> this list including charts, but i can't find the link within 60 seconds.
>> Lazy me, sorry)
>>
>>
>> "Willus Dotkom" - the link is part of the most ignored part of tesseract
>> (the documentation) - see
>> https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md#rescaling
>> :-)
>>
>
> Right on, bingo!
>
> 😰And I didn't check that page for it, while I did run a mailing list
> search. Whoops!🤦
>
> Seriously though: thanks for mentioning that link again. Very useful info
> that has been, many times over.
>
> Merry Christmas,
>
> Ger
>
>
>>
>> Zdenko
>>
>>
>> On Fri, 22 Dec 2023 at 19:51, Ger Hobbelt <g...@hobbelt.com> wrote:
>>
>>> Couple of things to check/test:
>>>
>>> - tesseract expects black text (lettering) on white background: that's
>>> what it has been trained on and that's what will work best. Hence: try to
>>> convert anything to look like that before feeding it to tesseract.
>>>
>>> - tesseract was trained on text that is, if I recall correctly, 11pt.
>>> That is what you'll read in several places on the internet, and it is
>>> useless info as-is because pt (points) are a printer/publisher unit of
>>> measure for *paper* print, not computer images.
>>> However, this translates to 30-50px total character height, including
>>> stem height for glyphs such as p, q, b and d, so the rule of thumb becomes:
>>> try to make your text line fit in 30 to 50 pixels height, for possibly best
>>> results. (Someone did in-depth research about this many years ago,
>>> published on this list including charts, but i can't find the link within
>>> 60 seconds. Lazy me, sorry)
>>>
>>> - tesseract uses dictionary-like behaviour to help guesstimate what it is
>>> actually seeing (lstm can be argued to behave like a Markov chain, old
>>> skool v3 OCR mode uses dictionaries) and that means tesseract very much
>>> likes to see human language "words". Stuff like: if you just saw a q, and
>>> your language is any Indo-European one, you can bet your bottom dollar the
>>> next glyph will be 'u'. As in: "QUestion".
>>>
>>> Yours, however, is a semi-random letter matrix for a puzzle, so you may
>>> want to look into ways to circumvent this tesseract behaviour because
>>> you are feeding it stuff that's outside the original training domain
>>> (books, publications, academic papers).
>>> One approach to try is to cut the image up into individual character
>>> images and feed each to tesseract individually; you MAY observe
>>> better overall OCR results then.
>>>
>>> Second, since lstm is fundamentally like a Markov chain (rather: its core
>>> has Markov-like behavioural aspects) and is NOT engineered for single glyph
>>> recognition, you may also want to see how classic tesseract V3 OCR modes
>>> are doing with your letter matrices, as the older V3 engine is single-shape
>>> based and thus *potentially* more suitable for use against semi-random,
>>> independent, single character inputs like yours.
>>>
>>> My 2 cents. HTH
>>>
>>>
>>> On Wed, 20 Dec 2023, 13:33 Mishal Shanavas, <mishalshanav...@gmail.com>
>>> wrote:
>>>
>>>> i can not extract text with reliable accuracy of a simple text
>>>>
>>>> [image: crop.png]
>>>>
>>>> check it out
>>>>
>>>> https://colab.research.google.com/drive/11utvWD3s6DqqGZQEnk5cKIAj46ZLsF5y?usp=sharing
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to tesseract-ocr+unsubscr...@googlegroups.com.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/f86e2d35-4c35-4643-835f-109994e46632n%40googlegroups.com
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/f86e2d35-4c35-4643-835f-109994e46632n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8xcpSTrXzYJi4V2d-rKwhO5P3vtkd0z%2Bvg75UeyQOoudg%40mail.gmail.com.
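
For anyone who wants to try the advice in this thread, here is a minimal sketch of it in Python (standard library only). It is not code from anyone in the thread. The helper names (scale_factor, needs_inversion, invert, tesseract_command) are made up for illustration, and the "dark mean means dark background" test is my own assumption. The command-line options used (--psm 10 for single-character mode, --oem 0 for the legacy engine, -c tessedit_do_invert=0) are regular tesseract options; note that --oem 0 only works with traineddata files that still contain the legacy model.

```python
# Rule of thumb quoted in the thread: text lines of roughly 30-50 px height.
TARGET_MIN_PX = 30
TARGET_MAX_PX = 50


def scale_factor(text_height_px: float) -> float:
    """Return the resize factor that brings a text line into the 30-50 px band."""
    if text_height_px <= 0:
        raise ValueError("text height must be positive")
    if TARGET_MIN_PX <= text_height_px <= TARGET_MAX_PX:
        return 1.0  # already in range, leave the image alone
    target = (TARGET_MIN_PX + TARGET_MAX_PX) / 2  # aim for the middle: 40 px
    return target / text_height_px


def needs_inversion(gray_rows: list[list[int]]) -> bool:
    """Guess whether an 8-bit grayscale image is light text on a dark background.

    Assumption (not from the thread): text covers less area than the
    background, so a dark mean value suggests a dark background.
    """
    pixels = [p for row in gray_rows for p in row]
    if not pixels:
        raise ValueError("empty image")
    return sum(pixels) / len(pixels) < 128


def invert(gray_rows: list[list[int]]) -> list[list[int]]:
    """Return a new image with every 8-bit grayscale value inverted."""
    return [[255 - p for p in row] for row in gray_rows]


def tesseract_command(image_path: str, single_char: bool = False,
                      legacy: bool = False) -> list[str]:
    """Build a tesseract command line for an image that is already black on white."""
    cmd = ["tesseract", image_path, "stdout"]
    if single_char:
        cmd += ["--psm", "10"]  # treat the image as a single character
    if legacy:
        cmd += ["--oem", "0"]  # legacy (V3-style) engine only
    # The image was normalised beforehand, so skip tesseract's own inversion retry.
    cmd += ["-c", "tessedit_do_invert=0"]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, capture_output=True, text=True). Actually resizing, inverting and cutting the image into single characters is left to whatever image library you already use; the functions above only compute what to do.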