L.S.,

I finally took the time to debug this, as the '11' -> 'Tas' image-to-OCR-text conversion was a very curious one.

It turns out tesseract has a bug relatively deep inside its innards: the code DOES NOT take the binarized pixel data (which one would expect it to use for OCR, as those black & white pixels represent the *cleaned-up* source image) but grabs the (noisy!) *original image pixels* instead and feeds those straight into the LSTM engine, resulting in surprising OCR failures. See https://github.com/tesseract-ocr/tesseract/pull/4111 for a submitted bugfix and an extended description/analysis.

I expect this to impact more folks (including *myself*) who have or had WTF trouble with color image inputs and other non-black&white and/or noisy image sources (old book scans, etc.), but I haven't had time to check more images beyond the first sample reported by astro/Nor.

BTW: thanks to astro/Nor, the OP, for solid reporting; it made this evening's debug session and root cause analysis possible at all! It's closing in on 3 AM here, so sleep is overdue; I hope others can reproduce my findings and not discover screw-ups on my part! 😅😅

Best regards,

Ger

P.S.: I haven't dug further into the "lopped off" effect I observed previously (see the screenshot of augmented diagnostics output earlier in this message chain), as it is reasonably explicable by this bug + fix. The fix was discovered while taking out the BestPix() API, after which I ran a `git bisect` driven by tesseract OCR test runs to dig up the actual commit where the fix occurred *by happenstance*. All that means is that I'm a bit hand-wavy about that "lopped-off top of '11'" bounding box as seen before: time/effort restrictions apply, so there *might* be more lurking in that section of the tesseract codebase still... I'm not 100% sure, 's all I'm sayin'. 😅

On Monday, July 31, 2023 at 12:54:47 AM UTC+2 Ger Hobbelt wrote:

> I haven't looked at the effect of the black stripe yet; I only had time to investigate your first image where you reported an OCR error (Tas <-> 11).
> Frankly, I have no idea why that happens exactly; I've found where things go wrong *visually*, but digging up which precise bit of the code decides to lop off the tops there is still an open question -- most of my time went into working on my diagnostics code, which is a work in progress (and benefits from your error reports!)
> Hence I'm loath to say anything pro or contra another black bar.
>
> THEORY (and old practice) would suggest another approach, which is to delete that bottom layer of pixels, so that the "noise" is not picked up by leptonica as "diacritics", which causes the segmentation code to incorrectly dimension the bounding boxes around the text, as you can see (infer) from my partial screenshot. I ASSUME the black bar pushes the default Otsu threshold code to choose a lower (darker) cutoff, which would be a roundabout way of "pushing those bottom noise pixels into the white" and thus "hiding" them from leptonica, but that is, right now, pure conjecture, as I haven't checked yet what your black bar does to the diagnostics output for tesseract.
>
> To be addressed later this week, I hope; the next few days will be loaded with other (non-IT) stuff here, so it'll take some time to reply to this.
>
> On Monday, July 31, 2023 at 12:41:03 AM UTC+2 njsg...@gmail.com wrote:
>
>> Hi Ger
>> Since a black stripe at the top of the image helps, do you think putting a similar stripe at the bottom of the image would help?
>>
>> Nor
>>
>> On 7/30/2023 6:14 PM, Ger Hobbelt wrote:
>>
>> I had a bit of time to run a sample of yours through my (customized) tesseract rig and the OCR error (reading "Tas" instead of "11") is reproducible on my rig (5.3.2 + local patches).
>> This is what comes out as part of the diagnostics report:
>>
>> [image: brave_1Yfwkzyjrg.png]
>>
>> The red hashed areas designate the surroundings of the "word bounding box" currently processed in tesseract.
>>
>> As can be seen, for some very curious reason, the "11" gets lopped off at the top, resulting in some weird OCR results (high confidence "Tas").
>>
>> I don't know WHY this happens exactly -- that requires further investigation -- but this looks like a mishap in the segmentation code.
>>
>> (For others who are interested: this is HTML generated from my custom tesseract; the text lines in the snapshot are tprintf() output, while the images have been added as part of the debug code, where the hashing, clipping, etc. is done via leptonica.)
>>
>> BTW: also note that the noise line at the bottom of the cropped image affects the segmentation as well, as the boxes all reach all the way to the bottom. The bottom line noise is reported as "found some diacritics" and thus influences the line/segmentation code too. But this DOES NOT explain why the "11"s get lopped off while the other digits are not: see the screenshot.
>>
>> Food for thought (and debugging).
>>
>> The binarized image resulting from tesseract's default Otsu thresholding/binarization is attached as well: here the bottom line noise is clearly visible.
>>
>> [image: nor-bushnell-decoded-debug.n0004.img0029.Setup.Page.Seg.And.Detect.Orientation.png]
>>
>> (This image is the binarized b&w image used internally by tesseract as the source image for segmentation/OCR/etc., *blended* with the original source image (as a subdued rose background); what matters here are the pure black pixels, as those are what tesseract sees once we get to the segmentation + OCR stage.)
>>
>> That's it for now; AFK for a while again.
>>
>> Met vriendelijke groeten / Best regards,
>>
>> Ger Hobbelt
>>
>> --------------------------------------------------
>> web: http://www.hobbelt.com/
>> http://www.hebbut.net/
>> mail: g...@hobbelt.com
>> mobile: +31-6-11 120 978
>> --------------------------------------------------
>>
>> On Fri, Jul 28, 2023 at 6:35 PM astro <njsg...@gmail.com> wrote:
>>
>>> Still playing around with improving Tesseract-OCR's results.
>>>
>>> One more data point. As mentioned in my previous post, I found that if there is a dark border at the top of the cropped image, the OCR works much better. With that in mind, I decided to add my own 25 pixel black border to the top of the cropped image by adding the draw command to the command line input for ImageMagick (see attached). With this simple addition I'm able to get 100% conversion in most cases.
>>>
>>> Cheers
>>> Nor
>>>
>>> --
>>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
>>> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/90a2dfae-8bac-d953-9951-1f52597c82c2%40gmail.com.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/81bcaba2-215e-435b-979a-6fa5613ce7b7n%40googlegroups.com.
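
For anyone who wants to poke at the Otsu conjecture quoted in this thread (a black bar pushing the cutoff darker, so the bottom noise row ends up on the white side), here is a minimal sketch. Everything in it is an assumption made up for illustration: the toy pixel values (40 for ink, 130 for the noise row, 220 for paper), the 10-row bar, and the helper names `make_sample`, `add_top_bar`, `otsu_threshold` and `bottom_row_is_ink`. It is textbook Otsu in plain Python, not tesseract's or leptonica's implementation.

```python
def make_sample(width=50, height=20):
    """Toy grayscale crop: light paper, a dark ink block, a gray noise row at the bottom."""
    image = [[220] * width for _ in range(height)]
    for y in range(5, 15):
        for x in range(20, 30):
            image[y][x] = 40          # "ink"
    image[-1] = [130] * width         # bottom noise line
    return image


def add_top_bar(image, rows, value=0):
    """Return a copy of the image with `rows` rows of `value` added on top."""
    width = len(image[0])
    return [[value] * width for _ in range(rows)] + [row[:] for row in image]


def otsu_threshold(image):
    """Return the cutoff t (0..255) that maximises between-class variance.

    Pixels with value <= t count as dark. Ties go to the lowest t.
    """
    hist = [0] * 256
    for row in image:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    total_sum = sum(v * n for v, n in enumerate(hist))

    best_t, best_score = 0, -1.0
    dark_count = dark_sum = 0
    for t in range(256):
        dark_count += hist[t]
        dark_sum += t * hist[t]
        light_count = total - dark_count
        if dark_count == 0 or light_count == 0:
            continue  # one class is empty, variance undefined
        diff = dark_sum / dark_count - (total_sum - dark_sum) / light_count
        score = dark_count * light_count * diff * diff
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def bottom_row_is_ink(image, t):
    """True if every pixel of the bottom row falls on the dark side of t."""
    return all(value <= t for value in image[-1])


if __name__ == "__main__":
    plain = make_sample()
    barred = add_top_bar(plain, 10)
    for name, img in (("plain", plain), ("with bar", barred)):
        t = otsu_threshold(img)
        print(name, "cutoff:", t, "noise row is ink:", bottom_row_is_ink(img, t))
```

On this toy image the cutoff drops from 130 to 40 once the bar is added, so the noise row flips from ink to background. That only shows the mechanism is plausible on made-up data; whether it explains the real scans still needs checking against tesseract's diagnostics output.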