Re: [tesseract-ocr] Re: Trying to understand why Tesseract-ocr fails on some images

Ger Hobbelt Sun, 30 Jul 2023 15:54:52 -0700

I haven't looked at the effect of the black stripe yet; I only had time to 
investigate your first image where you reported an OCR error (Tas <-> 11).
Frankly, I have no idea why that happens exactly; I've found where things 
go wrong *visually* but digging up which precise bit of the code decides to 
lop off the tops there is still an open question -- most of my time went 
into working on my diagnostics code, which is a work in progress (and 
benefits from your error reports!)
Hence I'm loath to say anything pro or contra another black bar.


THEORY (and old practice) would suggest another approach, which is to 
delete that bottom layer of pixels, so that the "noise" is not picked up by 
leptonica as "diacritics" and does not cause the segmentation code to 
incorrectly dimension the bounding boxes around the text, as you can see 
(infer) from my partial screenshot. I ASSUME the black bar pushes the 
default Otsu threshold code to choose a lower (darker) cutoff, which would 
be a roundabout way of "pushing those bottom noise pixels into the white" 
and thus "hiding" them from leptonica, but that is, right now, pure 
conjecture, as I haven't yet checked what your black bar does to the 
diagnostics output for tesseract.
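
To make the conjecture above concrete, here is a minimal sketch on a 
made-up image. It uses a textbook global Otsu threshold, which is NOT 
tesseract's own code and may differ from it in detail. The gray levels (40 
for text, 120 for the bottom noise, 200 for the paper), the image size and 
the helper names are all assumptions chosen for illustration; whether the 
real image behaves the same way remains unverified.

```python
def histogram(rows):
    """Count how often each 8-bit gray level occurs in a list-of-rows image."""
    hist = [0] * 256
    for row in rows:
        for value in row:
            hist[value] += 1
    return hist


def otsu_threshold(hist):
    """Textbook global Otsu: pick t that maximises between-class variance.

    Pixels <= t count as dark (ink), pixels > t as light (paper).
    """
    total = sum(hist)
    weighted_total = sum(level * count for level, count in enumerate(hist))
    best_t, best_score = 0, -1.0
    dark_count = 0
    dark_sum = 0
    for t in range(256):
        dark_count += hist[t]
        dark_sum += t * hist[t]
        light_count = total - dark_count
        if dark_count == 0:
            continue  # no dark class yet
        if light_count == 0:
            break  # every pixel is already in the dark class
        mean_dark = dark_sum / dark_count
        mean_light = (weighted_total - dark_sum) / light_count
        score = dark_count * light_count * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def add_top_bar(rows, height, value=0):
    """Return a copy of the image with a solid bar prepended at the top."""
    width = len(rows[0])
    return [[value] * width for _ in range(height)] + [list(r) for r in rows]


def blank_bottom_rows(rows, count, value=255):
    """Return a copy of the image with the bottom `count` rows set to `value`."""
    width = len(rows[0])
    kept = [list(r) for r in rows[:len(rows) - count]]
    return kept + [[value] * width for _ in range(count)]


# Hypothetical 100-pixel-wide crop: one row of dark text, twenty rows of
# paper, two rows of mid-gray noise along the bottom edge.
WIDTH = 100
image = [[40] * WIDTH] + [[200] * WIDTH for _ in range(20)] \
    + [[120] * WIDTH for _ in range(2)]

plain = otsu_threshold(histogram(image))
with_bar = otsu_threshold(histogram(add_top_bar(image, 30)))
cleaned = otsu_threshold(histogram(blank_bottom_rows(image, 2)))

print(plain)     # 120: the noise (120) is <= threshold, so it turns black
print(with_bar)  # 40: the noise is now above the threshold and turns white
print(cleaned)   # 40: the noise rows are gone before thresholding
```

In this toy case the 30-row black bar moves the cutoff from 120 down to 40, 
so the bottom noise ends up on the white side, which is the same outcome as 
blanking the bottom rows directly. With other gray levels or bar sizes the 
cutoff may not move far enough, so this only shows that the mechanism is 
possible, not that it is what happens with the real image.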

To be addressed later this week I hope; next few days will be loaded with 
other (non IT) stuff here, so it'll take some time to reply to this.



On Monday, July 31, 2023 at 12:41:03 AM UTC+2 njsg...@gmail.com wrote:

> Hi Ger
>    Since a black stripe at the top of the image helps, do you think putting 
> a similar stripe at the bottom of the image would help?
>
> Nor
>
>
> On 7/30/2023 6:14 PM, Ger Hobbelt wrote:
>
> I had a bit of time to run a sample of yours through my (customized) 
> tesseract rig and the OCR (reading "Tas" instead of "11") is reproducible 
> on my rig (5.3.2 + local patches). 
> This is what comes out as part of the diagnostics report:
>
> [image: brave_1Yfwkzyjrg.png]
>
> The red hashed areas designate the surroundings of the "word bounding box" 
> currently processed in tesseract.
>
> As can be seen, for some very curious reason, the "11" get lopped off at 
> the top resulting in some weird OCR results (high confidence "Tas").
>
> I don't know WHY this happens exactly -- that requires further 
> investigation -- but this looks like a mishap in the segmentation code.
>
> (For others who are interested: this is HTML generated from my custom 
> tesseract; the text lines in the snapshot are tprintf() output, while the 
> images have been added as part of the debug code, where the hashing, 
> clipping, etc. is done via leptonica.)
>
> BTW: note that the noise line at the bottom of the cropped image also 
> affects the segmentation, as the boxes all reach all the way to the bottom. 
> The bottom line noise is reported as "found some diacritics" and thus 
> influences the line/segmentation code as well. But this DOES NOT explain 
> why the "11"s get lopped off, while the other digits are not: see the 
> screenshot.
>
> Food for thought (and debugging).
>
> Binarized image resulting from tesseract default Otsu 
> thresholding/binarization is attached as well: here the bottom line noise 
> is clearly visible.
>
> [image: 
> nor-bushnell-decoded-debug.n0004.img0029.Setup.Page.Seg.And.Detect.Orientation.png]
>
> (This image is the binarized b&w image used internally by tesseract as the 
> source image for segmentation/ocr/etc., which is *blended* with the 
> original source image (as a subdued rose background); what matters here are 
> the pure black pixels as those are what tesseract sees once we get at the 
> segmentation + ocr stage.)
>
>
>
> That's it for now; AFK for a while again.
>
>
>
> Met vriendelijke groeten / Best regards,
>
> Ger Hobbelt
>
> --------------------------------------------------
> web:    http://www.hobbelt.com/
>         http://www.hebbut.net/
> mail:   g...@hobbelt.com
> mobile: +31-6-11 120 978
> --------------------------------------------------
>
>
> On Fri, Jul 28, 2023 at 6:35 PM astro <njsg...@gmail.com> wrote:
>
>> Still playing around with improving Tesseract-OCR's results.
>>
>>   One more data point. As mentioned in my previous post, I found that if 
>> there is a dark border at the top of the cropped image the OCR works 
>> much better. With that in mind, I decided to add my own 25 pixel black 
>> border to the top of the cropped image by adding the draw command to the 
>> command line input for ImageMagick (see attached). With this simple 
>> addition I'm able to get 100% conversion in most cases.
>>
>> Cheers
>>   Nor
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to tesseract-oc...@googlegroups.com.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/90a2dfae-8bac-d953-9951-1f52597c82c2%40gmail.com
>> .
>>

