tesseract unnamed.jpg -
Estimating resolution as 182

i.e. no words were recognized... So the problem could be in the parameters
you used for OCR...

Before OCR I suggest image preprocessing, and maybe detection of empty
pages.
Have a look at the Leptonica example for normalizing uneven illumination
(pixBackgroundNorm in
https://github.com/DanBloomberg/leptonica/blob/master/prog/livre_adapt.c)
and then binarize the image.
I think with some more "aggressive" parameters you can get a clean empty
page, so you will not need to modify your OCR parameters...
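
Something along these lines (an untested sketch; pixBackgroundNormSimple
and the fixed threshold of 120 are just one possible combination, the full
adaptive version is in livre_adapt.c):

// Background normalization + binarization with Leptonica, roughly
// following the livre_adapt.c idea.  The threshold here is an
// illustrative guess, not a tuned value.
#include <leptonica/allheaders.h>

PIX *preprocess_page(const char *path)
{
    PIX *pixs = pixRead(path);
    if (!pixs) return nullptr;

    // Work on an 8 bpp grayscale version of the page.
    PIX *pixg = pixConvertTo8(pixs, 0);

    // Normalize uneven illumination; the background is pushed towards
    // a uniform light gray.
    PIX *pixn = pixBackgroundNormSimple(pixg, nullptr, nullptr);

    // Binarize with a fairly "aggressive" fixed threshold so that
    // bleed-through and scanner noise fall into the background.
    PIX *pixb = pixThresholdToBinary(pixn, 120);

    pixDestroy(&pixs);
    pixDestroy(&pixg);
    pixDestroy(&pixn);
    return pixb;   // pass this to TessBaseAPI::SetImage() or save it
}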

Zdenko


On Sun, 4 Aug 2024 at 13:22, Iain Downs <i...@idcl.co.uk> wrote:

> In the event that anyone else has a similar issue, this is how I
> approached it.
>
> Firstly, make a histogram of the number of pixels with each intensity (so
> an array of 256 numbers).
>
> When you inspect this you get results like those below.
>
> [image: Finding empty pages.png]
>
> This is after a little smoothing and taking the log of the values.
>
> You can see that the properly blank pages show few or no very dark
> (black) pixels, whereas the pages with some text, even a small amount,
> have a fair number.
>
> I simply set a cutoff level (in this case 1) and a cutoff intensity (in my
> case 80): if the log-smoothed histogram first reaches the cutoff level of 1
> at an intensity below 80, the page contains text; otherwise it is blank.
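>
> In code the check ends up as something like this (the 5-bin smoothing
> window here is illustrative; only the two cutoffs come from the
> description above):
>
> // Decide whether a page contains text from its grayscale histogram.
> // Cutoff level = 1 (on log-smoothed counts), cutoff intensity = 80.
> #include <leptonica/allheaders.h>
> #include <cmath>
> #include <vector>
>
> bool pageHasText(PIX *pix8 /* 8 bpp grayscale page */)
> {
>     NUMA *hist = pixGetGrayHistogram(pix8, 1);   // 256 bins
>     std::vector<double> logSmoothed(256, 0.0);
>     for (int i = 0; i < 256; ++i) {
>         // Simple moving average over neighbouring bins, then log.
>         double sum = 0.0;
>         int n = 0;
>         for (int j = i - 2; j <= i + 2; ++j) {
>             if (j < 0 || j > 255) continue;
>             l_float32 v = 0;
>             numaGetFValue(hist, j, &v);
>             sum += v;
>             ++n;
>         }
>         logSmoothed[i] = std::log10(1.0 + sum / n);
>     }
>     numaDestroy(&hist);
>
>     // First intensity where the log-smoothed count reaches the cutoff.
>     for (int i = 0; i < 256; ++i) {
>         if (logSmoothed[i] >= 1.0)
>             return i < 80;   // dark ink present -> text page
>     }
>     return false;            // cutoff never reached -> treat as blank
> }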
>
> You can also see the problem which tesseract has (with its default
> binarisation): the intensity distribution is distinctly bimodal.  I think
> this is due to bleed-through from the reverse of the page.  Of course,
> that is essentially what Otsu uses to pick out 'black' from 'white'.
>
> Iain
> On Tuesday, July 16, 2024 at 5:38:02 PM UTC+1 Iain Downs wrote:
>
>> I'm working on processing scanned paperback books with tesseract (the C++
>> API at the moment).  One issue I've found is that when a page has little or
>> no text, tesseract gets overkeen and interprets the noise as text.
>>
>> The image below is the raw page.  In this case it's the inside front
>> cover of a book.
>> [image: HookRawPage.jpg]
>> This is the image after tesseract has processed it (binarization) and
>> before the character recognition.
>> [image: HookPostProcessed.jpg]
>>
>> tesseract suggests that there are 160 or so words (by some definition of
>> word!) on this page as per the attached (Hook02Small.txt).
>>
>> This also happens on pages which DO contain text, but only a small amount.
>> I suspect that the binarization (possibly Otsu?) is to blame.  I can
>> probably do something to detect entirely blank pages, but I'm less sure
>> what to do with mainly blank pages.
>>
>> Any suggestions most welcome!
>>
>> Iain
>>
>>
