Zdenko, I've had a look at the sample code (in C++!) and tried it out on my files. It clearly works well at cleaning the pages up, but it does no better on my 'empty' pages than my histogram approach. Unfortunately, I also get (slightly) worse text recognition accuracy than with the default Tesseract processing, so for the moment I don't think I will take this approach. Many thanks for the input, though.
Iain

On Tuesday, August 6, 2024 at 3:56:08 PM UTC+1 Iain Downs wrote:

> Thanks for this, Zdenko. I've had a look at some resources on 'greyscale
> closing' and kind of get it. However, my app is currently in C# and the
> library I'm using does all the pix functions. I will try to build the
> sample in C++ and see what it does.
>
> Iain
>
> On Sunday, August 4, 2024 at 12:44:41 PM UTC+1 zdenop wrote:
>
>> tesseract unnamed.jpg -
>> Estimating resolution as 182
>>
>> i.e. no recognized words... So the problem could be in the parameters you
>> used for OCR...
>>
>> Before OCR I suggest image preprocessing and maybe the detection of empty
>> pages.
>> Have a look at the leptonica example for normalizing uneven illumination
>> (pixBackgroundNorm in
>> https://github.com/DanBloomberg/leptonica/blob/master/prog/livre_adapt.c)
>> and then binarize the image.
>> I think with some more "aggressive" parameters you can get a clean empty
>> page, so you will not need to modify your OCR parameters...
>>
>> Zdenko
>>
>>
>> On Sun, 4 Aug 2024 at 13:22, Iain Downs <ia...@idcl.co.uk> wrote:
>>
>>> In the event that anyone else has a similar issue, this is how I
>>> approached it.
>>>
>>> Firstly, make a histogram of the number of pixels with each intensity
>>> (so an array of 256 numbers).
>>>
>>> When you inspect this you get results like the below.
>>>
>>> [image: Finding empty pages.png]
>>>
>>> This is after a little smoothing and taking the log of the values.
>>>
>>> You can see that the properly blank pages show few or no very dark
>>> (black) pixels, whereas the pages with some text, even a small amount,
>>> have a fair number.
>>>
>>> I simply set a cutoff level (in this case 1) and a cutoff intensity (in
>>> my case 80). If the log of the smoothed histogram first reaches 1 at an
>>> intensity below 80, the page has text; otherwise it is blank.
>>>
>>> You can also see the problem which tesseract has (with default
>>> binarisation) in that the intensity is distinctly bimodal. I think this is
>>> due to bleedthrough from the reverse of the page. Of course, that is
>>> essentially what Otsu uses to pick out 'black' from 'white'.
>>>
>>> Iain
>>> On Tuesday, July 16, 2024 at 5:38:02 PM UTC+1 Iain Downs wrote:
>>>
>>>> I'm working on processing scanned paperback books with tesseract (C++
>>>> API at the moment). One issue I've found is that when a page has little
>>>> or no text, tesseract gets overkeen and interprets the noise as text.
>>>>
>>>> The image below is the raw page. In this case it's the inside front
>>>> cover of a book.
>>>> [image: HookRawPage.jpg]
>>>> This is the image after tesseract has processed it (binarization) and
>>>> before the character recognition.
>>>> [image: HookPostProcessed.jpg]
>>>>
>>>> tesseract suggests that there are 160 or so words (by some definition
>>>> of word!) on this page as per the attached (Hook02Small.txt).
>>>>
>>>> This also happens on pages which DO contain text, but only a small
>>>> amount. I suspect that the binarization (possibly Otsu?) is to blame.
>>>> I can probably do something to detect entirely blank pages, but I am
>>>> less sure what to do with mainly blank pages.
>>>>
>>>> Any suggestions most welcome!
>>>>
>>>> Iain
>>>>
>>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/e78f6620-4019-4e36-95cf-0aad5194313dn%40googlegroups.com
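
For anyone wanting to try the histogram test from the 4 August message, here is a minimal C++ sketch of it. It is not Iain's actual code. The thread does not say which smoothing or which logarithm was used, so the moving average (radius 2), the base-10 log and the +1 offset (to avoid log of zero) are all assumptions, as are the function names. Only the 256-bin histogram, the cutoff level of 1 and the cutoff intensity of 80 come from the message. The input is assumed to be 8-bit greyscale pixel values.

```cpp
#include <array>
#include <cmath>
#include <cstdint>
#include <vector>

// Count how many pixels have each of the 256 possible grey levels.
std::array<double, 256> intensityHistogram(const std::vector<std::uint8_t>& pixels) {
    std::array<double, 256> hist{};
    for (std::uint8_t p : pixels) hist[p] += 1.0;
    return hist;
}

// Moving-average smoothing over 2*radius+1 bins (ASSUMPTION: the thread
// only says "a little smoothing"). The window is truncated at both ends.
std::array<double, 256> smoothHistogram(const std::array<double, 256>& hist, int radius) {
    std::array<double, 256> out{};
    for (int i = 0; i < 256; ++i) {
        double sum = 0.0;
        int n = 0;
        for (int j = i - radius; j <= i + radius; ++j) {
            if (j < 0 || j > 255) continue;
            sum += hist[j];
            ++n;
        }
        out[i] = sum / n;
    }
    return out;
}

// Darkest intensity at which log10(smoothed count + 1) reaches cutoffLevel,
// or 256 if it never does. (ASSUMPTION: base-10 log with a +1 offset.)
int firstIntensityAtLevel(const std::array<double, 256>& smoothed, double cutoffLevel) {
    for (int i = 0; i < 256; ++i) {
        if (std::log10(smoothed[i] + 1.0) >= cutoffLevel) return i;
    }
    return 256;
}

// A page is treated as blank when the log-smoothed histogram first reaches
// the cutoff level at or above the cutoff intensity, i.e. there are
// too few very dark pixels to be text.
bool looksBlank(const std::vector<std::uint8_t>& pixels,
                double cutoffLevel = 1.0,
                int cutoffIntensity = 80,
                int radius = 2) {
    const auto smoothed = smoothHistogram(intensityHistogram(pixels), radius);
    return firstIntensityAtLevel(smoothed, cutoffLevel) >= cutoffIntensity;
}
```

To use it, copy the greyscale page into a std::vector<std::uint8_t> and skip OCR when looksBlank returns true. The two cutoffs will almost certainly need tuning per scanner and paper stock, and a page-size-dependent level may work better than a fixed one, since the raw counts scale with the number of pixels.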