In the event that anyone else has a similar issue, this is how I approached it.
Firstly, make a histogram of the number of pixels at each intensity (so an array of 256 numbers). When you inspect this you get results like the below.

[image: Finding empty pages.png]

This is after a little smoothing and taking the log of the values. You can see that the properly blank pages show few or no very dark (black) pixels, whereas the pages with some text, even a small amount, have a fair number.

I simply set a cutoff level (in this case 1) and a cutoff intensity (in my case 80): if the log of the smoothed histogram first reaches 1 at an intensity below 80, the page has text; otherwise it is blank.

You can also see the problem which tesseract has (with default binarisation), in that the intensity is distinctly bimodal. I think this is due to bleed-through from the reverse of the page. Of course, that bimodality is essentially what Otsu uses to pick out 'black' from 'white'.

Iain

On Tuesday, July 16, 2024 at 5:38:02 PM UTC+1 Iain Downs wrote:

> I'm working on processing scanned paperback books with tesseract (C++ API
> at the moment). One issue I've found is that when a page has little or no
> text, tesseract gets overkeen and interprets the noise as text.
>
> The image below is the raw page. In this case it's the inside front cover
> of a book.
> [image: HookRawPage.jpg]
> This is the image after tesseract has processed it (binarization) and
> before the character recognition.
> [image: HookPostProcessed.jpg]
>
> tesseract suggests that there are 160 or so words (by some definition of
> word!) on this page, as per the attached (Hook02Small.txt).
>
> This also happens on pages which DO contain text, but only a small amount. I
> suspect that the binarization (possibly Otsu?) is to blame. I can probably
> do something to detect entirely blank pages, but I'm less sure what to do with
> mainly blank pages.
>
> Any suggestions most welcome!
>
> Iain
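For anyone who wants to try the histogram test described at the top of this message, here is a minimal C++ sketch. It is not the original code, and several details are assumptions, since the post does not specify them: the smoothing is taken to be a simple moving average (radius 2), the log is taken to be `log10(1 + count)`, and the page is assumed to be already available as a flat buffer of 8-bit grey values (getting it there from the scan is not shown). The function names `intensityHistogram`, `smooth`, `firstIntensityAtLevel` and `pageHasText` are made up for illustration. The cutoff level 1 and cutoff intensity 80 come from the post and will probably need tuning for other scans.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Count how many pixels have each 8-bit grey level (0 = black, 255 = white).
std::array<double, 256> intensityHistogram(const std::vector<std::uint8_t>& pixels) {
    std::array<double, 256> hist{};
    for (std::uint8_t p : pixels) {
        hist[p] += 1.0;
    }
    return hist;
}

// ASSUMPTION: "a little smoothing" is modelled as a moving average over
// 2*radius+1 bins, with the window clipped at both ends of the histogram.
std::array<double, 256> smooth(const std::array<double, 256>& hist, int radius) {
    assert(radius >= 0);
    std::array<double, 256> out{};
    for (int i = 0; i < 256; ++i) {
        double sum = 0.0;
        int n = 0;
        for (int j = i - radius; j <= i + radius; ++j) {
            if (j >= 0 && j < 256) {
                sum += hist[j];
                ++n;
            }
        }
        out[i] = sum / n;
    }
    return out;
}

// Scan from black towards white and return the first intensity at which the
// log of the smoothed count reaches `level`; returns 256 if it never does.
// ASSUMPTION: log10(1 + count) is used so that empty bins map to 0.
int firstIntensityAtLevel(const std::array<double, 256>& smoothed, double level) {
    for (int i = 0; i < 256; ++i) {
        if (std::log10(1.0 + smoothed[i]) >= level) {
            return i;
        }
    }
    return 256;
}

// A page counts as "text" if the log-smoothed histogram reaches the cutoff
// level at an intensity darker than the cutoff intensity.
bool pageHasText(const std::vector<std::uint8_t>& pixels,
                 double cutoffLevel = 1.0,
                 int cutoffIntensity = 80,
                 int radius = 2) {
    const auto smoothed = smooth(intensityHistogram(pixels), radius);
    return firstIntensityAtLevel(smoothed, cutoffLevel) < cutoffIntensity;
}
```

A caller would pass the grey values of one page to `pageHasText` and skip OCR for that page when it returns `false`. Under these assumptions, a handful of isolated dark specks stays below the cutoff level, while even a small amount of real text pushes the dark end of the histogram above it.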

