Running `tesseract unnamed.jpg -` reports "Estimating resolution as 182" and e.g. no recognized word... So the problem could be in the parameters you used for OCR...
Before OCR I suggest image preprocessing and maybe the detection of empty pages. Have a look at the leptonica example for normalizing uneven illumination (pixBackgroundNorm in https://github.com/DanBloomberg/leptonica/blob/master/prog/livre_adapt.c) and then binarize the image. I think with some more "aggressive" parameters you can get a clean empty page, so you will not need to modify your OCR parameters...

Zdenko

On Sun, 4 Aug 2024 at 13:22, Iain Downs <i...@idcl.co.uk> wrote:

> In the event that anyone else has a similar issue, this is how I approached it.
>
> Firstly, make a histogram of the number of pixels at each intensity (so an array of 256 numbers).
>
> When you inspect this you get results like the below.
>
> [image: Finding empty pages.png]
>
> This is after a little smoothing and taking the log of the values.
>
> You can see that the properly blank pages show few or no very dark (black) pixels, whereas pages with some text, even a small amount, have a fair number.
>
> I simply set a cutoff level (in this case 1) and a cutoff intensity (in my case 80), so provided the first peak of 1 of the log-smoothed intensity is below 80 it is text; otherwise it is blank.
>
> You can also see the problem which tesseract has (with default binarisation) in that the intensity is distinctly bimodal. I think this is due to bleedthrough from the reverse of the page. Of course that is essentially what Otsu uses to pick out 'black' from 'white'.
>
> Iain
>
> On Tuesday, July 16, 2024 at 5:38:02 PM UTC+1 Iain Downs wrote:
>
>> I'm working on processing scanned paperback books with tesseract (C++ API at the moment). One issue I've found is that when a page has little or no text, tesseract gets overkeen and interprets the noise as text.
>>
>> The image below is the raw page. In this case it's the inside front cover of a book.
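Iain's blank-page heuristic above can be sketched in self-contained C++. The cutoff level of 1 and cutoff intensity of 80 come from his post; the moving-average smoothing radius and the log1p transform are assumptions, since the exact smoothing he used isn't given:

```cpp
#include <array>
#include <cmath>
#include <cstdint>
#include <vector>

// Build a 256-bin intensity histogram from 8-bit grayscale pixels.
std::array<double, 256> intensityHistogram(const std::vector<uint8_t>& pixels) {
    std::array<double, 256> hist{};
    for (uint8_t p : pixels) hist[p] += 1.0;
    return hist;
}

// Smooth with a simple moving average, then take log(1 + x).
std::array<double, 256> logSmooth(const std::array<double, 256>& hist,
                                  int radius = 2) {
    std::array<double, 256> out{};
    for (int i = 0; i < 256; ++i) {
        double sum = 0.0;
        int n = 0;
        for (int j = i - radius; j <= i + radius; ++j) {
            if (j >= 0 && j < 256) { sum += hist[j]; ++n; }
        }
        out[i] = std::log1p(sum / n);
    }
    return out;
}

// A page counts as "text" if the log-smoothed histogram first reaches the
// cutoff level at an intensity darker than cutoffIntensity; otherwise blank.
bool hasText(const std::vector<uint8_t>& pixels,
             double cutoffLevel = 1.0, int cutoffIntensity = 80) {
    auto curve = logSmooth(intensityHistogram(pixels));
    for (int i = 0; i < 256; ++i) {
        if (curve[i] >= cutoffLevel) return i < cutoffIntensity;
    }
    return false;  // histogram never reaches the level: treat as blank
}
```

With the tesseract C++ API you would run this on the grayscale page before calling the recognizer and simply skip pages for which `hasText` returns false.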
>> [image: HookRawPage.jpg]
>>
>> This is the image after tesseract has processed it (binarization) and before the character recognition.
>>
>> [image: HookPostProcessed.jpg]
>>
>> tesseract suggests that there are 160 or so words (by some definition of word!) on this page, as per the attached (Hook02Small.txt).
>>
>> This also happens on pages which DO contain text, but a small amount. I suspect that the binarization (possibly Otsu?) is to blame. I can probably do something to detect entirely blank pages, but I am less sure what to do with mainly blank pages.
>>
>> Any suggestions most welcome!
>>
>> Iain
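For reference, global Otsu binarization picks the threshold that maximizes the between-class variance of the intensity histogram, so on a near-blank page whose second mode is bleedthrough rather than ink, the threshold lands between paper white and the bleedthrough, and the bleedthrough comes out "black". Below is a minimal sketch of the standard Otsu computation (an illustration of the algorithm, not tesseract's actual implementation):

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Global Otsu threshold over a 256-bin histogram: choose t that maximizes
// the between-class variance wB * wF * (muB - muF)^2. Pixels with value <= t
// are classed as "black" (foreground), the rest as "white" (background).
int otsuThreshold(const std::vector<uint8_t>& pixels) {
    std::array<double, 256> hist{};
    for (uint8_t p : pixels) hist[p] += 1.0;

    const double total = static_cast<double>(pixels.size());
    double sumAll = 0.0;
    for (int i = 0; i < 256; ++i) sumAll += i * hist[i];

    double sumB = 0.0, wB = 0.0, best = -1.0;
    int bestT = 0;
    for (int t = 0; t < 256; ++t) {
        wB += hist[t];                 // weight of the "black" class
        if (wB == 0.0) continue;
        double wF = total - wB;        // weight of the "white" class
        if (wF == 0.0) break;
        sumB += t * hist[t];
        double muB = sumB / wB;
        double muF = (sumAll - sumB) / wF;
        double between = wB * wF * (muB - muF) * (muB - muF);
        if (between > best) { best = between; bestT = t; }
    }
    return bestT;
}
```

On an image with a true ink mode and a paper mode this picks a sensible split; on a mostly blank page with bleedthrough, the two modes it separates are both "paper", which matches the over-detection described above.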