TIFF should be okay (IIRC it is usually not a lossy compression format). The advice about image formats is most relevant when you preprocess your scanned TIFF images: always use a lossless format, e.g. PNG, as the intermediate output format. So when using ImageMagick, for example, do

magick input.tiff -resize WxH image.png
tesseract image.png ........

instead of

magick input.tiff -resize WxH image.jpg
tesseract image.jpg ........
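If you have a whole batch of scans, the same idea in bulk might look roughly like this (a minimal sketch, assuming bash and ImageMagick 7's magick command; the file names and the WxH size are placeholders, and you'd add your usual tesseract options on the last line):

# convert every TIFF in the current directory to a PNG intermediate, then OCR the PNG
for f in *.tiff; do
  magick "$f" -resize WxH "${f%.tiff}.png"
  tesseract "${f%.tiff}.png" "${f%.tiff}"
done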
Cheers,
Ger

On Monday, July 1, 2024 at 1:29:33 PM UTC+2 rcja...@gmail.com wrote:

> I'll look into the scaling and denoising.
>
> I have no control over the input format. If you mean to take the TIFF image I've got and convert it before OCR, please say that.
>
> Yes, the example I gave was not one of the noisy inputs. I've looked through the ones I have handy, and none of them seem to be that bad -- I'll look up some of poor quality and post those as well.
>
> Thanks.
>
> On Monday, July 1, 2024 at 4:18:31 AM UTC-4 ger.h...@gmail.com wrote:
>
>> Hi,
>>
>> More on this later (I still seem to have issues posting with attachments here, plus I'm running into a few surprises while doing bulk testing, so this is preliminary):
>>
>> 1. Don't use lossy image file formats if you can avoid them, so PNG is better than JPEG. From what I see, if you need a lossy format due to storage limitations, WebP seems better than JPEG. This has to do with the kind of noise JPEG introduces as "JPEG artifacts".
>>
>> 2. Scale (resize; use ImageMagick or another tool to do this in bulk) the input image to approximately 30 px capital-letter height for each line. That's the ballpark; do try a couple of scales near that measure, e.g. test a set of scaled images 5% apart to see which scale is 'optimal' for you. It can then help to run an additional test set with scales in a 1-2% geometric range (i.e. the next scale to try is 102% of the previous, smaller test size).
>>
>> How to check: produce both hOCR and TSV output with character confidence reporting turned on (tesseract's hOCR character confidence output is broken; those numbers only show up in the TSV), then read those files and check both the character and word confidence values tesseract reports. Pick the scaling plus other preprocessing that gives you the highest numbers there on average over your test set.
>>
>> After that, it depends...
>>
>> BTW: to my eye your image isn't noisy, yet you mention noise. Hence: got a few rotten ones for us? ;-)
>>
>> Re noise and preprocessing: what I find helps is killing (masking) all noise that is a few pixels or more away from any character, particularly when you are processing low-dpi / JPEG input. This must be done before feeding the image to tesseract: current tesseract does thresholding, etc. to detect the spots where the text (words) are, but the latest engine (LSTM) is fed the raw input pixels, so any useless noise ends up in there and degrades the output.
>>
>> TL;DR:
>>
>> - scale
>> - denoise
>> - enhance contrast (not necessary in your case)
>> - ... other means to make the image easier to read; anything goes ...
>> - dictionary, etc. for tesseract or postprocessing: I see you've got jargon in there (susp, iss, ...) which are not regular English dictionary words, so it might help to use a custom dict, but I don't have hard data on that one yet myself.
>>
>> On Mon, 1 Jul 2024, 06:21 Ralph Cook, <rcja...@gmail.com> wrote:
>>
>>> I have an application using Tesseract on documents which are all in English, in one font; everything I want to recognize is in capital letters, digits, and punctuation.
>>>
>>> The quality of the scans is often poor, and I have no control over that. It's sometimes about what you would expect from pages that are scanned, printed, then scanned again: lots of noise, characters not distinct, etc.
>>>
>>> I don't know what the font is; I call it "Old Line Printer". Here's a sample:
>>>
>>> [image: Sample text anonymized.png]
>>>
>>> I have erased some identifying information and scratched out some lines where it went.
>>>
>>> I am not familiar with OCR technology in general, nor with neural networks. I've read in the documentation about how to improve the image, some things about training, some things about how training is likely not necessary, etc. I'm looking for someone to recommend an overall strategy: what should I try first, what is the best second plan, is there likely to be a third, etc. I'm trying not to spend weeks studying the wrong things.
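For reference, the scale sweep plus confidence check described in the quoted message above might look roughly like this. This is a minimal sketch, assuming bash, ImageMagick 7 and tesseract 4+; the scale percentages and file names are placeholders, and it only looks at the word-level confidences in tesseract's standard 12-column TSV output (column 11 is the confidence, level-5 rows are words):

in=input.tiff
for pct in 90 95 100 105 110; do
  magick "$in" -resize "${pct}%" "scaled_${pct}.png"
  # the shipped "tsv" config writes out_<pct>.tsv with per-word confidences
  tesseract "scaled_${pct}.png" "out_${pct}" tsv 2>/dev/null
  # average the confidence over all word rows for this scale
  awk -F'\t' -v p="$pct" '$1 == 5 && $11 >= 0 { s += $11; n++ }
    END { if (n) printf "scale %s%%: mean word conf %.1f over %d words\n", p, s/n, n }' "out_${pct}.tsv"
done

Pick the scale with the best average, then repeat with a finer 1-2% sweep around it, as suggested above.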
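And one rough way to knock out isolated specks before OCR. This is not exactly the "mask everything a few pixels away from a character" idea described above, just a simple first pass; again a sketch, assuming ImageMagick 7, with file names as placeholders and a kernel size that needs tuning to your stroke width:

# -despeckle removes small salt-and-pepper noise; run it twice for stubborn scans
magick scaled.png -despeckle -despeckle cleaned.png

# or, more aggressively: negate so ink becomes foreground, morphologically open
# to delete blobs smaller than the kernel, then negate back
magick scaled.png -negate -morphology Open Disk:1 -negate cleaned.png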