I'll look into the scaling and denoising. I have no control over the input format. If you mean that I should take the TIFF image I've got and convert it before OCR, please say so.
Yes, the example I gave was not one of the noisy inputs. I've looked through the ones I have handy, and none of them seem to be that bad -- I'll look up some of poor quality and post those as well. Thanks.

On Monday, July 1, 2024 at 4:18:31 AM UTC-4 ger.h...@gmail.com wrote:

> Hi,
>
> More on this later (I seem to still have issues posting with attachments
> here, plus running into a few surprises while doing bulk testing, so this
> is preliminary):
>
> 1. Don't use lossy image file formats if you can, so PNG is better than
> JPEG. From what I see, if you need lossy due to storage limitations, it
> seems webp is better than JPEG. Has to do with the type of noise jpeg
> introduces as "jpeg artifacts".
>
> 2. Scale (resize, use imagemagick or other tool to do this in bulk) the
> input image to approximate 30px capital letter height for each line. That's
> the ballpark, do try a couple of scales near that measure, e.g. test
> results with a set of scaled images 5% off to see which scale is 'optimal'
> for you. It can help to then run an additional test set with scales in a
> 1-2% geometric scale range (i.e. next scale to try is 102% of previous
> smaller test size).
>
> How to check: output both hocr and tsv outputs with character confidence
> reporting turned on (tesseract hocr output for character confidence is
> broken, those numbers only show in tsv), then read those files and check
> both character and word confidence values output by tesseract. Pick the
> scaling+misc preprocessing that gives you the highest numbers there on
> average for your test set.
>
> After that, it depends...
>
> BTW: to my eye your image isn't noisy and you mention noise, hence: you
> got a few rotten ones for us? ;-)
>
> Re noise, preprocessing: what I find helps is killing (masking) all noise
> that is a few pixels away from any character. Particularly when you are
> processing low dpi / jpeg input.
> This must be done before feeding it to
> tesseract as current tesseract does thresholding, etc for detecting the
> spots where the text (words) are at, but the latest engine (LSTM) is fed
> the raw input pixels so any useless noise ends up in there and degrades
> output.
>
> TLDR:
>
> - scale
> - Denoise
> - enhance contrast (not necessary in your case)
> - ... other means to make image easier legible, anything goes ...
> - dictionary, etc. for tesseract or post: I see you've got jargon in there
> (susp, iss, ...) which are not regular English dictionary words, so it
> might help to use a custom dict, but don't have hard data on that one yet
> myself)
>
> On Mon, 1 Jul 2024, 06:21 Ralph Cook, <rcja...@gmail.com> wrote:
>
>> I have an application using Tesseract on documents which are all in
>> English, one font, everything I want to recognize is in capital letters,
>> digits, and punctuation.
>>
>> The quality of the scans is often poor, and I have no control over that.
>> It's sometimes about what you would expect with pages that are scanned,
>> printed, then scanned again; lots of noise, characters not distinct, etc.
>>
>> I don't know what the font is, I call it "Old Line Printer". Here's a
>> sample:
>>
>> [image: Sample text anonymized.png]
>>
>> I have erased some identifying information and scratched some lines where
>> it went.
>>
>> I am not familiar with OCR technology in general, nor with neural
>> networks. I've read in the documentation about how to improve the image,
>> some things about training, some things about how training is likely not
>> necessary, etc. I'm looking for someone to recommend an overall strategy:
>> what should I try first, what is the best 2nd plan, is there likely to be a
>> 3rd, etc. I'm trying not to spend weeks studying the wrong things.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-oc...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/185590fa-c34f-4775-a8a8-9f2bfd18c09en%40googlegroups.com
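
For anyone following the thread, the scale sweep and confidence check described in the quoted advice could be sketched roughly as below. This is only an illustration, not anything posted by the participants: the function names (`scale_series`, `mean_word_confidence`, `run_tesseract_tsv`) are made up for this example, it assumes you have measured the capital letter height of your input by hand, and it only averages word-level confidence from the standard TSV columns (`level`, `conf`), not the per-character values mentioned above.

```python
import csv
import io
import subprocess


def scale_series(cap_height_px, target_px=30.0, step=1.02, steps_each_side=3):
    """Return resize factors around the one that maps cap_height_px to target_px.

    Neighbouring factors differ by a constant ratio (step), i.e. a geometric
    series such as ..., base/1.02, base, base*1.02, ...
    """
    if cap_height_px <= 0:
        raise ValueError("cap_height_px must be positive")
    base = target_px / cap_height_px
    return [base * step ** k for k in range(-steps_each_side, steps_each_side + 1)]


def mean_word_confidence(tsv_text):
    """Average the 'conf' column over word rows (level 5) of tesseract TSV output.

    Rows with conf < 0 are structural (page, block, line) and are skipped.
    Returns None if no word rows are present.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    confs = []
    for row in reader:
        if row.get("level") != "5":
            continue
        try:
            conf = float(row["conf"])
        except (TypeError, ValueError):
            continue  # malformed or truncated row
        if conf >= 0:
            confs.append(conf)
    return sum(confs) / len(confs) if confs else None


def run_tesseract_tsv(image_path):
    """Run tesseract on one image and return its TSV output as text.

    Assumes a 'tesseract' executable is on PATH; not exercised here.
    """
    result = subprocess.run(["tesseract", image_path, "stdout", "tsv"],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

The idea would be to resize the same test pages once per factor from `scale_series` (for example with ImageMagick's `-resize` and a percentage), run each result through `run_tesseract_tsv`, and keep the factor whose pages give the highest `mean_word_confidence` on average.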