Re: [tesseract-ocr] Suggestions wanted on how to improve recognition

Ralph Cook Mon, 01 Jul 2024 04:29:37 -0700

I'll look into the scaling and denoising. 

I have no control over the input format. If you mean to take the TIFF image 
I've got and convert it before OCR, please say that.


Yes, the example I gave was not one of the noisy inputs. I've looked 
through the ones I have handy, and none of them seem to be that bad -- I'll 
look up some of poor quality and post those as well.

Thanks.

On Monday, July 1, 2024 at 4:18:31 AM UTC-4 ger.h...@gmail.com wrote:

> Hi, 
>
> More on this later (I seem to still have issues posting with attachments 
> here, plus running into a few surprises while doing bulk testing, so this 
> is preliminary):
>
> 1. Don't use lossy image file formats if you can avoid them, so PNG is 
> better than JPEG. From what I see, if you need lossy due to storage 
> limitations, webp seems to be better than JPEG. This has to do with the 
> type of noise JPEG introduces ("JPEG artifacts").
>
> 2. Scale (resize; use imagemagick or another tool to do this in bulk) the 
> input image to approximately 30px capital letter height for each line. 
> That's the ballpark; do try a couple of scales near that measure, e.g. test 
> results with a set of scaled images 5% off to see which scale is 'optimal' 
> for you. It can help to then run an additional test set with scales in a 
> 1-2% geometric scale range (i.e. the next scale to try is 102% of the 
> previous, smaller test size).
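
A rough sketch of the scale sweep described above, not taken from the thread. It assumes you have measured the capital-letter height in the source scan (the 20 px used here is a made-up figure) and that ImageMagick 7's `magick` command is installed (ImageMagick 6 uses `convert` instead). The function names and output file names are hypothetical.

```python
def candidate_scales(cap_height_px, target_px=30.0, step=1.05, steps_each_side=2):
    """Return scale factors centred on target_px / cap_height_px.

    Neighbouring factors differ by a constant ratio (`step`), i.e. the
    sweep is geometric: 1.05 for the coarse pass, 1.01-1.02 for the fine one.
    """
    centre = target_px / cap_height_px
    return [centre * step ** k for k in range(-steps_each_side, steps_each_side + 1)]


def resize_commands(src, scales):
    """Build one ImageMagick command (as an argument list) per scale factor."""
    commands = []
    for scale in scales:
        percent = f"{scale * 100:.1f}%"
        out_name = f"scaled_{scale * 100:.1f}.png"  # PNG keeps the result lossless
        commands.append(["magick", src, "-resize", percent, out_name])
    return commands


if __name__ == "__main__":
    # Assumption: capitals are about 20 px tall in the original scan.
    for command in resize_commands("page.tif", candidate_scales(20)):
        print(" ".join(command))
```

Each printed line can be run in a shell or passed to `subprocess.run`. For the finer second pass, call `candidate_scales` again with `step=1.02` and a `target_px` chosen so that the sweep is centred on the best scale from the first pass.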
>
> How to check: output both hocr and tsv outputs with character confidence 
> reporting turned on (tesseract hocr output for character confidence is 
> broken, those numbers only show in tsv), then read those files and check 
> both character and word confidence values output by tesseract. Pick the 
> scaling+misc preprocessing that gives you the highest numbers there on 
> average for your test set.
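
One possible way to do the averaging, as a sketch rather than anything posted in the thread. It reads the TSV that `tesseract input.png output tsv` writes and averages the word-level `conf` column only; character-level confidences are not covered here. The function name and the sample rows are made up for illustration.

```python
import csv
import io


def mean_word_confidence(tsv_text):
    """Average the 'conf' column over the word rows of a Tesseract TSV dump."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    confidences = []
    for row in reader:
        # Page, block, paragraph and line rows carry conf = -1 and no text.
        conf = float(row.get("conf") or -1)
        text = (row.get("text") or "").strip()
        if conf >= 0 and text:
            confidences.append(conf)
    return sum(confidences) / len(confidences) if confidences else 0.0


# Made-up sample with the standard Tesseract TSV header.
SAMPLE_TSV = "\n".join([
    "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num"
    "\tleft\ttop\twidth\theight\tconf\ttext",
    "1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t",
    "5\t1\t1\t1\t1\t1\t10\t10\t50\t20\t96.5\tSUSP",
    "5\t1\t1\t1\t1\t2\t70\t10\t40\t20\t83.5\tISS",
])

if __name__ == "__main__":
    print(mean_word_confidence(SAMPLE_TSV))
```

Run it once per scaled or preprocessed variant of each test image and compare the averages across variants.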
>
>
> After that, it depends...
>
> BTW: to my eye your image isn't noisy and you mention noise, hence: you 
> got a few rotten ones for us?  ;-)
>
>
> Re noise, preprocessing: what I find helps is killing (masking) all noise 
> that is more than a few pixels away from any character, particularly when 
> you are processing low dpi / jpeg input. This must be done before feeding 
> the image to tesseract: current tesseract does thresholding, etc. to detect 
> the spots where the text (words) is, but the latest engine (LSTM) is fed 
> the raw input pixels, so any useless noise ends up in there and degrades 
> output.
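
A minimal sketch of that kind of masking, under several assumptions that are not from the thread: the page has already been binarized into a grid of 0 (background) and 1 (ink), a "character" is any connected blob of at least `min_char_pixels` ink pixels, and "a few pixels" means `keep_distance`. Both thresholds and all names are hypothetical; a real pipeline would more likely use an image library than pure Python.

```python
from collections import deque


def components(img):
    """Return a list of 8-connected ink blobs, each a list of (row, col)."""
    height, width = len(img), len(img[0])
    seen = [[False] * width for _ in range(height)]
    blobs = []
    for r in range(height):
        for c in range(width):
            if img[r][c] != 1 or seen[r][c]:
                continue
            blob, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            blobs.append(blob)
    return blobs


def mask_isolated_specks(img, min_char_pixels=6, keep_distance=2):
    """Return a copy of img with small blobs erased unless a large blob is nearby."""
    blobs = components(img)
    char_pixels = {p for blob in blobs if len(blob) >= min_char_pixels for p in blob}
    reach = range(-keep_distance, keep_distance + 1)
    out = [row[:] for row in img]
    for blob in blobs:
        if len(blob) >= min_char_pixels:
            continue  # large enough to be treated as (part of) a character
        near_character = any((y + dy, x + dx) in char_pixels
                             for y, x in blob for dy in reach for dx in reach)
        if not near_character:
            for y, x in blob:
                out[y][x] = 0  # erase the isolated speck
    return out


# Tiny made-up example: one large blob, a dot close to it, and a stray speck.
IMG = [
    [1, 1, 0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 1],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
]
CLEANED = mask_isolated_specks(IMG)
```

Small blobs close to a large one are kept on purpose, so that periods, commas and broken character fragments survive; only specks with no large blob within `keep_distance` pixels are erased.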
>
>
> TLDR:
>
> - scale
> - denoise
> - enhance contrast (not necessary in your case)
> - ... other means to make the image more legible, anything goes ...
> - dictionary, etc. for tesseract or post: I see you've got jargon in there 
> (susp, iss, ...) which are not regular English dictionary words, so it 
> might help to use a custom dict (but I don't have hard data on that one 
> yet myself)
>
>
>
>
> On Mon, 1 Jul 2024, 06:21 Ralph Cook, <rcja...@gmail.com> wrote:
>
>> I have an application using Tesseract on documents which are all in 
>> English, one font, everything I want to recognize is in capital letters, 
>> digits, and punctuation. 
>>
>> The quality of the scans is often poor, and I have no control over that. 
>> It's sometimes about what you would expect with pages that are scanned, 
>> printed, then scanned again; lots of noise, characters not distinct, etc.
>>
>> I don't know what the font is, I call it "Old Line Printer". Here's a 
>> sample:
>>
>> [image: Sample text anonymized.png]
>>
>> I have erased some identifying information and scratched some lines where 
>> it went.
>>
>> I am not familiar with OCR technology in general, nor with neural 
>> networks. I've read in the documentation about how to improve the image, 
>> some things about training, some things about how training is likely not 
>> necessary, etc. I'm looking for someone to recommend an overall strategy: 
>> what should I try first, what is the best 2nd plan, is there likely to be a 
>> 3rd, etc. I'm trying not to spend weeks studying the wrong things.
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to tesseract-oc...@googlegroups.com.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/185590fa-c34f-4775-a8a8-9f2bfd18c09en%40googlegroups.com
>>  
>> <https://groups.google.com/d/msgid/tesseract-ocr/185590fa-c34f-4775-a8a8-9f2bfd18c09en%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>

