[tesseract-ocr] Suggestions wanted on how to improve recognition

Ralph Cook Sun, 30 Jun 2024 21:21:21 -0700

I have an application using Tesseract on documents that are all in 
English and in a single font. Everything I want to recognize is capital 
letters, digits, and punctuation.


The quality of the scans is often poor, and I have no control over that. 
It's sometimes about what you would expect with pages that are scanned, 
printed, then scanned again; lots of noise, characters not distinct, etc.

I don't know what the font is; I call it "Old Line Printer". Here's a 
sample:

[image: Sample text anonymized.png]

I have erased some identifying information and drawn lines through the 
places where it was.

I am not familiar with OCR technology in general, nor with neural networks. 
I've read in the documentation about how to improve the image, some things 
about training, some things about how training is likely not necessary, 
etc. I'm looking for someone to recommend an overall strategy: what should 
I try first, what is the best 2nd plan, is there likely to be a 3rd, etc. 
I'm trying not to spend weeks studying the wrong things.

