I have a (Java) application that uses Tesseract on English-language docs 
that are often scanned poorly. I cannot change the quality of the scanning. 
However, the significant data in the documents is all in a single 
fixed-width font, all in capital letters. I don't know the name of the 
font; I just call it "old line printer" when I have to refer to it. I've 
attached an example. (Unfortunately the information in the documents is 
confidential, so I cannot post a complete example without major work 
redacting things.)

Besides the understandable substitutions of 0 for O and confusions between 
1 and I, Tesseract has started making other mistakes on some of the scans: 
D for B, 0 for G, and similar errors. As far as I can tell, this only 
happens when the quality of the scan is poor.

I am hoping that the very capable Tesseract engine can be configured or 
trained in some way to reduce these errors. I've seen 
references to "cleaning", to "training", and other things, but don't know 
what would be most appropriate here. I started looking at the documentation 
for training, but realized it was too much work to do on spec; I'm willing 
to do that if it's the best way to improve the tool for my situation, but 
would rather not do it before there's an informed opinion about whether it 
is. 

What should I be looking at doing?
