I have a Java application that uses Tesseract on English-language documents that are often scanned poorly. I cannot change the quality of the scanning. However, the significant data in the documents is all in a single fixed-width font, all in capital letters. I don't know the name of the font; I just call it "old line printer" when I have to refer to it. I've attached an example. (Unfortunately the information in the documents is confidential, so I cannot post a complete example without major work redacting things.)
Besides the understandable substitutions of 0 for O and confusions between 1 and I, Tesseract has started mistaking D for B, 0 for G, and making other similar errors on some of the scans. As far as I can tell, it only happens when the quality of the scan is poor. I am hoping that the very capable Tesseract engine can somehow be configured or trained to reduce these errors. I've seen references to "cleaning", to "training", and other things, but don't know what would be most appropriate here. I started looking at the documentation for training, but realized it was too much work to do on spec. I'm willing to do it if it's the best way to improve the tool for my situation, but would rather not start before there's an informed opinion about whether it is. What should I be looking at doing?
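For context, the only configuration idea I can picture trying without training is restricting recognition to capital letters and digits, since that is all the significant data contains. Below is a minimal sketch of that idea, not something I have verified on these scans. It assumes Tesseract's `tessedit_char_whitelist` variable and a Tess4J-style `setVariable(name, value)` call (shown only in a comment; the binding is an assumption). The class name, helper methods, and sample string are made up for illustration, and leaving punctuation out of the whitelist is also an assumption. I realize a whitelist could not by itself fix D/B or 0/G, since all of those characters remain allowed.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class CapsWhitelistSketch {

    // Tesseract's configuration variable for restricting the output character set.
    static final String WHITELIST_VARIABLE = "tessedit_char_whitelist";

    // Builds "A".."Z" followed by "0".."9". Punctuation is deliberately left out;
    // that is an assumption and would need extending for real documents.
    static String buildWhitelist() {
        StringBuilder sb = new StringBuilder();
        for (char c = 'A'; c <= 'Z'; c++) sb.append(c);
        for (char c = '0'; c <= '9'; c++) sb.append(c);
        return sb.toString();
    }

    // Returns the distinct non-whitespace characters in existing OCR output
    // that the whitelist would exclude, in order of first appearance.
    static Set<Character> outsideWhitelist(String ocrText, String whitelist) {
        Set<Character> result = new LinkedHashSet<>();
        for (char c : ocrText.toCharArray()) {
            if (!Character.isWhitespace(c) && whitelist.indexOf(c) < 0) {
                result.add(c);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String whitelist = buildWhitelist();
        // With Tess4J (assumed binding) the value would be passed along the lines of:
        //   tesseract.setVariable(WHITELIST_VARIABLE, whitelist);
        System.out.println(WHITELIST_VARIABLE + " " + whitelist);

        // Made-up sample line, not taken from the real documents.
        System.out.println(outsideWhitelist("ACCT N0. l2345 Bldg", whitelist));
    }
}
```

Running the helper over OCR output I already have would at least show how many stray lowercase or punctuation characters a whitelist could rule out before I invest in anything heavier.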