I tried your example image with the tesseract executable:

    > tesseract FontExample.png - -c preserve_interword_spaces=1
    #*%% DRIVER LICENSE STATUS: CLS C SUSPENDED *xx
    LIC LMT COND CLASS GRP TYP ISSUE DT EXPIR DT CDL DISQ PROB PRIV RESTR STATUS I D 06-16-22 03-22-29 N N N N N ID CARD ENDORS:

As far as I can see, the only problem is with "***" (tesseract has trouble
with repeated symbols). There is no problem with D vs B or 0 vs G.

Check which PSM (page segmentation mode) and which traineddata (tessdata,
tessdata_best, or tessdata_fast) your app uses. My test used
https://github.com/tesseract-ocr/tessdata.

Zdenko

On Wed, May 24, 2023 at 1:05, Ralph Cook <rcjava...@gmail.com> wrote:

> I have a (Java) application that uses Tesseract on English-language docs
> that are often scanned poorly. I cannot change the quality of the scanning.
> However, the significant data in the documents is all in a single
> fixed-width font, all in capital letters. I don't know the name of the
> font, I just call it "old line printer" when I have to refer to it. I've
> attached an example. (Unfortunately the information in the documents is
> confidential, so I cannot post a complete example without major work
> redacting things.)
>
> Besides the understandable substitutions of 0 for O and confusions between
> 1 and I, some of the scans have started having Tesseract mistake D for B, 0
> for G, and other similar errors. It only happens, it seems to me, when the
> quality of the scan is poor.
>
> I am hoping that the very capable Tesseract engine can somehow be
> configured or trained or something to reduce these errors. I've seen
> references to "cleaning", to "training", and other things, but don't know
> what would be most appropriate here. I started looking at the documentation
> for training, but realized it was too much work to do on spec; I'm willing
> to do that if it's the best way to improve the tool for my situation, but
> would rather not do it before there's an informed opinion about whether it
> is.
>
> What should I be looking at doing?
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/1c3f7fe3-69c6-4e36-ba81-3b5bbf9eb7bcn%40googlegroups.com.
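
One way to follow the suggestion in the reply (check which PSM and traineddata the app uses) is to pass both explicitly instead of relying on defaults. The sketch below only builds the argument list for the tesseract executable from Java. The class name `TesseractCommand`, the method `build`, the directory `/path/to/tessdata`, and the PSM value 6 (single uniform block of text) are illustrative assumptions, not settings taken from the thread. The flags `--tessdata-dir`, `-l`, `--psm`, and `-c` are standard tesseract CLI options.

```java
import java.util.ArrayList;
import java.util.List;

public class TesseractCommand {
    /**
     * Builds a tesseract CLI argument list with the model directory and
     * page segmentation mode stated explicitly (hypothetical helper).
     */
    public static List<String> build(String imagePath, String tessdataDir, int psm) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(imagePath);
        cmd.add("-");                         // "-" sends recognized text to stdout
        cmd.add("--tessdata-dir");
        cmd.add(tessdataDir);                 // directory holding eng.traineddata
        cmd.add("-l");
        cmd.add("eng");
        cmd.add("--psm");
        cmd.add(Integer.toString(psm));
        cmd.add("-c");
        cmd.add("preserve_interword_spaces=1");
        return cmd;
    }

    public static void main(String[] args) {
        // Print the command so runs with different models or PSM values can be compared.
        // The path and PSM value here are placeholders.
        System.out.println(String.join(" ", build("FontExample.png", "/path/to/tessdata", 6)));
    }
}
```

The resulting list can be handed to `new ProcessBuilder(cmd)` to run the same command from the application. Swapping the directory between the tessdata, tessdata_best, and tessdata_fast models, or changing the PSM value, then shows which setting accounts for a difference in output.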