Re: [tesseract-ocr] Want advice on how to proceed with Tesseract and reducing recognition errors

Zdenko Podobny Wed, 24 May 2023 22:09:25 -0700

I tried your example image with tesseract executable:

> tesseract FontExample.png - -c preserve_interword_spaces=1
#*%% DRIVER LICENSE STATUS: CLS C SUSPENDED *xx


LIC                                     LMT COND
CLASS GRP TYP ISSUE DT EXPIR DT CDL DISQ PROB PRIV RESTR    STATUS
I   D 06-16-22 03-22-29 N    N     N     N    N    ID CARD

ENDORS:

As far as I can see, the only problem is with "***" (tesseract has trouble
with repeating symbols); there is no problem with D vs B or 0 vs G.
Check which PSM and traineddata (tessdata, best, or fast) your app uses
(my test used https://github.com/tesseract-ocr/tessdata).
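
One way to rule out a settings mismatch is to have the app pass the PSM and
the tessdata directory explicitly instead of relying on defaults. Below is a
minimal sketch, assuming the Java app can shell out to the tesseract
executable (if it uses a wrapper library instead, the equivalent settings
would have to be set through that library). The class name, the tessdata
path, and the choice of PSM 3 are placeholders for illustration; 3 is, as far
as I know, what the command line uses when --psm is not given.

```java
import java.util.ArrayList;
import java.util.List;

class TesseractCommand {
    // Builds an explicit tesseract command line so that neither the page
    // segmentation mode nor the traineddata location is left to defaults.
    static List<String> build(String image, String tessdataDir, int psm) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(image);
        cmd.add("-");                    // write recognized text to stdout
        cmd.add("--tessdata-dir");       // directory holding eng.traineddata
        cmd.add(tessdataDir);
        cmd.add("-l");
        cmd.add("eng");
        cmd.add("--psm");
        cmd.add(Integer.toString(psm));
        cmd.add("-c");
        cmd.add("preserve_interword_spaces=1");
        return cmd;
    }

    public static void main(String[] args) {
        // "/path/to/tessdata" is a placeholder: point it at a checkout of
        // tessdata, tessdata_best, or tessdata_fast to compare the models.
        List<String> cmd = build("FontExample.png", "/path/to/tessdata", 3);
        System.out.println(String.join(" ", cmd));
        // To actually run it, hand the list to ProcessBuilder:
        //   new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Running the same image through each traineddata set and a couple of PSM
values should show quickly whether the D/B and 0/G errors come from the
settings rather than from the scans.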

Zdenko


On Wed, 24 May 2023 at 1:05, Ralph Cook <rcjava...@gmail.com> wrote:

> I have a (Java) application that uses Tesseract on English-language docs
> that are often scanned poorly. I cannot change the quality of the scanning.
> However, the significant data in the documents is all in a single
> fixed-width font, all in capital letters. I don't know the name of the
> font, I just call it "old line printer" when I have to refer to it. I've
> attached an example. (Unfortunately the information in the documents is
> confidential, so I cannot post a complete example without major work
> redacting things.)
>
> Besides the understandable substitutions of 0 for O and confusions between
> 1 and I, some of the scans have started having Tesseract mistake D for B, 0
> for G, and other similar errors. It only happens, it seems to me, when the
> quality of the scan is poor.
>
> I am hoping that the very capable Tesseract engine can somehow be
> configured or trained or something to reduce these errors. I've seen
> references to "cleaning", to "training", and other things, but don't know
> what would be most appropriate here. I started looking at the documentation
> for training, but realized it was too much work to do on spec; I'm willing
> to do that if it's the best way to improve the tool for my situation, but
> would rather not do it before there's an informed opinion about whether it
> is.
>
> What should I be looking at doing?
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/1c3f7fe3-69c6-4e36-ba81-3b5bbf9eb7bcn%40googlegroups.com
> .
>

