My suggestion would be to post-process the OCR output.
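
A minimal sketch of what such a post-processing script could look like, assuming the confusions are the ones listed in the quoted message. The rule table and the name fix_ocr are hypothetical, and every pattern is a guess about the page layout that would need checking against real output. The "~" read as "-" case is left out on purpose, because a blind replacement would also hit genuine hyphens.

```python
import re

# Hypothetical correction rules; each pattern is an assumption about the
# page layout and must be verified against real OCR output.
RULES = [
    # "00" as the first token of a line -> "oo" (the wedding marker)
    (re.compile(r"^([ \t]*)00(?=\s)", re.MULTILINE), r"\1oo"),
    # standalone 2-3 character token made only of "I"/"l" -> Roman numeral
    (re.compile(r"(?<!\w)[Il]{2,3}(?!\w)"), lambda m: "I" * len(m.group())),
    # "%" -> "*/~", assuming a real percent sign never occurs in these pages
    (re.compile(r"%"), "*/~"),
]


def fix_ocr(text: str) -> str:
    """Apply every correction rule in order and return the fixed text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

The rules only touch isolated tokens, so a word such as "Hall" or a number such as "1700" in the middle of a line is left alone. A standalone word "Ill" would be rewritten, which is acceptable only if it never occurs in the source pages.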

On Mon 2 Apr, 2018, 6:09 PM JP T, <jpt.unterw...@gmail.com> wrote:

> Hi
>
> I don't really understand the consequences of training.
>
> My problem:
> I've got tons of pages with a special format. ("one place study" about the
> historic inhabitants of a town)
>
> Tesseract repeatedly fails on a few special words:
> oo (oh-oh) at start of line for "wedding" is often interpreted as 00 (zero
> zero)
> Roman numerals 2 and 3 in Arial font are taken for lowercase "ll" or
> uppercase "I" plus lowercase "ll"
> */~ (birth at about) is read as percent %
> ~ is read as -
>
> My scans are of almost perfect quality (I used Fred's scripts), so there is
> nothing more I can do on that side.
> Adding oo to user words did not help.
>
> Can I use training to solve these, or should I instead write a script that
> fixes the mistakes after OCR?
> The problem is that OCR needs to know some semantics. The Arial letters
> themselves hardly provide a hint as to which one is correct.
>
> thanks
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/5cd68a84-a7d2-4185-91c9-115c9e62d1d4%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/5cd68a84-a7d2-4185-91c9-115c9e62d1d4%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>

