My suggestion would be to post-process the OCR output.

On Mon 2 Apr, 2018, 6:09 PM JP T, <jpt.unterw...@gmail.com> wrote:
> Hi
>
> I don't really understand the consequences of training.
>
> My problem:
> I've got tons of pages with a special format (a "one place study" about
> the historic inhabitants of a town).
>
> tesseract repeatedly fails on a few special words:
> oo (oh-oh) at the start of a line, meaning "wedding", is often read as
> 00 (zero zero)
> Roman numerals 2 and 3 in Arial are taken for lowercase LL, or
> uppercase I plus lowercase LL
> */~ (born about) is read as a percent sign (%)
> ~ is read as -
>
> My scans are of almost perfect quality (I used Fred's scripts), so there
> is nothing more I can do on that side.
> Adding oo to the user words did not help.
>
> Can I use training to solve these, or should I instead write a script
> that fixes the mistakes after OCR?
> The problem is that the OCR needs to know some semantics. The Arial
> letters themselves hardly provide a hint as to which one is correct.
>
> thanks

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com. To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduXnu95%3DKnW5qK1-%2Brmxpt1BZ5pH6z0qi4CtYVzMiSGGVQ%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.
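
To make the post-processing suggestion concrete, here is a minimal sketch in Python that covers the four misreadings listed in the quoted message. It is only an illustration: the rule table, the name `fix_ocr`, and above all the contexts in which each substitution is treated as safe are assumptions, not something confirmed in the thread. In particular, turning "-" back into "~" only when it stands alone in front of a digit is a guess about the page format, and "Ill" is also an ordinary English word, so the rules should be checked against real pages before being trusted.

```python
import re

# Hypothetical correction rules built from the misreadings named in the
# thread. Each entry is (compiled pattern, replacement). The contexts are
# assumptions about the page format and need checking against real output.
RULES = [
    # "00" at the start of a line stands for the wedding marker "oo".
    (re.compile(r"^00(?=\s)", re.MULTILINE), "oo"),
    # "%" stands for the "*/~" (born about) marker.
    (re.compile(r"%"), "*/~"),
    # Roman numeral III read as uppercase I plus two lowercase l.
    (re.compile(r"(?<!\w)Ill(?!\w)"), "III"),
    # Roman numeral II read as two lowercase l.
    (re.compile(r"(?<!\w)ll(?!\w)"), "II"),
    # "~" read as "-": only a free-standing hyphen directly before a digit
    # is changed (assumption; date ranges like "1800 - 1850" would also hit).
    (re.compile(r"(?<!\S)-(?=\s?\d)"), "~"),
]


def fix_ocr(text: str) -> str:
    """Apply every correction rule, in order, to the OCR text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    sample = "00 12.05.1801 Maria\n% 1780, - 03.02.1781\nll. Ehe, Ill. Kind"
    print(fix_ocr(sample))
```

Keeping the rules in a plain list means new misreadings can be added as they turn up, and the order is explicit: the "Ill" rule runs before the "ll" rule so that a misread III is not half-corrected.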