Please try tesseract 4.0.0beta.1 with languages such as *enm* (English, Middle (1100-1500))
and Fraktur script.

Also, look at the following project from a few years back:
http://emop.tamu.edu/outcomes/Franken-Plus

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Mon, Mar 12, 2018 at 4:32 AM, Guillaume Desforges <aceu...@gmail.com> wrote:

> Hi
>
> I want to try using Tesseract 4 for old manuscript languages ("The Song of
> Roland" and such).
>
> I have looked at
> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
> but the steps are very unclear.
>
> I have an image and a text file with the line content for each line of
> manuscript text. The doc says what to do, but not how.
>
> I first thought I'd need img/box files pairs, but it seems it was for
> Tesseract 3 and is now irrelevant...
>
> So I guess my starting point is here:
> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#tutorial-guide-to-lstmtraining
>
>> There is no tool to create the lstm-recoder directly. Instead there is a
>> new tool, combine_lang_model which takes as input an input_unicharset
>> and script_dir (script_dir points to the langdata directory) and
>> optional word list files. It creates the lstm-recoder from the
>> input_unicharset and creates all the dawgs, if wordlists are provided,
>> putting everything together into a traineddata file.
>
> I don't really get this part. How do I make input_unicharset ? What is
> langdata?
>
> Thanks
>
> Guillaume Desforges
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/fe1d68a2-76ce-4005-98ea-672710365517%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
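
As a rough illustration of the suggestion at the top of this message (trying the existing *enm* model, possibly together with a Fraktur model, before training anything), here is a minimal Python sketch that builds and runs the tesseract command line. It assumes the `tesseract` executable is on PATH and that the corresponding traineddata files are installed. The file names `page.png` and `page_out`, the helper names, and the use of `frk` as the Fraktur model code are illustrative assumptions, not something stated in this thread.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, langs, psm=None):
    """Return the argument list for a single tesseract run.

    langs is a list of traineddata names; tesseract accepts several
    models joined with '+', e.g. "enm+frk".
    """
    cmd = ["tesseract", str(image), str(output_base), "-l", "+".join(langs)]
    if psm is not None:
        # Page segmentation mode, e.g. 7 = treat the image as a single text line.
        cmd += ["--psm", str(psm)]
    return cmd


def run_ocr(image, output_base, langs, psm=None):
    """Run tesseract and return the path of the text file it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image, output_base, langs, psm), check=True)
    # tesseract appends ".txt" to the output base by default.
    return Path(str(output_base) + ".txt")


if __name__ == "__main__":
    # Hypothetical file names; only prints the command, does not run it.
    print(" ".join(build_tesseract_command("page.png", "page_out", ["enm", "frk"])))
```

Comparing the output of a few model combinations on a sample page is a cheap way to judge whether training a new model is needed at all.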