Re: Franken+ Released -- New Tool For Training Tesseract on Fonts from Page Images

Janusz S. Bien Wed, 11 Dec 2013 21:02:31 -0800

Dear Matthew, thank you for your long letter.

To make a long story short, I'm familiar with the old typography problems, but I have no experience with tesseract training.

I can, however, point you to a report on an experiment in training tesseract on old Polish texts with the same problems that you describe:


http://lib.psnc.pl/publication/428

The texts, both as images and as PAGE files, are publicly available at

http://dl.psnc.pl/activities/projekty/impact/results/

Please note that the trained dataset is also available at

http://dl.psnc.pl/download/tesseract_traineddata.zip

The training used the "classical" rectangular method.

To tell the truth, I don't know how effective the training was, as I'm not aware of any large-scale application of the trained dataset. Using it is one of the user options at the Virtual Transcription Laboratory (http://wlt.synat.pcss.pl/wlt-web/index.xhtml), but I have no idea who uses it or for what.

It would be interesting to retrain tesseract using your approach on the data described above and to compare the results, but I'm afraid nobody has the time and motivation for it.


Best regards and good luck with your project

Janusz


--

Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)

Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department)
[email protected], [email protected], http://fleksem.klf.uw.edu.pl/~jsbien/

--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en