Hi All,

The Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas A&M University, as part of its Early Modern OCR Project (eMOP <http://emop.tamu.edu/>), has created a new tool called Franken+ that creates font training for the Tesseract OCR engine from page images. This contrasts with Tesseract's documented method <http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3> of font training, which involves using a word processing program with a modern font. Franken+ has now been released for beta testing, and we invite anyone who's interested to give it a try and provide feedback.
Franken+ works in conjunction with PRImA's open source Aletheia tool <http://www.primaresearch.org/tools.php> and allows users to quickly identify one or more idealized forms of each glyph found on a set of page images. These identified forms are then used to generate a set of Franken-page images that match the page characteristics documented in Tesseract's training instructions, but use a font from an actual early modern printed document. Franken+ lets you create Tesseract box files, but it will also guide you through the entire Tesseract training process, producing a .traineddata file, and even allows you to identify and OCR documents using that training. In addition, Franken+ makes it easy to combine training from multiple fonts into one training set.

For eMOP we are using Franken+ to create training for Tesseract from page images of early modern printed works, but we think it can be used just as effectively to train Tesseract on images of any kind of font that's not readily available via a word processor. For example, I've seen posts in this group about wanting to train Tesseract to read the signs on the front of buses.

You can find out more about Franken+ at http://emop.tamu.edu/node/54 and http://dh-emopweb.tamu.edu/Franken+/. The code is also available as open source at https://github.com/idhmc-tamu/eMOP/tree/master/Franken%2B.

Thanks,
Matt Christy

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en

