Franken+ Released -- New Tool For Training Tesseract on Fonts from Page Images

matthew christy Fri, 06 Dec 2013 12:11:44 -0800

Hi All,

The Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas 
A&M University, as part of its Early Modern OCR Project 
(eMOP <http://emop.tamu.edu/>), has created a new tool, called Franken+, 
that provides a way to create font training for the Tesseract OCR engine 
using page images. This is in contrast to Tesseract's documented method 
<http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3> of font 
training, which involves using a word processing program with a modern 
font. Franken+ has now been released for beta testing, and we invite 
anyone who's interested to give it a try and provide feedback.


Franken+ works in conjunction with PRImA's open source Aletheia tool 
<http://www.primaresearch.org/tools.php> and allows users to easily and 
quickly identify one or more idealized forms of each glyph found on a set 
of page images. These identified forms are then used to generate a set of 
Franken-page images matching the page characteristics documented in 
Tesseract's training instructions, but with a font used in an actual early 
modern printed document. Franken+ allows you to create Tesseract box files, 
but it will also guide you through the entire Tesseract training process, 
producing a .traineddata file, and even allows you to identify and OCR 
documents using that training. In addition, Franken+ makes it easy to 
combine training from multiple fonts into one training set.
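
For anyone who hasn't worked with box files before, here is a rough sketch 
of reading one in Python. It assumes the plain Tesseract 3 box format, one 
glyph per line as "glyph left bottom right top page" with the origin at the 
bottom-left of the image. The names Box, parse_box_line and parse_box_text 
and the sample coordinates are made up for illustration; none of this is 
taken from the Franken+ code.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """One glyph entry from a Tesseract 3 style box file (illustrative type)."""
    glyph: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    # Split from the right so the five trailing integers are always
    # separated from the glyph text, whatever the glyph is.
    parts = line.rstrip("\r\n").rsplit(" ", 5)
    if len(parts) != 6 or not parts[0]:
        raise ValueError("not a box line: %r" % line)
    glyph = parts[0]
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return Box(glyph, left, bottom, right, top, page)


def parse_box_text(text: str) -> List[Box]:
    # Skip blank lines; every other line is expected to be a box entry.
    return [parse_box_line(ln) for ln in text.splitlines() if ln.strip()]


# Made-up sample data, only to show the shape of the format.
sample = "T 36 92 53 115 0\nh 55 92 66 116 0\n"
boxes = parse_box_text(sample)
for b in boxes:
    print(b.glyph, b.right - b.left, b.top - b.bottom)
```

Something like this can be handy for spot-checking glyph sizes in a box 
file before running the rest of the training steps.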

For eMOP we are using Franken+ to create training for Tesseract from page 
images of early modern printed works, but we also think it can be used just 
as effectively to train Tesseract using images of any kind of font that's 
not readily available via a word processor. For example, I've seen posts in 
this group about wanting to train Tesseract to read the signs on the front 
of buses.

You can find out more about Franken+ at http://emop.tamu.edu/node/54 and 
http://dh-emopweb.tamu.edu/Franken+/. The code is also available as open 
source at https://github.com/idhmc-tamu/eMOP/tree/master/Franken%2B.

Thanks,
Matt Christy

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
