Re: Franken+ Released -- New Tool For Training Tesseract on Fonts from Page Images

matthew christy Fri, 06 Dec 2013 13:28:41 -0800

Hi Janusz,

You're right, Aletheia is not open-source. That was a poor choice of words 
on my part. However, it is free to use after registering, which is also free. 
The only restriction on its use that I'm sure about is use in a commercial 
product. I'll see if I can get a comment on that from someone at PRImA.


Thanks,
Matt

On Friday, December 6, 2013 2:10:56 PM UTC-6, matthew christy wrote:
>
> Hi All,
>
> The Initiative for Digital Humanities, Media, and Culture (IDHMC) at Texas 
> A&M University, as part of its Early Modern OCR Project 
> (eMOP<http://emop.tamu.edu/>) 
> has created a new tool, called Franken+, that provides a way to create font 
> training for the Tesseract OCR engine using page images. This is in 
> contrast to Tesseract's documented 
> method<http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3> of font 
> training which involves using a word processing program with a 
> modern font. Franken+ has now been released for beta testing, and we invite 
> anyone who's interested to give it a try and provide feedback.
>
> Franken+ works in conjunction with PRImA's open source Aletheia 
> tool<http://www.primaresearch.org/tools.php> and allows users to easily and 
> quickly identify one or more idealized forms 
> of each glyph found on a set of page images. These identified forms are 
> then used to generate a set of Franken-page images matching the page 
> characteristics documented in Tesseract's training instructions, but with a 
> font used in an actual early modern printed document. Franken+ allows you 
> to create Tesseract box files, but will also guide you through the entire 
> Tesseract training process, producing a .traineddata file, and even allow 
> you to identify and OCR documents using that training. In addition, 
> Franken+ makes it easy to combine training from multiple fonts into one 
> training set.
>
> For eMOP we are using Franken+ to create training for Tesseract from page 
> images of early modern printed works, but we also think it can be used just 
> as effectively to train Tesseract using images of any kind of font that's 
> not readily available via a word processor. For example, I've seen posts in 
> this group about wanting to train Tesseract to read the signs on the front 
> of buses.
>
> You can find out more about Franken+ at http://emop.tamu.edu/node/54 and 
> http://dh-emopweb.tamu.edu/Franken+/. The code is also available open 
> source at https://github.com/idhmc-tamu/eMOP/tree/master/Franken%2B.
>
> Thanks,
> Matt Christy
>

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

