Folks,

I'm the developer of Franken+.  I appreciate you taking the time to 
provide some off-the-cuff remarks.  I must agree with everyone here so far: 
 it would be ideal to have a tool like Franken+ that ingests tiff/box pairs 
(not tiff/PAGE XML) and that runs on something other than Windows.  So why 
did we go this route?  Because we're a grant-funded project with limited 
funds and hard deadlines.  To put this in perspective a little better, the 
eMOP project is tasked with OCR'ing over 45 million tiff images--all poor 
scans of documents printed on early printing presses--within two years. 
This is an example of the cringe-worthy documents we're dealing with: 
http://sarahwerner.net/blog/wp-content/uploads/2012/04/eebo-hamlet-1024x709.jpg
 

We began using Aletheia because it was the only tool we were aware of at 
the time that allowed us to binarize an image, clean up artifacts, and 
bound not only characters but words, lines, paragraphs, columns, pages, etc. 
for font-training purposes.  The student workers we pay to do much of 
this work have varying levels of comfort and expertise with computers, and 
Aletheia also proved to be the most GUI-driven, user-friendly tool out 
there.  Before we found that we needed to develop Franken+, we'd already 
spent hundreds of hours amassing training data using Aletheia.  We continue 
to use Aletheia because of its ability to draw polygons (not boxes) around 
characters.  If you look at the example image I linked above, you can see 
many characters around which it would be impossible to draw a box that 
isolates only that single character.  Unfortunately, we were aware of no 
open-source tool at the time that allowed us to block off characters using 
polygons rather than boxes.  If anyone is willing to develop such a tool 
(or find one that already exists), I'd be happy (in my free time) to modify 
Franken+ so that it ingests that kind of input.

We never expected to have to develop Franken+, so we used the .NET platform 
because that is the environment in which I feel most comfortable 
developing, and my time was very limited (I developed this as a graduate 
research assistant, squeezing in hours between my PhD studies in English 
and my other duties at the IDHMC).  Since Franken+ relies on Aletheia 
(which also runs only on Windows), this was not a problem for us.

Respectfully,
Bryan




On Monday, December 9, 2013 9:20:39 AM UTC-6, matthew christy wrote:
>
> Hi all,
>
> Thanks for the comments. I was not aware that there were concerns with 
> Aletheia's availability or trouble getting access to the tool. We have not 
> had any problems with that ourselves. 
>
> We are only using Aletheia as a tool to identify each glyph and create 
> tiff/box file pairs for each page processed. We are not using the PAGE 
> format or anything like that. Early in our process of trying to create 
> training for Tesseract with early modern printed documents, we had begun 
> to use Aletheia to create tiff/box pairs for Tesseract (with some 
> translation, of course). By the time we decided that we had to create a 
> new mechanism for training Tesseract, we already had quite a lot of page 
> images processed with Aletheia, all corrected and checked by hand. So we 
> started from that when creating Franken+. 
>
> We have discussed internally some of the suggestions that you are all 
> making above. Being able to use Tesseract's built-in box file generator 
> and jTessBoxEditor as input instead of Aletheia is one. Creating a 
> version that runs on other platforms is another, which would follow from 
> the above (as Aletheia also only runs on Windows). Making our own 
> repository for Franken+ is also a good idea. It will take some time to 
> get to all of that, though. In the meantime we wanted to share what we 
> had created, and this is a beta release. As open-source code, we also 
> hope that others will feel free to make some of these changes themselves 
> and share with the rest of us. In fact, I think getting Franken+ to work 
> with Tesseract/jTessBoxEditor input should be a simple matter of 
> adjusting the coordinate system that Franken+ expects in the incoming 
> box files (since Tesseract and Aletheia box files have 0,0 in different 
> corners).
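
[For anyone who wants to attempt that adjustment: the change described 
above is just a y-axis flip. A minimal sketch, assuming the page height 
in pixels is known; `flip_box_coords` is a hypothetical helper name, not 
part of Franken+. Tesseract box files place the origin at the bottom-left 
with y increasing upward, while a top-left-origin system has y increasing 
downward:]

```python
def flip_box_coords(left, top, right, bottom, page_height):
    """Convert a glyph bounding box from a top-left-origin system
    (y grows downward) to Tesseract's bottom-left-origin box format
    (y grows upward).

    Takes (left, top, right, bottom) in top-left coordinates plus the
    page height in pixels; returns (left, bottom, right, top) in the
    order a Tesseract box file line expects.
    """
    return left, page_height - bottom, right, page_height - top

# A glyph spanning y = 20..40 on a 100 px tall page becomes
# bottom = 60, top = 80 in Tesseract coordinates.
```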
>
> -
>
> Franken+ was created not only to allow us to identify the best possible 
> exemplars of each glyph in our training documents, but to generate tiffs 
> with accompanying box files that could be used to train Tesseract as well. 
> The early modern printed documents that we are trying to use to train 
> Tesseract are far from ideal. They suffer from many problems introduced at 
> every stage of the process from the original printing, to 250-550 years of 
> use and storage, to the digitization of the document. We found in early 
> testing that the more examples of glyphs we tried to train Tesseract 
> with--often with highly variable example images for each glyph--the worse 
> Tesseract did. 
>
> In a nutshell, Franken+ allows us to process several pages (with 
> Aletheia), see every exemplar of each glyph discovered, pick some small 
> number of ideal samples for each (we are in the process of testing 
> whether Tesseract does better when Franken+ is used to pick one example 
> of each glyph, 5 examples, or more), and generate a set of tiff images 
> and box files produced by creating a Franken-doc using only the set of 
> exemplars identified in Franken+. However, we have found that using 
> Franken+ has other advantages: easily identifying mislabeled glyphs, 
> doing typeface comparisons of glyphs in a document, quickly identifying 
> and removing "junk" glyphs, quickly identifying the different point 
> sizes used in a doc, etc. We also added some extra features that allow 
> users to complete the Tesseract training process with Franken+ rather 
> than having to go back to the command line.
>
> We have seen a variable amount of improvement in our OCR results with 
> Tesseract using training generated with Franken+. Some of that 
> improvement has been quite good: upwards of 15-20%, without adding 
> dictionaries. We are continuing to test variables in Franken+ training 
> to see what generates the best results. We'll add all of that 
> information to the eMOP and Franken+ pages when we have it.
>
> We do appreciate the comments and suggestions, and if anyone is 
> interested in getting Franken+ to work with Tesseract's tiff/box pairs 
> before we can get to it, please do. We are always happy to get help, 
> and we do want Franken+ to be an active open-source project.
> Thanks,
> Matt Christy
>

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
