Hi Janusz,

There are a couple of things I'd like to point out. First of all, you've mentioned 19th-century typefaces in the past, so I'm assuming that's what you're used to working with. We're dealing with 15th-18th-century documents. Like Bryan, I'm not a font history expert, but from what I've learned over the last year, I'm willing to bet that printing practices and standards in those early centuries of printing were a bit different from what they ended up being as everything became more established. Most of the typefaces we are looking at (if not all) were made by hand and so can have quite individual peculiarities. As Nick pointed out, it was not uncommon to create print blocks that contained two or three common letter combinations on one punch (I don't think that's the technically correct word, but I'll use it anyway). They were like ligatures in a way, even though the letters weren't actually connected. I'm going to call these unconnected ligatures just for ease of reference throughout this post.

If you look closely at this specimen sheet from the type caster Francois Guyot (http://collation.folger.edu/2011/09/guyots-speciman-sheet/), you'll see a number of such unconnected ligatures, and we've seen others, as Bryan noted. You'll also see a number of upper-case letters which overhang or run under their adjacent letters. The upper-case Q is a common example of this. Most of these are in the italics set, but not all.

Owing to the individualistic nature of these typefaces, we are faced with the possibility of having to train Tesseract on every possible typeface--something that is prohibitively expensive, if even possible. We have used Aletheia to train several different typefaces so far, but if we tried to create training for every handmade typeface created over the course of 250 years, we would never finish.
Thankfully, certain type casters were quite influential, and some typefaces in certain places would become "fashionable". So typefaces from different casters can often be quite similar to each other. But just because a type caster made his 'e' look like Guyot's 'e' doesn't mean that he didn't also decide to create a bunch of unconnected ligatures of his own in his type set, or that he created the same ones Guyot thought were important, etc. In fact, due to the inconsistent output of printing presses from this time, I've found that two lower-case e characters from specimen sheets produced 200 years apart can look more like each other than two lower-case e characters printed on the same page of just one document using one of those typefaces. Therefore we are pursuing the possibility that we can train Tesseract to recognize "families" of typefaces which are similar enough to each other that they won't require training Tesseract for each typeface (not to mention the problem of then identifying the documents in our collections which use each typeface).

Doing this, however, means that the idea of training Tesseract (using only square boxes) to recognize every possible unconnected ligature in our corpus would again be prohibitively expensive (both in terms of time and the expertise required), and probably not possible. If we only used boxes in training Tesseract, we'd have to closely examine every document we would be OCR'ing with that training in order to make sure that we identified (and collected multiple samples of) each unconnected ligature to add to the training; otherwise Tesseract won't recognize them. That would seem to defeat the purpose of using a computer to optically recognize the characters.
It makes much more sense to pull these unconnected ligatures apart and train Tesseract to recognize each character separately, so as to increase Tesseract's ability to recognize these characters across multiple documents, whether they were printed as unconnected ligatures or not. As Bryan noted, for connected ligatures like 'sh', 'st', 'ff', etc., we are of course training Tesseract to recognize them as one glyph. In that work we are using MUFI's Unicode values, and even some privately assigned ones (which we have documented by adding them to the list created by PRImA for IMPACT at http://tools.primaresearch.org/Special%20Characters%20in%20Aletheia.pdf).

Besides, creating space between character glyphs during training is exactly what's described in Tesseract's own training procedures. That's why we created Franken+: so that we could identify each glyph in a document and create a Franken-document of TIFFs that matches what Tesseract's training documentation says it needs to be trained with.

Another thing is that it is quite common in the documents we are OCR'ing for standard and italic type to be present on the same page and even on the same line. It's not at all uncommon for documents to be printed with both roman and blackletter fonts throughout, again on the same lines. So we need to be able to train Tesseract to recognize both standard and italics. In the italic typefaces the letters overlap quite often, so square boxes wouldn't work there. I'm sure there are other techniques available for training on italics, but creating a training system that was consistent and easy to use for all the typefaces we are dealing with was a primary goal, as we would not be able to complete our work in the time allowed without the help of unskilled labor.
I'd also like to point out that none of the examples we've provided in any of these discussions represent unusual or special situations. They are VERY TYPICAL of the documents we are dealing with. We also recognize that there are going to be other cases in the 45 million page images we have that none of our team has ever seen before. So we feel it is essential to create training that is "generic", in order to get Tesseract to recognize as many glyphs as possible without requiring us to identify every special case beforehand. There will of course be special cases that Tesseract will fail to recognize during the OCR'ing of 45 million pages, which is why we are currently working so hard to create a robust, machine-learning-based post-processing triage system to help us identify these failures.

I do understand what you're saying, Janusz, and I think that if we were dealing with a much smaller and more specific set of documents from a much shorter time period, we could probably afford to be more specific in our training. But we're not, and so some of the things you're talking about doing just won't work for this project.

Also, just so you know, we started by trying to train Tesseract using high-quality page images of documents printed in typefaces we knew we were interested in. These page images were of much better quality than the ones we'll actually be OCR'ing. The results were terrible. We were lucky if we could get Tesseract to recognize 80% of the words on the exact same page we'd used to train it. And that was even with dictionaries and a unicharambigs file created to address the errors Tesseract was making on that page. That's why we created Franken+.

Thanks again,
Matt Christy
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en

