Hi Janusz,

There are a couple of things I'd like to point out. First of all, you've mentioned 19th-century typefaces in the past, so I'm assuming that's what you're used to working with. We're dealing with 15th-18th-century documents. Like Bryan, I'm not a font history expert, but from what I've learned over the last year, I'm willing to bet that printing practices and standards in those early centuries of printing were a bit different from what they ended up being as everything became more established. Most of the typefaces we are looking at (if not all) were made by hand and so can have quite individual peculiarities. As Nick pointed out, it was not uncommon to create print blocks that contained two or three common letter combinations on one punch (I don't think that's the technically correct word, but I'll use it anyway). They were like ligatures in a way, even though the letters weren't actually connected. I'm going to call these unconnected ligatures just for ease of reference throughout this post.

If you look closely at this specimen sheet from the type caster Francois Guyot (http://collation.folger.edu/2011/09/guyots-speciman-sheet/), you'll see a number of such unconnected ligatures, and we've seen others, as Bryan noted. You'll also see a number of upper-case letters which overhang or run under their adjacent letters. The upper-case Q is a common example of this. Most of these are in the italics set, but not all.

Owing to the individualistic nature of these typefaces, we are faced with the possibility of having to train Tesseract on every possible typeface--something that is prohibitively expensive, if even possible. We have used Aletheia to train several different typefaces so far, but if we tried to create training for every handmade typeface created over the course of 250 years, we would never finish.
Thankfully, certain type casters were quite influential, and some typefaces in certain places would become "fashionable". So typefaces from different casters can often be quite similar to each other. But just because a type caster made his 'e' look like Guyot's 'e' doesn't mean that he didn't also decide to create a bunch of unconnected ligatures of his own in his type set, or that he created the same ones Guyot thought were important, etc. In fact, due to the inconsistent output of printing presses from this time, I've found that two lower-case e characters from specimen sheets produced 200 years apart can look more like each other than two lower-case e characters printed on the same page of just one document using one of those typefaces. Therefore we are pursuing the possibility that we can train Tesseract to recognize "families" of typefaces which are similar enough to each other that they won't require training Tesseract for each typeface (not to mention the problem of then identifying the documents in our collections which use each typeface).

Doing this, however, means that the idea of training Tesseract (using only square boxes) to recognize every possible unconnected ligature in our corpus would again be prohibitively expensive (both in terms of time and the expertise required), and probably not possible. If we only used boxes in training Tesseract, we'd have to closely examine every document we would be OCR'ing with that training in order to make sure that we identified (and collected multiple samples of) each unconnected ligature to add to the training; otherwise Tesseract won't recognize them. That would seem to defeat the purpose of using a computer to optically recognize the characters.
It makes much more sense to pull these unconnected ligatures apart and train Tesseract to recognize each character separately, so as to increase Tesseract's ability to recognize these characters across multiple documents, whether they were printed as unconnected ligatures or not. As Bryan noted, for connected ligatures like 'sh', 'st', 'ff', etc., we are of course training Tesseract to recognize them as one glyph. In that work we are using MUFI's Unicode values, and even some privately assigned ones (which we have documented by adding them to the list created by PRImA for IMPACT at http://tools.primaresearch.org/Special%20Characters%20in%20Aletheia.pdf).

Besides, creating space between character glyphs during training is exactly what's described in Tesseract's own training procedures. That's why we created Franken+: so that we could identify each glyph in a document and create a Franken-document of TIFFs that matches what Tesseract's training documentation says it needs to be trained with.

Another thing is that it is quite common in the documents we are OCR'ing for standard and italic type to be present on the same page and even on the same line. It's not at all uncommon for documents to be printed with both roman and blackletter fonts throughout, again on the same lines. So we need to be able to train Tesseract to recognize both standard and italics. In the italic typefaces the letters overlap quite often, so square boxes wouldn't work there. I'm sure there are other techniques available for training on italics, but creating a training system that was consistent and easy to use for all the typefaces we are dealing with was a primary goal, as we would not be able to complete our work in the time allowed without the help of unskilled labor.
I'd also like to point out that none of the examples we've provided in any of these discussions represent unusual or special situations. They are VERY TYPICAL of the documents we are dealing with. We also recognize that there are going to be other cases in the 45 million page images we have that none of our team has ever seen before. So we feel it is essential to create training that is "generic", in order to get Tesseract to recognize as many glyphs as possible without requiring us to identify every special case beforehand. There will of course be special cases that Tesseract will fail to recognize during the OCR'ing of 45 million pages, which is why we are currently working so hard to create a robust, machine-learning-based post-processing triage system to help us identify these failures.

I do understand what you're saying, Janusz, and I think that if we were dealing with a much smaller and more specific set of documents from a much shorter time period, we could probably afford to be more specific in our training. But we're not, and so some of the things you're talking about doing just won't work for this project.

Also, just so you know, we started by trying to train Tesseract using high-quality page images of documents printed in typefaces we knew we were interested in. These page images were of much better quality than the ones we'll actually be OCR'ing. The results were terrible. We were lucky if we could get Tesseract to recognize 80% of the words on the exact same page we'd used to train it. And that was even with dictionaries and a unicharambigs file created to address the errors Tesseract was making on that page. That's why we created Franken+.

Thanks again,
Matt Christy
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en

