Hi Matt,

On Wed, Dec 11, 2013 at 06:17:00AM -0800, matthew christy wrote:
> If we only used boxes
> in training Tesseract, we'd have to closely examine every document which we
> would be OCR'ing with that training in order to make sure that we
> identified (and collected multiple samples of) each unconnected ligature to
> add to the training. Otherwise Tesseract won't recognize them. That would
> seem to defeat the purpose of using a computer to try to optically
> recognize the characters. It makes much more sense to pull these
> unconnected ligatures apart and train Tesseract to recognize each character
> separately so as to increase Tesseract's ability to recognize these
> characters on multiple documents whether they were printed as unconnected
> ligatures or not.

You may be right, but I'm not entirely convinced. I'm not sure that
"pulling apart" the k and e in your recent example makes sense, because
you're unlikely to see a k with that long a tail that isn't part of a
ligature anyway. So if Tesseract saw a ligature like the ke (and hadn't
been trained on it as one character), it would probably break it down
into a k and an e such that much of the tail of the k fell outside the
k box anyway. Unless Tesseract worked by splitting glyphs into arbitrary
shapes (which it doesn't, and won't), I don't think it makes sense for
you to train it for ligatures using such shapes.

I would have thought the best approach for your situation, where, as you
rightly point out, there are more ligatures than you have time to find
and train, is to train the common ligatures (as you're doing) and just
trust that the less common ones will be split into separate characters
that are close enough to their non-ligatured versions to be recognised
as such.

> Besides, creating space between character glyphs during training is exactly
> what's described in Tesseract's own training procedures. That's why we
> created Franken+: so that we could identify each glyph in a document, and
> create a Franken-document of tiffs, that match what Tesseract's training
> document says it needs to be trained with.

I think the need for plenty of space between letters when training is
less acute than it used to be, so this may not be a big issue anymore.
It was certainly a big problem with Tesseract 2.x, but should be less so
now. There are advantages to using "realistic" spacing, as it lets
Tesseract estimate more accurately characters' positions on the line and
their closeness to their neighbours. It may still be that the characters
in your source documents are too close for comfort, but I wouldn't bet
on it.

> For the italic typefaces, the letters
> overlap quite often, so here using square blocks wouldn't work. I'm sure
> that there are some other techniques available to train for italics, but
> creating a training system that was consistent and easy to use for all the
> typefaces we are dealing with was a primary goal, as we would not be able
> to complete our work in the time allowed without the help of unskilled
> labor.

I haven't had to train an italic font yet. Would the printing sorts have
been slanted for some italic fonts? I suspect so (but don't know; someone
should look it up), which would result in the slight overlap you see. If
that is the case, I wonder whether Tesseract takes it into account.
Arguably it should, but as far as I know it just deals with regular
rectangles. It certainly does some extra cleverness to deal with italics,
but I suspect small overlaps of the kind that you'll see with italic
fonts are essentially just ignored. I don't know whether that's also
true in the training process. It will be interesting to see how the new
training tools to be released deal with italics.

-- 
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en

