> I would have thought the best approach for your situation, where as
> you rightly point out there are more ligatures than you have the
> time to find and train all of, is to train the common ligatures (as
> you're doing), and just trust that less common ligatures will be
> identified as separate characters that are close enough to their
> non-ligatured versions that they'll be recognised as such.

Nick, that's exactly what we're doing. Just as we aren't trying to identify all the possible ligatures in the documents we are OCRing, we're also not trying to identify all the possible ligatures in the documents from which we are creating the training. There are lots of characters, like the 'k' in the "ke" that Bryan pointed out, that under-hang their neighbors, and I don't think all of them are ligatures. Look at that Guyot specimen sheet I linked to earlier. So I'm not sure how we would even identify ligatures when the characters are not connected. Regardless, given that Tesseract will most likely use rectangular boxes while trying to recognize characters, the best we can do is to create training for every contiguous glyph that we can identify, and hope that Tesseract will have enough information to identify them when OCR'ing, whether the glyphs under/over-hang each other or not.
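
To make the rectangular-box concern a bit more concrete, here is a minimal Python sketch that reads lines in the Tesseract box file layout (glyph, left, bottom, right, top, page) and reports neighboring boxes whose horizontal extents overlap, which is what happens when one glyph under-hangs or over-hangs the next. The coordinates and glyphs in SAMPLE are made up for illustration, and the helper names (parse_box_line, horizontal_overlaps) are hypothetical, not part of Tesseract.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    glyph: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box file line: '<glyph> <left> <bottom> <right> <top> <page>'."""
    # Split from the right so a multi-character glyph (a ligature) stays whole.
    glyph, left, bottom, right, top, page = line.strip().rsplit(" ", 5)
    return Box(glyph, int(left), int(bottom), int(right), int(top), int(page))


def horizontal_overlaps(boxes: List[Box]) -> List[Tuple[str, str, int]]:
    """Return (glyph, next_glyph, shared_pixels) for consecutive overlapping boxes."""
    overlaps = []
    for a, b in zip(boxes, boxes[1:]):
        # Width of the horizontal range covered by both rectangles.
        shared = min(a.right, b.right) - max(a.left, b.left)
        if a.page == b.page and shared > 0:
            overlaps.append((a.glyph, b.glyph, shared))
    return overlaps


# Invented coordinates: the 'k' box reaches 6 pixels into the 'e' box,
# while the 'ſt' ligature is boxed as a single glyph and stands clear.
SAMPLE = """\
k 100 50 130 110 0
e 124 50 150 90 0
ſt 156 50 200 112 0
"""

boxes = [parse_box_line(line) for line in SAMPLE.splitlines() if line.strip()]
print(horizontal_overlaps(boxes))
```

Running something like this over our own box files would at least tell us how often, and by how much, the boxes for under-hanging characters collide. It says nothing about how Tesseract itself treats those collisions.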

> I haven't had to train an italic font yet. Would the printing sorts
> have been slanted for some italic fonts? I suspect so (but don't
> know; someone should look it up), which would result in the slight
> overlap you see. If that is the case, I wonder if Tesseract takes it
> into account? Arguably it should, but as far as I know it just deals
> with regular rectangles. There is certainly some extra cleverness it
> does to deal with italics... I suspect small overlaps of the kind
> that you'll see with italic fonts are essentially just ignored. I
> don't know whether that's also true in the training process. It will
> be interesting to see how the new training tools to be released deal
> with italics.

I've always assumed that they are slanted, and I will ask our book history expert next time I see him. Some italic fonts are more slanted than others and can have a good deal of overlap. I've also wondered whether specifying that a font is italic during training is how you indicate that Tesseract should use slanted boxes while OCR'ing, or something like that.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en

