On Sat, Dec 07, 2013 at 11:29:25AM -0800, Tom Morris wrote:
> In watching Bryan Tarpley's Franken+ presentation
> (http://emop.tamu.edu/node/54) it's pretty obvious from the example that
> there are (at least) two clusters of glyphs for the letter 'o': a tall
> skinny glyph and a round glyph.

Good point, Tom. It wasn't clear to me from that presentation whether the
glyphs had all been taken from the same document. I slightly suspect the
skinny o and the round o were from different documents, or that they were
different fonts in the same document, because to me they don't look close
enough to have been made with the same metal characters. Granted, early
printing was rather more haphazard than today's (excluding print on
demand, obviously ;-) ), but I still find it hard to believe that printers
would very often have used characters cut so differently interchangeably.
So to me, treating them as different fonts sounds like an entirely
reasonable thing to do.

However, the broader question you're raising is an important one; in some
cases it would be useful for Tesseract to treat differently shaped glyphs
as the same character. I do get the impression that it may already do
this internally, but I haven't looked into it.

> It seems like this task requires fundamentally different ways of training and
> recognizing because it violates a whole set of (very reasonable) assumptions
> that a modern OCR engine has built in to it.

It would be an interesting task to document the major assumptions that
Tesseract makes as it weighs up its tasks. To be honest, I doubt many of
them are inappropriate for OCR of early modern printed works, but I'm
sure some are, and I agree that considering the problem at that level
would be a very good thing to do.

> Is there anyone attacking this problem at a more fundamental level than just
> tweaking Tesseract training? Are there other groups doing research in this
> area besides eMOP and IMPACT?

I hope to get more into these sorts of questions next year, as I change
jobs into one in which I get to spend more of my time with fun OCR :)

Nick

-- 
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en

