Re: Training Tesseract for early printed text

Nick White Mon, 09 Dec 2013 04:25:31 -0800

On Sat, Dec 07, 2013 at 11:29:25AM -0800, Tom Morris wrote:
> In watching Bryan Tarpley's Franken+ presentation
> (http://emop.tamu.edu/node/54) it's pretty obvious from the example that
> there are (at least) two clusters of glyphs for the letter 'o': a tall
> skinny glyph and a round glyph.


Good point Tom. It wasn't clear to me from that presentation whether
the glyphs had all been taken from the same document. I slightly
suspect skinny o and round o were from different documents, or that
they were different fonts in the same document, because to me they
don't look close enough to have been made with the same metal
characters. Granted, early printing was rather more haphazard than
today's (excluding print on demand, obviously ;-) ), but I still
find it hard to believe that they would very often have used
characters cut so differently interchangeably.

So to me, treating them as different fonts sounds like an entirely
reasonable thing to do.

However, the broader question you're raising is an important one: in
some cases it would be useful for Tesseract to treat differently
shaped glyphs as the same character. I do get the impression that it
may already do this internally, but I haven't looked into it.

> It seems like this task requires fundamentally different ways of training and
> recognizing because it violates a whole set of (very reasonable) assumptions
> that a modern OCR engine has built in to it.

It would be an interesting task to document the major assumptions that
Tesseract makes as it weighs up its tasks. To be honest I doubt many
of them are inappropriate for OCR of early modern printed works, but
I'm sure some are, and I agree that considering the problem at that
level would be a very good thing to do.

> Is there anyone attacking this problem at a more fundamental level than just
> tweaking Tesseract training?  Are there other groups doing research in this
> area besides eMOP and IMPACT?

I hope to get more into these sorts of questions next year, as I
change jobs into one in which I get to spend more of my time with
fun OCR :)

Nick

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

