Hi Matt,

On Wed, Dec 11, 2013 at 06:17:00AM -0800, matthew christy wrote:
> If we only used boxes
> in training Tesseract, we'd have to closely examine every document which we
> would be OCR'ing with that training in order to make sure that we
> identified (and collected multiple samples of) each unconnected ligature to
> add to the training. Otherwise Tesseract won't recognize them. That would
> seem to defeat the purpose of using a computer to try to optically
> recognize the characters. It makes much more sense to pull these
> unconnected ligatures apart and train Tesseract to recognize each character
> separately so as to increase Tesseract's ability to recognize these
> characters on multiple documents whether they were printed as unconnected
> ligatures or not.

You may be right, but I'm not entirely convinced. I'm not sure that
"pulling apart" the k and e in your recent example makes sense,
because you're unlikely to see a k with that long a tail that isn't
part of a ligature anyway. So if Tesseract saw a ligature like the ke
(and it hadn't been trained on it as one character), it would
probably break it down into a k and an e such that much of the
tail of the k was not part of the k box anyway. Unless Tesseract
worked by splitting glyphs into arbitrary shapes (which it doesn't,
and won't), I don't think it makes sense for you to train it for
ligatures using glyphs pulled apart that way.

I would have thought the best approach for your situation, where as
you rightly point out there are more ligatures than you have time to
find and train, is to train the common ligatures (as
you're doing), and just trust that less common ligatures will be
identified as separate characters that are close enough to their
non-ligatured versions that they'll be recognised as such.

> Besides, creating space between character glyphs during training is exactly
> what's described in Tesseract's own training procedures. That's why we
> created Franken+: so that we could identify each glyph in a document, and
> create a Franken-document of tiffs, that match what Tesseract's training
> document says it needs to be trained with.

I think that the issue of needing plenty of space between letters
when training is less acute than it used to be, so this may not be
a big issue anymore. It was a big problem with Tesseract 2.x,
certainly, but should be less so now. There are advantages to using
"realistic" spacing, in that it lets Tesseract estimate more
accurately characters' positions on the line and closeness to their
neighbours. It may be that the characters in your source documents
are still too close for comfort, but I wouldn't bet on it.

> For the italic typefaces, the letters
> overlap quite often, so here using square blocks wouldn't work. I'm sure
> that there are some other techniques available to train for italics, but
> creating a training system that was consistent and easy to use for all the
> typefaces we are dealing with was a primary goal, as we would not be able
> to complete our work in the time allowed without the help of unskilled
> labor.

I haven't had to train an italic font yet. Would the printing sorts
have been slanted for some italic fonts? I suspect so (but don't
know; someone should look it up), which would result in the slight
overlap you see. If that is the case, I wonder if Tesseract takes it
into account? Arguably it should, but as far as I know it just deals
with regular rectangles. There is certainly some extra cleverness it
does to deal with italics... I suspect small overlaps of the kind
that you'll see with italic fonts are essentially just ignored. I
don't know whether that's also true in the training process. It will
be interesting to see how the new training tools to be released deal
with italics.
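
If you want to see how often the boxes in your italic training data
actually overlap, something like the following might help. It is only
a rough sketch, not part of Tesseract: it assumes the 3.x box file
layout of one glyph per line, "symbol left bottom right top page",
with the origin at the bottom-left of the image, and the names Box,
parse_box_lines and horizontal_overlaps are made up for illustration.
The sample coordinates are invented too.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Box(NamedTuple):
    """One glyph box; field order follows the assumed box file layout."""
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(lines: Iterable[str]) -> List[Box]:
    """Parse box file lines, skipping any that do not have six fields."""
    boxes = []
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue
        left, bottom, right, top, page = map(int, parts[1:])
        boxes.append(Box(parts[0], left, bottom, right, top, page))
    return boxes


def horizontal_overlaps(boxes: List[Box]) -> List[Tuple[str, str, int]]:
    """Return (char, next_char, pixels) for consecutive boxes that overlap."""
    found = []
    for a, b in zip(boxes, boxes[1:]):
        same_page = a.page == b.page
        # Vertical ranges must intersect, otherwise b is on another text line.
        same_line = a.bottom < b.top and b.bottom < a.top
        if same_page and same_line and b.left < a.right:
            found.append((a.char, b.char, a.right - b.left))
    return found


# Invented sample data: the "f" box runs 5 pixels into the "i" box.
sample = [
    "f 100 200 130 260 0",
    "i 125 200 140 250 0",
    "n 145 200 175 240 0",
]
overlaps = horizontal_overlaps(parse_box_lines(sample))
print(overlaps)
```

Running that over a real box file (read the lines with open(...) and
pass them to parse_box_lines) would at least tell you whether the
overlaps are a pixel or two, or something larger.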

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
