Re: Franken+ Released -- New Tool For Training Tesseract on Fonts from Page Images

matthew christy Wed, 18 Dec 2013 13:22:10 -0800

>
> I would have thought the best approach for your situation, where as 
> you rightly point out there are more ligatures than you have the 
> time to find and train all of, is to train the common ligatures (as 
> you're doing), and just trust that less common ligatures will be 
> identified as separate characters that are close enough to their 
> non-ligatured versions that they'll be recognised as such. 
>
Nick, that's exactly what we're doing. And just as we aren't trying to 
identify all the possible ligatures in the documents we are OCRing, we're 
also not trying to identify all the possible ligatures in the documents 
from which we are creating the training. There are lots of characters, like 
the 'k' in the "ke" that Bryan pointed out, that under-hang their neighbors, 
and I don't think all of them are ligatures. Look at that Guyot specimen 
sheet I linked to earlier. So I'm not sure how we would even identify 
ligatures when the characters are not connected. Regardless, given that 
Tesseract will most likely use rectangular boxes while trying to recognize 
characters, the best we can do is to create training for every contiguous 
glyph that we can identify and hope that Tesseract will have enough 
information to identify them when OCR'ing, whether or not the glyphs 
under/over-hang each other.
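
To make the under-hang case concrete, here is a minimal sketch (not part of 
Franken+ or Tesseract) that reads lines in the Tesseract box file layout, 
"<glyph> <left> <bottom> <right> <top> <page>", and reports neighboring 
glyphs whose rectangles overlap horizontally. The sample coordinates and the 
names Box, parse_box_line and overlapping_neighbors are made up for 
illustration, and it assumes the boxes are listed in reading order on a 
single text line.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    glyph: str
    left: int
    bottom: int
    right: int
    top: int


def parse_box_line(line: str) -> Box:
    # Box file line: "<glyph> <left> <bottom> <right> <top> <page>"
    parts = line.split()
    left, bottom, right, top = (int(p) for p in parts[1:5])
    return Box(parts[0], left, bottom, right, top)


def overlapping_neighbors(boxes: List[Box]) -> List[Tuple[Box, Box, int]]:
    # Compare each box with the next one and keep the pairs whose
    # horizontal extents intersect (an under- or over-hanging glyph).
    found = []
    for a, b in zip(boxes, boxes[1:]):
        overlap = min(a.right, b.right) - max(a.left, b.left)
        if overlap > 0:
            found.append((a, b, overlap))
    return found


# Hypothetical coordinates: the 'k' reaches 4 px under the following 'e'.
sample = """k 100 40 131 95 0
e 127 40 150 72 0
t 155 40 170 88 0"""

boxes = [parse_box_line(line) for line in sample.splitlines()]
pairs = overlapping_neighbors(boxes)
for a, b, px in pairs:
    print(f"{a.glyph!r} and {b.glyph!r} overlap by {px} px")
```

A check like this only flags where rectangular boxes collide; it says 
nothing about whether the pair is a true ligature or how Tesseract itself 
treats the overlap.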



> I haven't had to train an italic font yet. Would the printing sorts 
> have been slanted for some italic fonts? I suspect so (but don't 
> know; someone should look it up), which would result in the slight 
> overlap you see. If that is the case, I wonder if Tesseract takes it 
> into account? Arguably it should, but as far as I know it just deals 
> with regular rectangles. There is certainly some extra cleverness it 
> does to deal with italics... I suspect small overlaps of the kind 
> that you'll see with italic fonts are essentially just ignored. I 
> don't know whether that's also true in the training process. It will 
> be interesting to see how the new training tools to be released deal 
> with italics. 
>
I've always assumed that they are slanted, and I will ask our book history 
expert next time I see him. Some italic fonts are more slanted than others 
and can have a good deal of overlap. I've also wondered whether specifying 
that a font is italic during training is how you tell Tesseract to use 
slanted boxes while OCR'ing, or something like that.

-- 
-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

