Nick,

No--the example I provided was from a footnote.  I'm sure you're right
that the original printer used "ligatures" in the sense that two or more
characters were present on the same plate (forgive me for not knowing
book history terminology!  We work with folks at the Cushing Library
here who are book history scholars, and they fill that knowledge gap for
people like me :) ).  The problem is that these custom "ligatures" are
not available as single characters in Unicode.  We originally tried to
place multiple characters in single "boxes" to train Tesseract, but the
results for us were poor.  While you may put more than one character per
line in a Tesseract box file, you cannot use more than one character at
a time in the unicharambigs file, for instance (Google claims you can,
but you can't--it's a bug).  So we decided to treat most "ligatures" as
separate characters, and while we're still amassing test data, the
results are better.  Granted, certain ligatures like "fl" or "sl" do
have Unicode values, so we use those.
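In case it helps anyone following along: a Tesseract box file has one
line per glyph in the format "<glyph> <left> <bottom> <right> <top>
<page>".  Here's a small sketch (coordinates invented for illustration)
of the difference between boxing a custom two-character "ligature" as
one entry versus treating the two characters separately:

```python
# Illustrative sketch only -- the coordinates below are made up.
# Box-file line format: <glyph> <left> <bottom> <right> <top> <page>

def parse_box_line(line):
    """Split a box-file line into (glyph, left, bottom, right, top, page)."""
    glyph, left, bottom, right, top, page = line.split()
    return glyph, int(left), int(bottom), int(right), int(top), int(page)

# One box holding a custom two-character "ligature" (allowed in box files):
ligature_line = "ct 120 400 165 440 0"

# The same glyphs treated as separate characters -- one box apiece:
separate_lines = ["c 120 400 140 440 0",
                  "t 141 400 165 440 0"]

glyph, left, bottom, right, top, page = parse_box_line(ligature_line)
```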

With Franken+, using polygons to bound characters that normally overlap
with others has allowed us to snip them out of context and reproduce
synthetic TIFF images in which they do not overlap.  These synthetic
images (where each character is pristine and none overlap) are what
we use to train Tesseract.
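To make the idea concrete, here's a minimal sketch of polygon-based
snipping (this is not the Franken+ code, just an illustration): a pixel
is kept only if its center falls inside the polygon, so strokes from an
overlapping neighbor are excluded, and the snipped glyphs are then laid
out side by side with padding so nothing overlaps.

```python
# Minimal sketch of polygon snipping (not the Franken+ implementation).
# Images are 2-D lists of 0/1 pixels: 1 = ink, 0 = background.

def point_in_poly(x, y, poly):
    """Ray-casting point-in-polygon test; poly is a list of (x, y) vertices."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            if x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
                inside = not inside
    return inside

def snip(image, poly):
    """Copy only the pixels whose centers lie inside poly; outside -> 0."""
    xs = [x for x, _ in poly]
    ys = [y for _, y in poly]
    x0, x1 = int(min(xs)), int(max(xs))
    y0, y1 = int(min(ys)), int(max(ys))
    return [[image[y][x] if point_in_poly(x + 0.5, y + 0.5, poly) else 0
             for x in range(x0, x1)]
            for y in range(y0, y1)]

def compose(glyphs, pad=2):
    """Place snipped glyphs on one synthetic line, pad blank columns apart."""
    height = max(len(g) for g in glyphs)
    row = [[] for _ in range(height)]
    for g in glyphs:
        width = len(g[0])
        for y in range(height):
            src = g[y] if y < len(g) else [0] * width
            row[y].extend(src + [0] * pad)
    return row
```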

As for your question about Aletheia: while bounding lines, paragraphs,
etc. is not necessary for training Tesseract, we're using several
post-processing algorithms to detect whether problems with the initial
OCR are due to poor line segmentation, reading order, column detection,
etc.  To "train" these algorithms we need sample data--hence the
meticulous bounding of characters, words, lines, paragraphs, columns,
etc. in our training TIFF images using Aletheia.
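One way a check like that could work (a sketch of the general idea, not
the eMOP algorithms themselves): compare the line boxes the OCR engine
reports against the hand-corrected ground-truth line boxes, and flag
pages whose best-match overlap is low as likely segmentation failures.

```python
# Hypothetical sketch: score OCR line segmentation against ground-truth
# line boxes using intersection over union (IoU).  A low mean best-match
# score suggests the lines were segmented badly (e.g. two lines merged).

def iou(a, b):
    """IoU of axis-aligned boxes given as (left, top, right, bottom)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def segmentation_score(ocr_lines, truth_lines):
    """Mean best-match IoU of each ground-truth line against the OCR lines."""
    scores = [max(iou(t, o) for o in ocr_lines) for t in truth_lines]
    return sum(scores) / len(scores)
```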

The eMOP project will be releasing its entire workflow, including the
source code for these post-processing algorithms, all of our Aletheia
training data, and all of the TIFF/box pairs we used to train Tesseract.
 With the right hardware, in theory, anyone could replicate it.  We're
hoping the game changer for us will be our meticulous, font-specific
training on the front end; the power of our Brazos supercomputing
cluster to do enormous, parallelized OCR'ing at large scale; and our
post-processing "triage" methods, which will tell us whether poor
results are due to the use of the wrong font, bad segmentation, the
presence of images on the page, etc.  We'll also have several web-based
tools for crowd-sourcing corrections (like TypeWright and Aletheia
Layout Editor) on some of the data that OCR just can't crack.
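For a sense of what "triage" means here, a toy rule-based dispatcher
might look like this (the signal names and thresholds are entirely
invented, not eMOP's): given per-page diagnostic scores, guess the most
likely cause of a bad result so the page can be routed to the right fix.

```python
# Purely illustrative triage rules -- the diagnostics and thresholds
# below are hypothetical, not eMOP's actual post-processing methods.

def triage(font_match, segmentation, image_fraction, threshold=0.5):
    """Return a label for the most likely cause of a bad OCR result."""
    if font_match < threshold:
        return "wrong-font"        # retry with different font training
    if segmentation < threshold:
        return "bad-segmentation"  # route to layout correction
    if image_fraction > threshold:
        return "page-images"       # page dominated by illustrations
    return "needs-human-review"    # no automatic check fired
```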

I hope that answers some of your questions--thanks for the feedback!
Bryan


On Tuesday, December 10, 2013 12:45:53 PM UTC-6, Nick White wrote:
>
> Hi Brian, nice to hear from you. 
>
> > We began using Aletheia because it was the only tool we were aware of at 
> the 
> > time which allows us to binarize an image, clean up artifacts, and bound 
> not 
> > only characters but words, lines, paragraphs, columns, pages, etc for 
> > font-training purposes.  The student workers who we pay to do much of 
> this work 
> > have varying levels of comfort/expertise with computers, so Aletheia 
> also 
> > proved to be the most GUI driven, user-friendly tool out there. 
>
> When you say it binds words, lines, paragraphs for font training 
> purposes, can you explain what you mean? I haven't used Aletheia, so 
> it isn't obvious to me. 
>
> Do you mean that the interface is separated by words, so people 
> correcting the box files can (for example) see that "babe" is 
> misrecognised as "bard" and then just click near the word and type 
> "babe"? I can see that this could be a faster approach to correcting 
> things, potentially. I don't think the current box editors we have 
> are very focused towards this sort of "proofreading" model, and 
> perhaps they should be more so. 
>
> Looking forward to hearing more from you, 
>
> Nick 
>

-- 
-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

