Nick,

Some of the ligatures for which we have Unicode equivalents (like "sl" and
"fl"), and which clearly form a single, contiguous shape, are without a
doubt best treated as a single character.  But others, such as the "ke"
"ligature" I provided in my attachment earlier in this thread, are not
composed of two letters that form a contiguous shape--they are clearly
separate letters that only "overlap" when you draw boxes around them.
We've found that when two letters aren't touching, Tesseract has trouble
identifying them together as a single ligature, /especially/ given that the
character "e" by itself looks exactly the same as the one in "ke."  In
those cases, even though the printer may have combined the "k" and the "e"
onto the same plate to form the "ligature" "ke" (what's the better word
for plate here?), it is better to train Tesseract to recognize them as
separate characters, from what we've found.  I feel like I'm talking in
circles, so if this is making no sense, I can try to give example images of
what I'm talking about tomorrow.
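
To make the "boxes overlap" point a bit more concrete, here is a rough
Python sketch (not part of the eMOP workflow) that lists neighbouring
entries in a box file whose rectangles overlap.  It assumes the usual box
file layout of "<symbol> <left> <bottom> <right> <top> <page>" with the
origin at the bottom-left of the page; the function names and the sample
coordinates are made up for illustration.  Note that it only looks at the
rectangles: whether the glyphs themselves touch would have to be checked
against the image.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    # Assumed layout: "<symbol> <left> <bottom> <right> <top> <page>".
    # Split from the right so a multi-character symbol stays in one piece.
    char, left, bottom, right, top, page = line.strip().rsplit(" ", 5)
    return Box(char, int(left), int(bottom), int(right), int(top), int(page))


def boxes_overlap(a: Box, b: Box) -> bool:
    # True when the two rectangles share some area on the same page.
    return (a.page == b.page
            and a.left < b.right and b.left < a.right
            and a.bottom < b.top and b.bottom < a.top)


def overlapping_neighbours(lines: List[str]) -> List[Tuple[str, str]]:
    # Compare each box only with the one that follows it in the file.
    boxes = [parse_box_line(l) for l in lines if l.strip()]
    return [(a.char, b.char)
            for a, b in zip(boxes, boxes[1:])
            if boxes_overlap(a, b)]


# Hypothetical coordinates: the "k" box reaches over the start of the "e".
sample = [
    "k 100 200 130 250 0",
    "e 126 200 150 232 0",
    "y 160 190 185 232 0",
]
print(overlapping_neighbours(sample))  # [('k', 'e')]
```

Pairs flagged this way are only candidates: a human (or an image-based
check) still has to decide whether each one is a true contiguous ligature
or two separate letters that merely share box space.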

No worries about misspelling my name--I once got a card addressed to "Brain
Tarpley" ;)

Thanks,
b

On Tuesday, December 10, 2013, Nick White wrote:

> Hi Bryan, I'm responding to parts of several of your messages below.
>
> On Tue, Dec 10, 2013 at 03:15:31PM -0600, Bryan Tarpley wrote:
> > Our initial findings are that trying to train
> > Tesseract to recognize these ligatures is less effective than training
> > it to treat them as separate characters.  In other words, we're having
> > better results normalizing on the front end, both in terms of accuracy
> > and efficiency re: Tesseract.
>
> That is surprising, because Tesseract segments characters in boxes
> (just like its makebox mode) when it does OCR, so I'd expect
> overlapping ligatures to be better detected when trained for than as
> separate characters. I suppose ligatures may well on average vary
> more, which might explain it. But still, it's surprising.
>
> > While you may put more than one
> > character per line in a Tesseract box file, you cannot use more than
> > one character at a time in the unicharambigs file, for instance (Google
> > claims you can but you can't--it's a bug)
>
> Have you reported this on the issue tracker? Please do if you
> haven't; that is certainly a bug that should be fixed (and shouldn't
> be too difficult to fix).
> http://code.google.com/p/tesseract-ocr/issues/list
>
> > The eMOP project will be releasing its entire workflow, including
> > the source code for these post-processing algorithms, all of our
> > Aletheia training data, and all of the tiff/box pairs we used to
> > train Tesseract.  With the right hardware, in theory, anyone could
> > replicate it.  We're hoping that the game changer for us will be our
> > meticulous, font specific training on the front-end, the power of
> > our Brazos supercomputing cluster to do enormous, parallelized
> > OCR'ing at large scale, and our post-processing "triage" methods
> > which will tell us whether poor results are due to the use of the
> > wrong font, bad segmentation, the presence of images on the page,
> > etc.  We'll also have several web-based tools for crowd-sourcing
> > corrections (like Typewright and Aletheia Layout Editor) on some
> > of the data that OCR just can't crack.
>
> All good and laudable, certainly, and I am very happy to hear it.
> Though the reliance on the proprietary Aletheia throws a cog in the
> works; nobody can replicate it without the permission and
> support of the Aletheia people, nor can anybody but them really
> dissect how that part of the system works. I know we keep harping on
> about it, but it is really important to a lot of us. Particularly,
> for me, for a publicly funded academic project.
>
> I look forward very much to the scripts that try to predict the
> reasons for poor results - the ways they figure that out can surely
> be fed back in to Tesseract to improve it further.
>
> Thanks for all the details about what you're up to, it's very
> interesting indeed.
>
> Nick
>
> P.S. Apologies for mis-spelling your name earlier.
>
> --
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
> ---
> You received this message because you are subscribed to a topic in the
> Google Groups "tesseract-ocr" group.
> To unsubscribe from this topic, visit
> https://groups.google.com/d/topic/tesseract-ocr/A1Qq_vfKyRs/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to
> [email protected].
> For more options, visit https://groups.google.com/groups/opt_out.
>


-- 
Bryan Tarpley
Graduate Research Assistant
Texas A&M | IDHMC
LAAH 439
[email protected]

