Nick,

No--the example I provided was from a footnote. I'm sure you're right that the original printer used "ligatures" in the sense that two or more characters were present on the same plate (forgive me for not knowing book history terminology! We work with folks at the Cushing Library here who are book history scholars, and they fill that knowledge gap for people like me :) ). The problem is that these custom "ligatures" are not available as single characters in Unicode.

We originally tried to place multiple characters in single "boxes" to train Tesseract, but the results for us were poor. While you may put more than one character per line in a Tesseract box file, you cannot use more than one character at a time in the unicharambigs file, for instance (Google claims you can, but you can't--it's a bug). So we made a decision to treat most "ligatures" as separate characters, and while we're still amassing testing data, the results are better. Granted, certain ligatures like "fl" or "sl" do have Unicode values, so we use those.
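In case it helps to see the two files side by side, here's roughly what they look like--the coordinates and the "ct" ligature below are invented for illustration. In a Tesseract box file, each line is <glyph(s)> <left> <bottom> <right> <top> <page>, with pixel coordinates measured from the image's bottom-left corner, and nothing stops you from putting two characters in one box:

```
ſ 120 40 138 92 0
ct 140 40 188 92 0
```

The unicharambigs file (v1 format, tab-separated fields: count of unichars in the ambiguous sequence, the space-separated sequence, count in the replacement, the replacement, and a 0/1 optional/mandatory flag) is, as we read the documentation, supposed to accept multi-character entries--e.g. mapping a misread "c t" pair to a single "ct" entry--but this is exactly where things broke down for us:

```
v1
2	c t	1	ct	0
```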
With Franken+, using polygons to bound those characters that normally overlap with others has allowed us to snip them out of context and reproduce synthetic tiff images in which they do not overlap. These synthetic images (where each character is pristine and none overlap) are what we're using to train Tesseract.

In terms of your question about Aletheia: while bounding lines, paragraphs, etc. is not necessary for training Tesseract, we're using several post-processing algorithms to detect whether problems with initial OCR are due to poor line segmentation, reading order, column detection, etc., and in order to "train" these algorithms we need sample data--hence the meticulous bounding of characters, words, lines, paragraphs, columns, etc. of our training tiff images using Aletheia.

The eMOP project will be releasing its entire workflow, including the source code for these post-processing algorithms, all of our Aletheia training data, and all of the tiff/box pairs we used to train Tesseract. With the right hardware, in theory, anyone could replicate it. We're hoping that the game changer for us will be our meticulous, font-specific training on the front end, the power of our Brazos supercomputing cluster to do enormous, parallelized OCR'ing at large scale, and our post-processing "triage" methods, which will tell us whether poor results are due to the use of the wrong font, bad segmentation, the presence of images on the page, etc. We'll also have several web-based tools for crowd-sourcing corrections (like TypeWright and the Aletheia Layout Editor) on some of the data that OCR just can't crack.

I hope that answers some of your questions--thanks for the feedback!

Bryan

On Tuesday, December 10, 2013 12:45:53 PM UTC-6, Nick White wrote:

> Hi Brian, nice to hear from you.
>
> > We began using Aletheia because it was the only tool we were aware of at the
> > time which allows us to binarize an image, clean up artifacts, and bound not
> > only characters but words, lines, paragraphs, columns, pages, etc for
> > font-training purposes. The student workers who we pay to do much of this work
> > have varying levels of comfort/expertise with computers, so Aletheia also
> > proved to be the most GUI driven, user-friendly tool out there.
>
> When you say it binds words, lines, paragraphs for font training
> purposes, can you explain what you mean? I haven't used Aletheia, so
> it isn't obvious to me.
>
> Do you mean that the interface is separated by words, so people
> correcting the box files can (for example) see that "babe" is
> misrecognised as "bard" and then just click near the word and type
> "babe"? I can see that this could be a faster approach to correcting
> things, potentially. I don't think the current box editors we have
> are very focused towards this sort of "proofreading" model, and
> perhaps they should be more so.
>
> Looking forward to hearing more from you,
>
> Nick

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en
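P.S. In case a concrete illustration helps, here's a rough Python sketch of the synthetic tiff/box pairing I described above. Everything in it is invented for illustration--Franken+ does the real polygon snipping and image compositing--so this only shows the layout bookkeeping: padding glyphs apart so none overlap, and emitting the matching box-file lines.

```python
# Hypothetical sketch of pairing a synthetic training image with its box file.
# Real glyph images would be the polygon crops from Franken+; here each glyph
# is just a (char, width, height) stub so the layout/box logic stays visible.

PAD = 5            # whitespace inserted between glyphs so none overlap
MARGIN = 10        # page margin around the composed line


def compose_line(glyphs, page=0):
    """Lay glyph stubs left-to-right and return Tesseract box-file lines.

    Box format (one box per line, origin at the image's bottom-left):
        <glyph(s)> <left> <bottom> <right> <top> <page>
    """
    boxes = []
    x = MARGIN
    for ch, width, height in glyphs:
        left, right = x, x + width
        bottom, top = MARGIN, MARGIN + height
        boxes.append(f"{ch} {left} {bottom} {right} {top} {page}")
        x = right + PAD  # padding guarantees no overlap in the synthetic image
    return boxes


if __name__ == "__main__":
    # "ct" kept as one multi-glyph box, the rest as single characters
    line = compose_line([("T", 30, 45), ("h", 25, 45), ("e", 22, 30), ("ct", 48, 45)])
    print("\n".join(line))
```

The same bookkeeping that places each crop on the canvas produces the box coordinates, which is what keeps the synthetic tiff and its box file in lockstep.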

