Hi Bryan,

I'm responding to parts of several of your messages below.

On Tue, Dec 10, 2013 at 03:15:31PM -0600, Bryan Tarpley wrote:
> Our initial findings are that trying to train Tesseract to recognize
> these ligatures is less effective than training it to treat them as
> separate characters. In other words, we're having better results
> normalizing on the front end, both in terms of accuracy and efficiency
> re: Tesseract.
That is surprising, because Tesseract segments characters into boxes
(just as it does in makebox mode) when it does OCR, so I'd expect
overlapping ligatures to be detected better when trained for directly
than when treated as separate characters. I suppose ligatures may, on
average, vary more, which might explain it. But it is still surprising.

> While you may put more than one character per line in a Tesseract box
> file, you cannot use more than one character at a time in the
> unicharambigs file, for instance (Google claims you can but you
> can't--it's a bug)

Have you reported this on the issue tracker? Please do if you haven't;
that is certainly a bug that should be fixed (and shouldn't be too
difficult to fix).

http://code.google.com/p/tesseract-ocr/issues/list

> The eMOP project will be releasing its entire workflow, including the
> source code for these post-processing algorithms, all of our Aletheia
> training data, and all of the tiff/box pairs we used to train
> Tesseract. With the right hardware, in theory, anyone could replicate
> it. We're hoping that the game changer for us will be our meticulous,
> font-specific training on the front end, the power of our Brazos
> supercomputing cluster to do enormous, parallelized OCR'ing at large
> scale, and our post-processing "triage" methods, which will tell us
> whether poor results are due to the use of the wrong font, bad
> segmentation, the presence of images on the page, etc. We'll also have
> several web-based tools for crowd-sourcing corrections (like
> Typewright and Aletheia Layout Editor) on some of the data that OCR
> just can't crack.

All good and laudable, certainly, and I am very happy to hear it.
Though the reliance on the proprietary Aletheia throws a spanner in the
works: nobody can replicate that part without the permission and
support of the Aletheia people, and nor can anybody but them really
dissect how that part of the system works. I know we keep harping on
about it, but it is really important to a lot of us.
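For anyone following along, the asymmetry being discussed is roughly
this (a minimal sketch; the coordinates below are made up, not from
eMOP's data). A box file line has the form
<glyph> <left> <bottom> <right> <top> <page>, with the origin at the
bottom-left of the image, and the glyph field may hold several
characters, so a ligature can be labelled as one box:

```
fi 125 430 156 470 0
r  160 430 178 470 0
```

The unicharambigs file, by contrast, uses one substitution rule per
line in the v1 format: token counts and tokens, tab-separated fields
with space-separated unigraphs inside a field, and a final 1 (mandatory)
or 0 (optional). Entries along the lines of the shipped English file
look like:

```
v1
1	m	2	r n	0
2	' '	1	"	1
```

and it is multi-character tokens within a single field that, per the
bug described above, reportedly do not work despite the documentation.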
Particularly, for me, for a publicly funded academic project.

I look forward very much to the scripts that try to predict the reasons
for poor results - the ways they figure that out can surely be fed back
into Tesseract to improve it further.

Thanks for all the details about what you're up to, it's very
interesting indeed.

Nick

P.S. Apologies for mis-spelling your name earlier.

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

