Hi all,

Some more corrections after talking to the developer again (I'm just a Project Manager these days).

> We are only using Aletheia as a tool to identify each glyph and create
> tiff/box file pairs for each page processed. We are not using the PAGE
> format or anything like that.

Yeah, we are in fact using the PAGE format. We were also using it previously when trying to develop Tesseract training directly from Aletheia, and had developed a relatively simple XSLT to convert PAGE XML to the box file format. Using the PAGE format has not been an issue for us, since we could easily transform it.

> In fact, I think getting Franken+ to work with Tesseract/jTessBoxEditor
> input should be a simple matter of adjusting the coordinate system that
> Franken+ is expecting in the incoming box files (since Tesseract and
> Aletheia box files have 0,0 in different corners).

I realized after talking to Bryan that someone would also have to develop code to cut the glyph images out of the page image TIFF, based on the boxes identified in the box file. However, since Tesseract and jTessBoxEditor work with rectangular boxes instead of polygons, these glyph images will end up with a lot of noise due to character overlap, so that will also have to be edited out.

Thanks,
Matt

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en
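
P.S. To make the two steps above concrete (flipping the coordinate origin and cutting a rectangle per box), here is a minimal Python sketch. It is not anything we have actually built. It assumes the usual six-field Tesseract box line (`glyph left bottom right top page`) with 0,0 at the bottom-left of the page, and it assumes the consumer wants 0,0 at the top-left, which I have not confirmed for Franken+. The names `Box`, `parse_box_line` and `crop` are made up for illustration, and the "page" is just a small grid of numbers standing in for the TIFF.

```python
from typing import List, NamedTuple, Sequence


class Box(NamedTuple):
    """One glyph rectangle, measured from the TOP-left corner of the page."""
    glyph: str
    left: int
    upper: int
    right: int
    lower: int


def parse_box_line(line: str, page_height: int) -> Box:
    """Parse 'glyph left bottom right top page' and flip y to a top-left origin.

    Assumption: the input uses a bottom-left origin, as Tesseract box files do.
    """
    # Split from the right so the glyph field is whatever is left over.
    glyph, left, bottom, right, top, _page = line.rsplit(None, 5)
    return Box(
        glyph,
        int(left),
        page_height - int(top),     # the box's top edge becomes the smaller y
        int(right),
        page_height - int(bottom),  # the box's bottom edge becomes the larger y
    )


def crop(pixels: Sequence[Sequence[int]], box: Box) -> List[List[int]]:
    """Cut one glyph's rectangle out of a row-major pixel grid (row 0 = top)."""
    return [list(row[box.left:box.right]) for row in pixels[box.upper:box.lower]]


# Tiny 4x4 stand-in for a page image; the "glyph" sits in the top two rows.
page = [
    [0, 5, 6, 0],
    [0, 7, 8, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

# In bottom-left coordinates the glyph spans y = 2..4.
box = parse_box_line("a 1 2 3 4 0", page_height=len(page))
glyph_pixels = crop(page, box)
print(box)
print(glyph_pixels)
```

With a real page you would replace the list slicing with an imaging library's crop call. Note that this only cuts rectangles, so any pixels from overlapping neighbouring characters come along with the glyph and would still need to be cleaned up afterwards.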

