Re: Franken+ Released -- New Tool For Training Tesseract on Fonts from Page Images

matthew christy Mon, 09 Dec 2013 14:06:30 -0800

Hi all, 

A few more corrections after talking further with the developer (I'm just a 
Project Manager these days).



> We are only using Aletheia as a tool to identify each glyph and create 
> tiff/box file pairs for each page processed. We are not using the PAGE 
> format or anything like that. 
>
Yeah, we are using the PAGE format. We were also using it previously when 
trying to develop Tesseract training directly from Aletheia and had 
developed a relatively simple XSLT to convert from PAGE to box file XML 
formats. Using the PAGE format has not been an issue for us, since we could 
easily transform it.
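
For anyone who wants to see what such a conversion involves, here is a minimal sketch in Python. It is not our XSLT. It assumes a glyph outline is already available as a list of (x, y) pixel points with 0,0 at the top-left of the page image, and it targets Tesseract's plain-text box line (symbol, left, bottom, right, top, page, with 0,0 at the bottom-left). The function name and arguments are hypothetical.

```python
def polygon_to_box_line(char, points, image_height, page=0):
    """Return one Tesseract box-file line for a glyph polygon.

    points: iterable of (x, y) pixel coordinates, origin at the top-left
    of the page image (an assumption about the input, not a guarantee).
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]

    # The box file only stores a rectangle, so reduce the polygon to
    # its bounding box.
    left, right = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)

    # Tesseract measures y upward from the bottom edge of the image,
    # so flip both y values.
    bottom = image_height - y_max
    top = image_height - y_min

    return f"{char} {left} {bottom} {right} {top} {page}"


# Example: a glyph outline on a page image that is 100 pixels tall.
line = polygon_to_box_line("a", [(10, 20), (30, 20), (30, 50), (10, 50)], 100)
```

Note that reducing the polygon to its bounding box throws away the outline, which matters for overlapping characters.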


> In fact, I think getting Franken+ to work with Tesseract/jTessBoxEditor 
> input should be a simple matter of adjusting the coordinate system that 
> Franken+ is expecting in the incoming box files (since Tesseract and 
> Aletheia box files have 0,0 in different corners).
>
I realized after talking to Bryan that someone would also have to develop 
code to cut the glyph images out of the page image tiff based on the 
boxes identified in the box file. However, since Tesseract and 
jTessBoxEditor work with rectangular boxes instead of polygons, these glyph 
images will end up with a lot of noise due to character overlap, and that 
noise will also have to be edited out. 
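
A rough sketch of the cutting step, in Python, in case it helps someone get started. It is not Franken+ code. It assumes Tesseract-style box lines (symbol, left, bottom, right, top, page, with 0,0 at the bottom-left) and, to stay self-contained, treats the page image as a plain list of pixel rows with row 0 at the top. All function names are hypothetical; a real version would load the tiff with an imaging library.

```python
def parse_box_line(line):
    """Split one box-file line into its symbol and integer fields."""
    # Split from the right so the five numeric fields are always found.
    symbol, left, bottom, right, top, page = line.rsplit(" ", 5)
    return symbol, int(left), int(bottom), int(right), int(top), int(page)


def box_to_crop_rect(left, bottom, right, top, image_height):
    """Convert a bottom-left-origin box to (x0, y0, x1, y1), top-left origin."""
    return left, image_height - top, right, image_height - bottom


def crop(pixels, rect):
    """Cut a rectangle out of an image stored as a list of pixel rows."""
    x0, y0, x1, y1 = rect
    return [row[x0:x1] for row in pixels[y0:y1]]


# Example: a 4x4 "image" whose pixel values encode row and column.
page_image = [[r * 10 + c for c in range(4)] for r in range(4)]
symbol, left, bottom, right, top, _ = parse_box_line("a 1 1 3 3 0")
rect = box_to_crop_rect(left, bottom, right, top, image_height=len(page_image))
glyph = crop(page_image, rect)
```

The crop is a plain rectangle, so any part of a neighboring character that falls inside it comes along too. That is the overlap noise that would still need cleaning up.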

Thanks,
Matt 


