On Friday, 20 December 2013 at 21:58:17 UTC+1, Nick White wrote:
>
>
> > Tool allows to "cut" images on top of glyph data from PAGE file and
> > afterwards create Tesseract training page with respective box file.
> > This can be used for Tesseract training. I was testing this using script:
> > https://github.com/psnc-dl/page-generator/blob/master/src/etc/train.sh
> > and it seems that it can produce valid Tesseract profile.
>
> That sounds a lot like the tool that Matthew announced a few days 
> ago (in this very thread). Can you explain the differences a little, 
> please? 
>
>
You mean FRANKEN+? Yes, to some extent it is similar. Page-generator was 
originally developed for the IMPACT project and was used in the 
experiments described in this report: http://lib.psnc.pl/publication/428. 
From the beginning it was designed as a command-line tool that can be 
easily integrated into a larger workflow. In the first step it takes a PAGE 
XML file and a PNG file and cuts out the glyph images (the "cut" font). You 
can then manually review which glyphs should go into the training set (we 
don't have such a nice browser as FRANKEN+) and launch the second step. In 
the second step page-generator assembles the training images and prepares 
the corresponding box file. 
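
To illustrate what "prepares the corresponding box file" means, here is a 
minimal sketch in Python. It is not page-generator's actual code; the Glyph 
class and the function names are made up for this example. It only assumes 
the usual Tesseract box line layout (character, left, bottom, right, top, 
page) with the origin at the bottom-left of the image, and glyph positions 
given as top-left-origin rectangles: 

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical glyph placement on an assembled training page."""
    char: str    # character the glyph image represents
    left: int    # x of the left edge, in pixels from the left of the page
    top: int     # y of the top edge, in pixels from the TOP of the page
    width: int
    height: int


def to_box_line(glyph: Glyph, page_height: int, page: int = 0) -> str:
    """Format one glyph as a box-file line.

    Box files measure y from the bottom of the image, so the
    top-left-origin coordinates are flipped here.
    """
    right = glyph.left + glyph.width
    box_bottom = page_height - (glyph.top + glyph.height)
    box_top = page_height - glyph.top
    return f"{glyph.char} {glyph.left} {box_bottom} {right} {box_top} {page}"


def to_box_file(glyphs, page_height: int, page: int = 0) -> str:
    """Join the box lines for all glyphs on one page."""
    return "".join(to_box_line(g, page_height, page) + "\n" for g in glyphs)
```

The only part that is easy to get wrong is the y-axis flip, which is why 
the sketch keeps it in one place. 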

Page-generator was developed in 2011, but we were not able to release it as 
open source until now.
 

> > Page-generator supports also output from our tool -- Cutouts
> > (http://wlt.synat.pcss.pl/cutouts,
> > https://confluence.man.poznan.pl/community/display/WLT/Cutouts+application)
> > which allows to work on preparation of training material.
>
> That's interesting. Am I correct in thinking that this replaces 
> Aletheia as a tool to extract glyph images in your workflow? Is the 
> code available? Is it freely licenced? 
>
>
To a large extent, yes. The biggest difference is that Aletheia can handle 
non-rectangular polygons for marking characters (the issue of overlapping 
characters mentioned in this discussion). In Cutouts you can only specify 
rectangular boxes for characters (they are initially loaded from a box 
file), but the web interface has a tool that lets you manually remove parts 
of overlapping glyphs from a given box. IMHO the effort is similar to making 
a non-rectangular selection, but it fits very well into the Tesseract 
training model. 

Aletheia is a desktop tool, whereas Cutouts can be used to crowdsource the 
preparation of training materials. Apart from the main interface there is 
also an "audit"/moderation interface which allows you to validate the 
results of your crowd's work. Each glyph is represented as an XML file and 
three images (the original selection, including overlapping parts of other 
characters; the binarized image of the glyph; and the final version after 
manual removal of the overlapping "noise"). 

As for the license and the source code, we would like to release this as 
open source, but this requires some additional work. I hope that it will 
happen at some point, but I don't know when ;-|. I will keep you posted if 
you are interested in further development of this tool.

Kind regards,
Adam


-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
