On Friday, 20 December 2013 at 21:58:17 UTC+1, Nick White wrote:
>
> > Tool allows to "cut" images on top of glyph data from PAGE file and afterwards
> > create Tesseract training page with respective box file. This can be used for
> > Tesseract training. I was testing this using script:
> > https://github.com/psnc-dl/page-generator/blob/master/src/etc/train.sh
> > and it seems that it can produce valid Tesseract profile.
>
> That sounds a lot like the tool that Matthew announced a few days
> ago (in this very thread). Can you explain the differences a little,
> please?
>

You mean FRANKEN+? Yes, to some extent it is similar. Page-generator was originally developed for the IMPACT project and was used in the experiments described in this report: http://lib.psnc.pl/publication/428. From the start it was designed as a command-line tool that can be easily integrated into a larger workflow.

In the first step it takes a PAGE XML file and a PNG file and prepares a "cut" font, that is, a set of individual glyph images. You can manually review which glyphs should go into the training set (we don't have such a nice browser as FRANKEN+) and then launch the second step. In the second step page-generator assembles the training images and prepares the corresponding box file.
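For readers who have not seen a Tesseract box file, the second step can be pictured roughly as below. This is only a sketch under stated assumptions, not page-generator's actual code: the `Glyph` class, the `layout_boxes` function, and the margin, gap, and line-height values are all made up for illustration. The one thing taken from Tesseract itself is the box line format, `<char> <left> <bottom> <right> <top> <page>`, with the origin in the bottom-left corner of the image.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    """Hypothetical stand-in for one reviewed glyph image from step one."""
    char: str    # the character this glyph image represents
    width: int   # glyph image width in pixels
    height: int  # glyph image height in pixels


def layout_boxes(glyphs: List[Glyph], page_width: int, page_height: int,
                 margin: int = 10, gap: int = 5, line_height: int = 40,
                 page: int = 0) -> List[str]:
    """Place glyphs left to right, top to bottom; return box-file lines.

    Only the coordinates are computed here. Pasting the glyph pixels onto
    the page image at the same positions is left out of this sketch.
    """
    lines = []
    x = margin
    # Distance from the top edge of the page to the bottom of the current row.
    row_bottom_from_top = margin + line_height
    for g in glyphs:
        if g.width > page_width - 2 * margin or g.height > line_height:
            raise ValueError(f"glyph {g.char!r} does not fit in a row")
        # Wrap to the next row when the glyph would cross the right margin.
        if x + g.width > page_width - margin:
            x = margin
            row_bottom_from_top += line_height + gap
        if row_bottom_from_top + margin > page_height:
            raise ValueError("page is full")
        left = x
        right = x + g.width
        # Box files measure y from the bottom of the image, so flip the axis.
        bottom = page_height - row_bottom_from_top
        top = bottom + g.height
        lines.append(f"{g.char} {left} {bottom} {right} {top} {page}")
        x = right + gap
    return lines


if __name__ == "__main__":
    sample = [Glyph("a", 20, 30), Glyph("b", 20, 38)]
    for line in layout_boxes(sample, page_width=60, page_height=200):
        print(line)
```

The axis flip is the part that is easy to get wrong: most image libraries count y downwards from the top, while the box file counts upwards from the bottom.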
Page-generator was developed in 2011, but we were not able to release it as open source until now.

> > Page-generator supports also output from our tool -- Cutouts
> > (http://wlt.synat.pcss.pl/cutouts,
> > https://confluence.man.poznan.pl/community/display/WLT/Cutouts+application)
> > which allows to work on preparation of training material.
>
> That's interesting. Am I correct in thinking that this replaces
> Aletheia as a tool to extract glyph images in your workflow? Is the
> code available? Is it freely licenced?
>

To a large extent, yes. The biggest difference is that Aletheia can handle non-rectangular polygons for marking characters (the issue of overlapping characters mentioned in this discussion). In Cutouts you can only specify rectangular boxes for characters (they are initially loaded from a box file), but the web interface has a tool which allows you to manually remove parts of overlapping glyphs from a given box. IMHO the effort is similar to making a non-rectangular selection, but it fits very well into the Tesseract training model.

Aletheia is a desktop tool, whereas Cutouts can be used to crowdsource the preparation of training materials. Apart from the main interface there is also an "audit"/moderation interface which allows you to validate the results of your crowd's work. Each glyph is represented as an XML file and three images: the original selection with overlapping parts of other characters, a binarized image of the glyph, and the final version after manual removal of the overlapping "noise".

As for the license and the source code, we would like to release this as open source, but this requires some additional work. I hope that it will happen at some point, but I don't know when ;-|. I will keep you posted if you are interested in further development of this tool.

Kind regards,
Adam

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.

