Dear Bernard, thank you for your reaction. Your experience with combining languages is especially useful.
On the other hand, I am not sure which point you disagree with. I did not write that training is not useful. For example, there are several good experiences in this forum with training for old fonts (aka Fraktur or Gothic) [1], even if I am not sure whether anybody got 100% (see the Polish experience described in "Report on the comparison of Tesseract and ABBYY FineReader OCR engines" [2]).

[1] https://code.google.com/p/tesseract-ocr/wiki/AddOns#Community_training_project
[2] https://code.google.com/p/tesseract-ocr/wiki/Documentation#Other

Zdenko

On Wed, Mar 5, 2014 at 3:23 PM, Bernard Polarski <[email protected]> wrote:

> I am forced to disagree, for the simple and good reason that I am making
> progress over the current FRA module. But this is true only if I create a
> custom language and use it in combination with the standard one.
> This is probably due to the Cube stuff, which seems to be a real game
> changer in the standard language. But Cube is for the moment out of reach
> for custom training.
>
> Example: I have a moderately damaged image that gives 92% accuracy using
> -l FRA.
> I created another library called ADF and added a new word-dawg, new ambigs
> entries and a set of 15 certified box/tif pairs in the 'georgia' font taken
> from scanned books. Certified box files are re-checked to assert 100%
> correctness.
>
> Next I used it with -l FRA+ADF and got 98% accuracy on this same picture.
>
> BUT there are also regressions. Some characters that are correctly
> recognized with -l FRA are misread with -l FRA+ADF (namely, a damaged 'd'
> in the word <des> becomes <(les>), although many more characters that were
> not recognized with FRA are now correctly recognized with FRA+ADF. I tried
> to fix this with new rules in the ADF ambigs file, but the regression
> remains.
>
> It is worth knowing that when I tried to backport all the goodies from my
> custom-made ADF library directly into FRA, the results were disappointing.
> The best combination so far is to segregate all custom training into a new
> library and run tesseract with both (-l FRA+ADF in my case).
>
> I suspect that the failure to obtain improvements directly from
> modifications to the base language library (FRA for me) is due to the Cube
> stuff, which alters the rules.
>
> Also, I wrote a series of scripts around ImageMagick which, for a set of
> given images, each with its certified box file, generate a new box file
> using tesseract and compare the certified box file with the generated one
> in order to extract from the image every character with a mismatched
> translation. This process is fully automated. Next, the script collates all
> these certified failed-to-be-recognized characters into a new image to be
> retrained using the certified box. The result was disappointing, but not
> totally without effect. Just disappointing for the moment.
>
> I plan to improve the process: instead of extracting just the character,
> extract the word containing a failed character and create a new image
> from these failed words. To achieve that, I intend to explore the new
> feature image2text --output_word_boxes to help identify a word.
> Otherwise, I am ready to write a procedure to find the word boundaries for
> a given box.
>
> Last, I noticed a very big improvement from 3.02 to 3.03 on more-or-less
> damaged images. Version 3.03 alone showed more improvement than my tweaked
> 3.02 with -l FRA+ADF.
> When it comes to images with clear text (like ones saved from MS Word to a
> TIFF), I already see 100% correct results, but then ABBYY also gives 100%
> under these favorable conditions.
> The real challenge is scanned book images.
>
> On Wednesday, 5 March 2014 10:54:36 UTC+1, zdenop wrote:
>
>> You need to port all training tools to Android.
>>
>> Generally (my opinion):
>>
>> 1. Unless you have proof that you MUST do custom training, training
>>    is a waste of time (nobody has been able to create better language
>>    data than Google's for the existing languages and common fonts).
>> 2. Unless you understand the training process (you will probably
>>    need to read the source code), training is a waste of time.
>>
>> Zdenko
>>
>> On Wed, Mar 5, 2014 at 9:39 AM, Tushar Makkar <[email protected]> wrote:
>>
>>> I am using the tess-two library (https://github.com/rmtheis/tess-two)
>>> for OCR on Android. I want to create the training data on Android.
>>> I have followed
>>> https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>>> and successfully created training data on a Linux system. How can I do
>>> the same on Android using tess-two or any other library?
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To post to this group, send email to [email protected]
>>> To unsubscribe from this group, send email to
>>> [email protected]
>>> For more options, visit this group at
>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>> ---
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> For more options, visit https://groups.google.com/groups/opt_out.
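For anyone following the thread: the box-comparison step Bernard describes (compare a certified box file against the one tesseract generates and collect the characters that differ) could be sketched roughly as below. This is only an illustration, not Bernard's actual scripts, which were not posted. It assumes the Tesseract 3 box file layout of one `glyph left bottom right top page` entry per line, and it pairs boxes by largest overlapping area, which is an assumption of this sketch. The names `Box`, `parse_box_lines`, `overlap_area` and `find_mismatches` are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Optional, Tuple


class Box(NamedTuple):
    """One entry of a box file (assumed layout: glyph left bottom right top page)."""
    glyph: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(lines: Iterable[str]) -> List[Box]:
    """Parse box-file lines, skipping lines that do not have at least 5 fields."""
    boxes = []
    for line in lines:
        parts = line.split()
        if len(parts) < 5:
            continue
        left, bottom, right, top = (int(p) for p in parts[1:5])
        page = int(parts[5]) if len(parts) > 5 else 0
        boxes.append(Box(parts[0], left, bottom, right, top, page))
    return boxes


def overlap_area(a: Box, b: Box) -> int:
    """Area shared by two boxes on the same page (0 if they do not overlap)."""
    if a.page != b.page:
        return 0
    width = min(a.right, b.right) - max(a.left, b.left)
    height = min(a.top, b.top) - max(a.bottom, b.bottom)
    return width * height if width > 0 and height > 0 else 0


def find_mismatches(
    certified: List[Box], generated: List[Box]
) -> List[Tuple[Box, Optional[Box]]]:
    """Return (certified, generated) pairs whose glyphs differ.

    Each certified box is paired with the generated box that overlaps it most;
    a certified box with no overlapping generated box is paired with None.
    """
    mismatches = []
    for ref in certified:
        best = max(generated, key=lambda g: overlap_area(ref, g), default=None)
        if best is None or overlap_area(ref, best) == 0:
            mismatches.append((ref, None))
        elif best.glyph != ref.glyph:
            mismatches.append((ref, best))
    return mismatches
```

The coordinates of each mismatched certified box could then be used to crop the character (or its surrounding word) out of the source image for retraining, for example with ImageMagick as in Bernard's description.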

