Hi Nick, thanks for taking the time to write.

>> Firstly, it's Tesseract 3.02.02, not 3.2

The Emgu wrapper around Tesseract returns a System.Version instance, which exposes integer values for its Major, Minor, Revision and Build properties, so the leading zeros are lost and 3.02 shows up as 3.2.

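To illustrate the point about integer version components, here is a minimal sketch. It is only an analogy written in Python, not Emgu or .NET code, and `parse_version` and `short_form` are hypothetical helpers made up for this example. It shows how a dotted version string loses its leading zeros once each component is stored as an integer.

```python
def parse_version(text):
    """Split a dotted version string into integer components.

    Hypothetical helper: it mimics storing each component as an
    integer, which discards any leading zeros.
    """
    return tuple(int(part) for part in text.split("."))


def short_form(components):
    """Join the first two integer components back into 'major.minor'."""
    return ".".join(str(c) for c in components[:2])


components = parse_version("3.02.02")
print(components)              # (3, 2, 2)
print(short_form(components))  # 3.2
```

Printing the integer components back therefore gives "3.2" even though the release is named 3.02.02.
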
>> Out of curiosity, why did you think that training would help you here?

After seeing this behaviour with several images, always with the same two characters, '5' and '6', I asked myself one simple question: why would the engine return two different character codes for two almost identical blobs? The most reasonable conclusion for me was that it has something to do with training.

>> (...) but (AFAIK) our documentation doesn't imply it anywhere.

You're right about that; I just followed what seemed like common sense to me.

>> You may just have to accept that the accuracy from Tesseract won't be 100%, I'm afraid.

Each batch has 40 images with 7 digits per image, and every batch contains 2 to 3 erroneous results (one wrong digit each), never fewer than 2, always with the same two digits. Counted per digit that is a high success rate, but counted per figure (7 digits) it ranges from 92.5% to 95.0%. That is too low for the client.

Any clues on how to improve this?

V.Lorz

On Wednesday, March 26, 2014 7:53:39 PM UTC+1, Nick White wrote:
> Hi V.Lorz,
>
> Firstly, it's Tesseract 3.02.02, not 3.2. We may release version 3.2
> someday, but not for a long time yet ;)
>
> Doing training is not going to help you, I'm afraid. The font is
> quite standard, so you aren't going to be able to do a better job at
> training Tesseract for it than the eng.traineddata provides.
>
> Out of curiosity, why did you think that training would help you
> here? I ask as it's a very common misconception, but (AFAIK) our
> documentation doesn't imply it anywhere.
>
> You may just have to accept that the accuracy from Tesseract won't
> be 100%, I'm afraid. Maybe someone else here has suggestions, but
> the image looks alright to me, so the general advice of "more
> preprocessing" may not be helpful.
>
> Nick
>
> On Wed, Mar 26, 2014 at 11:10:56AM -0700, V.Lorz wrote:
> > Hi All,
> >
> > I started integrating tesseract (version 3.2, EMGV) in a project for
> > recognizing short texts in scanned images.
> > Using some very simple image processing I extract the area of
> > interest to speed up the process.
> >
> > The errors I get are related to recognition results: tesseract
> > sometimes confuses the digits '6' and '5'. The image below is
> > recognized as "4436695" instead of "4436696". I'm using the default
> > eng.traineddata file bundled with the library. Using some other
> > trained data files from around the Inet I got the same results with
> > the same two digits (5 and 6). Before processing the image I
> > configure tesseract to process only digits.
> >
> > [VwAAAAASUV]
> >
> > Does anyone know what could be causing this error? How could I solve it?
> >
> > I started reading the guide for training the engine
> > (http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3) as
> > suggested in some other threads, but it is of next to no help for
> > me. Is there any other guide around for 'dummies' like
> > [presumably :(] me? In this case I want to train it using one image
> > that I created from 40 sampled documents (attached here). Using
> > jTessBoxEditor-1.0 I was able to generate and correct the box file.
> > What should I do next?
> >
> > Thanks a lot in advance, V.Lorz

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en
---
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/d/optout.

