Re: Tesseract fails recognizing simple and isolated digits. How can I train tesseract for recognizing digits from unknown font type

V.Lorz Wed, 26 Mar 2014 13:57:12 -0700

Hi Nick, thanks for taking the time to write.

>> Firstly, it's Tesseract 3.02.02, not 3.2
The Emgu wrapper around Tesseract returns a System.Version instance, 
which exposes integer values for its Major, Minor, Revision and Build 
properties, so the "02" shows up as a plain 2.
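
To illustrate why the version printed as 3.2, here is a minimal Python sketch of what happens when a dotted version string is stored as integer components. The helper name as_integer_version is hypothetical and this is not the Emgu or .NET code, only the same idea:

```python
def as_integer_version(text):
    """Split a dotted version string into integers; leading zeros are lost."""
    return tuple(int(part) for part in text.split("."))


# "3.02.02" is the version string Nick quoted.
parts = as_integer_version("3.02.02")

# Printing only major.minor from the integer fields gives "3.2".
short = ".".join(str(p) for p in parts[:2])
print(parts, short)
```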


>> Out of curiosity, why did you think that training would help you here?
I asked myself one simple question after seeing this behaviour with several 
images, always with the same two characters, '5' and '6': why would 
the engine return two different character codes for two almost identical 
blobs? 
The most reasonable conclusion for me was that it has something to do with 
training.

>> (...) but (AFAIK) our documentation doesn't imply it anywhere. 
You're right on that, I just followed what for me was common sense.

>> You may just have to accept that the accuracy from Tesseract won't be 
100%, I'm afraid.
Each batch is 40 images with 7 digits per image. Every batch has 2 to 3 
misread images (one wrong digit each), never fewer than 2, and always with 
the same two digits. Counted per digit that is a high success rate, but 
counted per figure (all 7 digits correct) it ranges from 92.5% to 95.0%. 
Too low for the client.
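
The arithmetic behind those percentages can be checked with a short Python sketch. The function and constant names are made up for illustration, and it assumes each failed image contains exactly one wrong digit, as described above:

```python
IMAGES_PER_BATCH = 40
DIGITS_PER_IMAGE = 7


def batch_accuracy(wrong_digits):
    """Return (per-digit, per-figure) accuracy in percent.

    Assumes every wrong digit falls in a different image, so the number
    of failed figures equals the number of wrong digits.
    """
    total_digits = IMAGES_PER_BATCH * DIGITS_PER_IMAGE
    digit_acc = 100.0 * (total_digits - wrong_digits) / total_digits
    figure_acc = 100.0 * (IMAGES_PER_BATCH - wrong_digits) / IMAGES_PER_BATCH
    return digit_acc, figure_acc


for errors in (2, 3):
    digit_acc, figure_acc = batch_accuracy(errors)
    print(f"{errors} errors: {digit_acc:.2f}% of digits, {figure_acc:.1f}% of figures")
```

With 2 or 3 wrong digits out of 280, the per-digit rate stays around 99%, while the per-figure rate drops to 95.0% or 92.5%.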

Any clues on how to improve this?

V.Lorz

On Wednesday, March 26, 2014 7:53:39 PM UTC+1, Nick White wrote:
>
> Hi V.Lorz, 
>
> Firstly, it's Tesseract 3.02.02, not 3.2. We may release version 3.2 
> someday, but not for a long time yet ;) 
>
> Doing training is not going to help you, I'm afraid. The font is 
> quite standard, so you aren't going to be able to do a better job at 
> training Tesseract for it than the eng.traineddata provides. 
>
> Out of curiosity, why did you think that training would help you 
> here? I ask as it's a very common misconception, but (AFAIK) our 
> documentation doesn't imply it anywhere. 
>
> You may just have to accept that the accuracy from Tesseract won't 
> be 100%, I'm afraid. Maybe someone else here has suggestions, but 
> the image looks alright to me, so the general advice of "more 
> preprocessing" may not be helpful. 
>
> Nick 
>
> On Wed, Mar 26, 2014 at 11:10:56AM -0700, V.Lorz wrote: 
> > Hi All, 
> > 
> > I started integrating tesseract (version 3.2, Emgu) in a project for 
> > recognizing short texts in scanned images. Using some very simple image 
> > processing I extract the area of interest for speeding up the process. 
> > 
> > The errors I get are related to recognition results: tesseract sometimes 
> > confuses the digits '6' and '5'. The image below is recognized as "4436695" 
> > instead of "4436696". I'm using the default eng.traineddata file bundled with 
> > the library. Using some other trained data files from around the Internet I got 
> > the same results with the same two digits (5 and 6). Before processing the image 
> > I configure tesseract to process only digits. 
> > 
> > 
> > [image] 
> > 
> > Does anyone know what could be causing this error? How could I solve it? 
> > 
> > I started reading the guide for training the engine 
> > (http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3) as suggested 
> > in some other threads, but it is of next to no help for me. Is there any other 
> > guide around for 'dummies' like [presumably :(] me? In this case I want to 
> > train it using one image that I created from 40 sampled documents (attached 
> > here). Using jTessBoxEditor-1.0 I was able to generate and correct the box 
> > file. What should I do next? 
> > 
> > 
> > Thanks a lot in advance, V.Lorz 

-- 
-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

--- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.
