Re: tesseract testing suite

Shree Devi Kumar Thu, 18 Apr 2013 23:56:05 -0700

On Thu, Apr 18, 2013 at 11:02 PM, Nick White <nick.wh...@durham.ac.uk> wrote:


> Hi Shree,
>
> I'm glad you found my article helpful. Apologies for the delay in my
> reply to you. I'll answer your questions below.
>

Thanks, Nick!


>
> > I have found that trying to improve recognition by adding more
> > training data sometimes leads to worse recognition. I am currently
> > trying with just one font.
> > Using multiple fonts sometimes fails with:
> >
> > Font id = -1/2, class id = 96/2922 on sample 70292
> > font_id >= 0 && font_id < font_id_map_.SparseSize():Error:Assert failed:in file
> > ..\..\clasne 622
>
> I don't think I've seen that failure before. But yes, you're right
> that adding more training data can produce worse results.
>

I figured out that one. It was caused by font_properties not having the
correct entries.
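
A minimal sketch of how this kind of mismatch could be caught before
training, assuming the training files follow the lang.fontname.expN naming
convention and that each font_properties line starts with the font name.
The file names, font names and function names below are hypothetical, and
this only checks that an entry exists, not that its flags are right.

```python
import re
from pathlib import Path


def fonts_in_training_files(filenames):
    """Collect font names from files named like lang.fontname.expN.box."""
    pattern = re.compile(r"^[^.]+\.([^.]+)\.exp\d+\.(?:box|tif|tiff|tr)$")
    fonts = set()
    for name in filenames:
        match = pattern.match(Path(name).name)
        if match:
            fonts.add(match.group(1))
    return fonts


def fonts_in_font_properties(text):
    """Return the first field of every non-empty font_properties line."""
    fonts = set()
    for line in text.splitlines():
        fields = line.split()
        if fields:
            fonts.add(fields[0])
    return fonts


def missing_fonts(filenames, font_properties_text):
    """List fonts used by training files but absent from font_properties."""
    used = fonts_in_training_files(filenames)
    declared = fonts_in_font_properties(font_properties_text)
    return sorted(used - declared)


# Hypothetical inputs: two fonts in the training files, one declared.
files = ["hin.fontA.exp0.box", "hin.fontA.exp1.box", "hin.fontB.exp0.box"]
props = "fontA 0 0 0 0 0\n"
print(missing_fonts(files, props))  # ['fontB']
```

An empty list means every font named in the training files has an entry.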

>
> > I would like to try your testing suite so that I can see whether
> > there is improvement in the training data - do you have a Windows
> > binary for the same?
>
> I don't have Windows binaries for them. The tools themselves should
> compile for Windows, but the issue is that to work beyond ASCII they
> need to be run with a wrapper script, that is Unix only
> ('ocrevalutf8'). I would recommend you set up Cygwin; they will be
> easy to compile and run from there.
>
>
Ok. I'll give it a try.
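
Until the tools are compiled, a rough stand-in for checking whether a
change in training data helps could look like the sketch below. It is not
Nick's testing suite and not the wrapper script mentioned above; it is just
a plain character accuracy computed from edit distance, with function names
chosen for illustration. It compares Unicode code points, so it works
beyond ASCII, but combining marks count as separate characters.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_accuracy(ground_truth, ocr_output):
    """Fraction of ground-truth characters not lost to edit errors."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(ground_truth, ocr_output)
    return (len(ground_truth) - errors) / len(ground_truth)


print(character_accuracy("abcd", "abxd"))  # 0.75
```

Running it on the same test page before and after adding training data
gives two numbers to compare.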



> > Is the recommended training process to train one font and then add
> > another? Or train them separately then merge??
>
> I'm not sure I understand the question. How do the above two methods
> differ, in the case of tesseract training?
>


The first method would be to train the language with font1 and, once that
is good, add files for font2, so that there is one traineddata lang file
covering font1 and font2.

The second method would be to train font1 and font2 separately as lang1
and lang2, and then at OCR time use the option -l lang1+lang2.

The third would be to train with training data for multiple fonts at the
same time for one language.
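
To make the difference at OCR time concrete, here is a small sketch that
only builds the command lines. The image and output names are made up, and
"lang", "lang1" and "lang2" stand for whatever traineddata files result
from the approaches described above.

```python
def ocr_command(image, output_base, languages):
    """Build the argv list for one OCR run; languages are joined with '+'."""
    if not languages:
        raise ValueError("at least one language is required")
    return ["tesseract", image, output_base, "-l", "+".join(languages)]


# One traineddata file that already covers both fonts.
single = ocr_command("page.tif", "page_single", ["lang"])
# One traineddata file per font, combined at OCR time.
merged = ocr_command("page.tif", "page_merged", ["lang1", "lang2"])

print(" ".join(single))  # tesseract page.tif page_single -l lang
print(" ".join(merged))  # tesseract page.tif page_merged -l lang1+lang2
```

Either list can be handed to subprocess.run to produce output for the same
page and compare the results.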


> > Does the order in which tif/box files are given matter?
>
> Not as far as I know.
>
> > If I am trying to fix errors, should new training data be given at
> > end of old training data or before?
>
> I also don't understand this question. Can you expand on what you
> mean, please?
>

I am using multiple tif/box pairs for one font. When I got a particular
kind of recurring error in the OCRed output, I created another box/tif pair
with the characters in question, hoping that the additional training would
help remove the errors.

So my question was whether I should add this new pair at the beginning of
the list of files given for training or at the end.

This assumes that the order in which the files are given influences
training. If it does not then this does not matter.

>
> Hope this helps, and I look forward to hearing back from you.
>

I appreciate your reply. From reading the forum discussion so far, it seems
that it is NOT recommended to train a language for which traineddata is
already available, as Google may provide improved data in future.

So I may continue this testing as an experiment only :-)

Thanks!

>
> Nick
>
> --
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to tesseract-ocr@googlegroups.com
> To unsubscribe from this group, send email to
> tesseract-ocr+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
> ---
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>
>

