Re: [tesseract-ocr] Tesseract Guide for newbies (first draft)

Lorenzo Bolzani Thu, 07 Feb 2019 04:28:14 -0800

Hi Kristof,
good work, I have thought about doing this a few times myself. I only had a
quick look, so here are just a couple of quick notes; I'll try to read it
more carefully when I get time.


This thread about the font size is where I got the 30/40px indication:

https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/Wdh_JJwnw94/xk2ErJnFBQAJ

For my trainings (fine tuning) I used 48px (with 2px of white border, so the
text was about 44px). Maybe the size does not matter much if you do fine
tuning, but I never did a precise comparison. Maybe 48 is even better. The
white border probably was not important.

One thing to keep in mind is that IMO there is no single correct way to
train, because different fonts or different types of images (contrast,
noise, etc.) may work best with different parameters. So you need to
experiment a little with these if you want optimal results.

This leads to the most important part: "Am I done training?" Without a way
to answer this you are just wasting time.

What I describe in this post is not completely correct due to the way ocrd
works (I should discuss this on github to see if it should be fixed or not).

https://groups.google.com/forum/#!msg/tesseract-ocr/COJ4IjcrL6s/C1OeE9bWBgAJ

The basic idea of any machine learning training is this: split the data in
two parts, use one for training and the other to check the result. The
idea is that if you train too much on only a few things you get
exceptionally good at these, but you overspecialize and get worse at all the
rest (this is called overfitting). So you may get 99.999% accuracy on the
training set and 74% on the eval set. The eval set and real world data are
what really matter (real world is usually a little worse than eval).
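To make the split idea concrete, here is a minimal Python sketch. It is not what ocrd does internally; the function name, the file names and the fixed seed are all made up for illustration. The only point is that the two parts never overlap and that the same seed always gives the same split.

```python
import random

def split_train_eval(samples, train_ratio=0.9, seed=0):
    """Shuffle a copy of `samples` reproducibly and cut it into (train, eval)."""
    shuffled = sorted(samples)             # sort first so the input order does not matter
    random.Random(seed).shuffle(shuffled)  # private RNG, so the global seed is untouched
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical file names, only for the example.
files = [f"line_{i:04d}.lstmf" for i in range(1000)]
train, eval_ = split_train_eval(files, train_ratio=0.9)
print(len(train), len(eval_))
```

The eval part is only used to measure the error, never to train.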

The problem I found is that ocrd recreates the files list.train and
list.eval every time you run it (it was not designed for incremental
training I think). So, if you follow my instructions, you'll mix the train
and eval files and this is bad.

So I modified the ocrd Makefile to create these two files explicitly at the
beginning of the training (and only once).

This is the edit (about line 80):

# Create lists of lstmf filenames for training and eval
#lists: $(ALL_LSTMF) data/list.train data/list.eval
lists: $(ALL_LSTMF)

train-lists: data/list.train data/list.eval

Now you need to call "make train-lists" only once when you start a new
training session with new data (not after each "iteration step").
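The same "create the lists only once" rule, sketched in Python instead of make. This is just an illustration of the idea, not part of ocrd: the helper name is hypothetical, and only the file names list.train and list.eval come from the Makefile.

```python
import os
import tempfile

def write_lists_once(train, eval_, data_dir):
    """Write list.train / list.eval only if neither exists yet.

    Returns True if the files were written, False if an existing split was kept.
    """
    train_path = os.path.join(data_dir, "list.train")
    eval_path = os.path.join(data_dir, "list.eval")
    if os.path.exists(train_path) or os.path.exists(eval_path):
        return False  # keep the existing split untouched
    with open(train_path, "w") as f:
        f.write("\n".join(train) + "\n")
    with open(eval_path, "w") as f:
        f.write("\n".join(eval_) + "\n")
    return True

data_dir = tempfile.mkdtemp()  # stand-in for the real data/ directory
first = write_lists_once(["a.lstmf", "b.lstmf"], ["c.lstmf"], data_dir)
second = write_lists_once(["c.lstmf"], ["a.lstmf"], data_dir)  # ignored
print(first, second)
```

The second call does nothing, so later iteration steps cannot move eval files into the training list.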

Ocrd by default does a 90/10 split (RATIO_TRAIN := 0.90). If you have a
moderate amount of data (1,000 to 10,000 samples) do an 80/20 split. If you
have a ton of data (100k+ samples), 90/10 or even 95/5 may be fine.
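As a rough rule of thumb in code (a sketch only: the function is hypothetical, and what to do between 10,000 and 100,000 samples is my assumption, since I only gave numbers for the two ends):

```python
def suggest_train_ratio(n_samples):
    """Suggest a value for RATIO_TRAIN from the number of samples."""
    if n_samples <= 10_000:
        return 0.80   # moderate data: keep a larger eval share
    return 0.90       # assumption: use the ocrd default above that;
                      # with 100k+ samples 0.95 may also be fine

for n in (5_000, 200_000):
    ratio = suggest_train_ratio(n)
    n_train = round(n * ratio)
    print(n, ratio, n_train, n - n_train)
```

With 5,000 samples this leaves 1,000 images for eval; with 200,000 samples even a 10% share is 20,000 images, which is why a smaller eval share can be enough there.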

About PSM: I did my training with PSM 6, but for one model (the most complex
one, out of 8) I found that using PSM 13 when doing the recognition gives
better results for punctuation and other special characters.
Again, I do not know how much difference the PSM param makes during
training. From what I understand PSM 6 does some custom
cleanup/preprocessing to the images, while PSM 13 leaves them untouched
(completely?).
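If you want to compare the two modes at recognition time, the command line only differs in the --psm value. A small sketch that builds both commands (the image name and the model name "mymodel" are placeholders; the code only prints the commands and does not run tesseract):

```python
import shlex

def tesseract_cmd(image, lang, psm):
    """Build a tesseract command that prints the recognized text to stdout."""
    return ["tesseract", image, "stdout", "-l", lang, "--psm", str(psm)]

for psm in (6, 13):
    cmd = tesseract_cmd("line.png", "mymodel", psm)
    print(shlex.join(cmd))  # shlex.join needs Python 3.8+
```

Run both on the same set of images and compare the output against your ground truth to see which mode works better for your model.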

About the parameters you listed in your post: I know the meaning of a few
of them but I think that in general they are quite useless (or you need to
understand more to mess with them). What I mostly refer to is the output
from lstmeval. "char train" and "word train" are the recognition error
rates; these are probably the only ones to look at as a reference (but they
refer to the training data, not the eval data). "best char error" is the
best so far; the training is noisy and the error goes up and down. "delta"
is probably the variation from the previous output and "rms" is the root
mean square of something. In other words, you do not really need to
understand all of them to do the training.
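If you want to track these values over time, you can pull them out of the training log with a regular expression. This is only a sketch: the sample line and its numbers are invented for illustration, only the field names come from the list in your post, and the real log format may differ between versions.

```python
import re

# Field names taken from the list quoted in this thread.
FIELD_RE = re.compile(r"(rms|delta|char train|word train|skip ratio)=([0-9.]+)%")

def parse_metrics(line):
    """Return a dict {field name: value in percent} for the fields found in `line`."""
    return {name: float(value) for name, value in FIELD_RE.findall(line)}

# Invented sample line, NOT copied from a real run.
sample = ("At iteration 31/100/100, Mean rms=1.2%, delta=3.4%, "
          "char train=5.6%, word train=7.8%, skip ratio=0%,")
metrics = parse_metrics(sample)
print(metrics["char train"], metrics["word train"])
```

Plotting "char train" against the iteration number makes the noise easy to see.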

One iteration means one image, so max_iterations should be at least equal
to the number of your images. If you have a ton of images you may see that
you do not need to process all of them to reach the "saturation" point
where extra training is useless, but normally you want to process all of
them, even a few times (until the eval score stabilizes or gets worse for a
few iterations).
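The stopping rule can be written down in a few lines. This is a sketch under my own assumptions: the names and the "patience" of 3 checks are arbitrary, and the eval errors are whatever you collect yourself after each training step.

```python
def max_iterations_for(n_images, passes):
    """One iteration is one image, so `passes` full passes need n_images * passes."""
    return n_images * passes

def should_stop(eval_errors, patience=3):
    """True when the last `patience` eval errors did not beat the earlier best."""
    if len(eval_errors) <= patience:
        return False  # not enough history yet
    best_before = min(eval_errors[:-patience])
    return min(eval_errors[-patience:]) >= best_before

print(max_iterations_for(2000, 3))
print(should_stop([10.0, 8.0, 6.0, 5.0]))            # still improving
print(should_stop([10.0, 6.0, 5.0, 5.1, 5.0, 5.2]))  # flat for 3 checks
```

Note that the rule looks at the eval error, not at the training error, which normally keeps going down.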

One note: if you repeat the whole training multiple times (for example
trying different image sizes) you need to keep aside the list.train/eval
files, otherwise you compare against a different set of eval images (and
with a small data set this can make a big difference).

Another note: while you fine tune (specialize) on new font(s) you get a
little worse on all the others. If you care about other fonts too, you
should check them with lstmeval as well.
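For example, you could keep one eval list per font and build one lstmeval command for each. A sketch, where all paths and font names are placeholders and the code only prints the commands:

```python
import shlex

def lstmeval_cmd(model, traineddata, eval_listfile):
    """Build an lstmeval command for one checkpoint and one eval list."""
    return ["lstmeval",
            "--model", model,
            "--traineddata", traineddata,
            "--eval_listfile", eval_listfile]

# Hypothetical eval lists, one per font you care about.
eval_lists = {
    "new_font": "data/list.eval",
    "other_font": "data/other_font.eval",
}

for font, listfile in eval_lists.items():
    cmd = lstmeval_cmd("data/checkpoints/mymodel_checkpoint",
                       "data/mymodel/mymodel.traineddata",
                       listfile)
    print(font, shlex.join(cmd))
```

If the error on the other fonts grows too much while the new font improves, you probably fine tuned for too long.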


Bye

Lorenzo

On Thu, 7 Feb 2019 at 09:36, Kristóf Horváth <
vazzzeg...@gmail.com> wrote:

> Hi, I set out to make a newbie-friendly guide and I already have some
> material that might help people, but it's not complete yet. I would
> like people to read it and help out with comments where they can. I left
> some places empty or left notes of my own; please feel free to figure out
> what should be there. I really hope I didn't make big mistakes, but in case
> I did write something stupid, please share it in the form of constructive
> criticism.
> The following things are very unclear to me (in terms of what exactly they
> represent):
>
>    - radical-stroke.txt
>    - learning_rate
>    - noextract_font_properties
>    - 2 percent improvement
>    - time=
>    - best error was 100 @0
>    - iteration 31/100/100
>    - rms=
>    - delta=
>    - char train=
>    - word train=
>    - skip ratio=
>    - best char error=
>
> And finally here is the link
> <https://docs.google.com/document/d/1qDqbnlptcCPVIvMOHwfNws-CQat-llZLOTHC6S94Vec/edit?usp=sharing>.
> (Google Docs should be in English. I'm writing a wiki, so the formatting is
> based on wiki syntax; with the link you should be able to make comments.)
> In case you are really enthusiastic about it you can contact me for write
> access.
>
> Best Regards
> Kristof Horvath

