Re: [tesseract-ocr] How to choose the stop condition of LSTM training

Lorenzo Bolzani Thu, 18 Apr 2019 00:00:51 -0700

Yes, lstmeval is manual but easy to automate. I use a script like this:

./train.sh $NAME 100
./train.sh $NAME 300
./train.sh $NAME 400
./train.sh $NAME 500
./train.sh $NAME 750
./train.sh $NAME 1000
./train.sh $NAME 1200
...
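
The train.sh helper itself is not included in this message. As a rough sketch only, one step of such a script might look like the following. The directory layout and the variables START_MODEL, TRAINEDDATA, TRAIN_LIST, EVAL_LIST, REPORT and RUN are all placeholders invented for illustration. The sed filter assumes lstmeval prints a line containing "Eval Char error rate=", as in the report format quoted in this message.

```shell
#!/bin/sh
# Hypothetical sketch of one train.sh step; every path and variable name
# here is a placeholder, not something given in this thread.
# Usage: train_step NAME MAX_ITERATIONS
# Set RUN=echo to print the training commands instead of running them.
RUN=${RUN:-}
REPORT=${REPORT:-report.txt}

train_step() {
    name="$1"; iters="$2"
    start="${START_MODEL:-base.lstm}"
    # Resume from this run's own checkpoint when an earlier step left one.
    if [ -f "checkpoints/${name}_checkpoint" ]; then
        start="checkpoints/${name}_checkpoint"
    fi
    mkdir -p checkpoints models

    # Train (or continue training) up to the requested iteration count.
    $RUN lstmtraining \
        --continue_from "$start" \
        --traineddata "$TRAINEDDATA" \
        --train_listfile "$TRAIN_LIST" \
        --model_output "checkpoints/$name" \
        --max_iterations "$iters"

    # Convert the checkpoint into a standalone model for this step.
    $RUN lstmtraining --stop_training \
        --continue_from "checkpoints/${name}_checkpoint" \
        --traineddata "$TRAINEDDATA" \
        --model_output "models/${name}_${iters}.traineddata"

    # Evaluate on the held-out list and keep only the error-rate line.
    $RUN lstmeval \
        --model "models/${name}_${iters}.traineddata" \
        --eval_listfile "$EVAL_LIST" 2>&1 |
        sed -n "s/.*\(Eval Char error rate=.*\)/${name}_${iters}: \1/p" >> "$REPORT"
}
```

Calling train_step with growing iteration counts (100, 300, 400, ...) continues from the previous checkpoint each time, so every call only adds the extra iterations.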


The script runs short trainings, saves each model into a folder and runs
lstmeval on it. At the end I get a report like this:

ext1-g_100: Eval Char error rate=1.4585826, Word error rate=13.347458
ext1-g_300: Eval Char error rate=0.97829078, Word error rate=8.4745763
ext1-g_400: Eval Char error rate=0.75069704, Word error rate=7.6271186
ext1-g_500: Eval Char error rate=0.68842175, Word error rate=7.2033898
ext1-g_750: Eval Char error rate=0.63577665, Word error rate=6.779661
ext1-g_1000: Eval Char error rate=0.50223788, Word error rate=5.0847458
ext1-g_1200: Eval Char error rate=0.47848338, Word error rate=5.5084746
ext1-g_1400: Eval Char error rate=0.50223788, Word error rate=5.9322034
ext1-g_1600: Eval Char error rate=0.47848338, Word error rate=5.0847458
ext1-g_1800: Eval Char error rate=0.42583829, Word error rate=4.6610169
ext1-g_2000: Eval Char error rate=0.4264803, Word error rate=4.2372881
ext1-g_2250: Eval Char error rate=0.44124661, Word error rate=5.0847458
ext1-g_2500: Eval Char error rate=0.42134419, Word error rate=4.2372881
ext1-g_3000: Eval Char error rate=0.42583829, Word error rate=3.9548023
ext1-g_3500: Eval Char error rate=0.3545748, Word error rate=2.9661017
ext1-g_4000: Eval Char error rate=0.42070218, Word error rate=2.9661017
ext1-g_4500: Eval Char error rate=0.38218138, Word error rate=2.9661017
ext1-g_5000: Eval Char error rate=0.42070218, Word error rate=3.3898305
ext1-g_5500: Eval Char error rate=0.37768728, Word error rate=2.1186441
ext1-g_6000: Eval Char error rate=0.38731748, Word error rate=2.5423729
ext1-g_6500: Eval Char error rate=0.34879668, Word error rate=2.1186441
ext1-g_7000: Eval Char error rate=0.40529386, Word error rate=2.6836158

and I can choose which model to use. Here I would pick the 3500 or the
6500 model: I usually prefer an early one so as not to risk overfitting. I
could also decide to train a little more (8000, 9000, ...) to see if it
improves further, but it is already oscillating around a certain value.
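
Picking candidates from such a report can also be scripted. The following is a minimal sketch, assuming the report is saved in a file with exactly the line format shown above; the function name rank_models and the file name report.txt are made up for illustration.

```shell
#!/bin/sh
# rank_models REPORT [N]: print the N report lines (default 3) with the
# lowest char error rate, best first.
# With "=" as the field separator, field 2 starts with the char error
# rate, and a numeric sort only reads that leading number.
rank_models() {
    LC_ALL=C sort -t= -k2,2n "$1" | head -n "${2:-3}"
}

# Example: rank_models report.txt 3
```

This ranks by char error rate only, so the word error rate and the preference for an earlier model still have to be weighed by hand.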

One note: the evaluation score is just a reference unless you have a lot of
real-world data. If you are using synthetic data, it will likely differ
from real-world data, so it is important not to overfit to it.

You can improve the script with a loop that stops when the improvement
over the best result stays below a threshold for a few consecutive
evaluations. I found no real advantage in doing this, as the training is
quite fast and I have no problem letting it run while I do something else.
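
For anyone who does want the automatic stop, here is a minimal sketch of the check, assuming the same report format as above. The function name should_stop and its patience and min-delta parameters are hypothetical, and the check looks at the char error rate only.

```shell
#!/bin/sh
# should_stop REPORT PATIENCE MIN_DELTA
# Succeeds (exit status 0) when the last PATIENCE evaluations in REPORT
# each failed to beat the best earlier char error rate by more than
# MIN_DELTA. Fails (exit status 1) otherwise.
should_stop() {
    awk -F'[=,]' -v patience="$2" -v delta="$3" '
        NF >= 4 {
            cer = $2 + 0    # field 2 is the char error rate
            if (n == 0 || cer < best - delta) { best = cer; stale = 0 } else { stale++ }
            n++
        }
        END { exit !(stale >= patience) }
    ' "$1"
}

# Example loop (train.sh is the script described earlier in this message):
#   for iters in 100 300 400 500 750 1000; do
#       ./train.sh "$NAME" "$iters"
#       should_stop report.txt 3 0.01 && break
#   done
```

Because the scores oscillate, a patience of one evaluation would probably stop too early; a few evaluations gives the curve room to recover.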



Lorenzo


On Thu, 18 Apr 2019 at 05:55, 易鑫 <yixinlucky...@gmail.com> wrote:

> Thank you very much.
> >>"Train for a few epochs (100 or 1000 depending on how much data you
> have), stop it and check with lstmeval if the *eval score* is improving.
> Restart the training adding 100/1000 to the max_iterations and continue
> from the previous model and repeat until the eval score stops improving,
> or gets worse, for a few iterations."
>
> The eval step is manual. The user has to stop training, check the eval
> data, and then resume training.
> Is there any way to run the eval automatically? I mean that after each
> epoch we could see the training error and the eval error.
>
> Thanks.
>
>
> On Thu, 18 Apr 2019 at 1:16 AM, Shree Devi Kumar <shreesh...@gmail.com> wrote:
>
>> >BTW, for anybody: is there a way to query a model or a checkpoint for
>> the net_specs?
>>
>> There is no existing utility to do that. However, Ray dumped the info
>> for tessdata_fast (and partly for tessdata_best), and it has been posted
>> on the wiki at
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-tessdata_fast
>>
>>
>> On Wed, Apr 17, 2019 at 1:40 PM Lorenzo Bolzani <l.bolz...@gmail.com>
>> wrote:
>>
>>>
>>> Split the data set into two parts (80/20 for example), use the larger
>>> one for training and the other for evaluation.
>>>
>>> Train for a few epochs (100 or 1000 depending on how much data you
>>> have), stop it and check with lstmeval if the *eval score* is
>>> improving. Restart the training adding 100/1000 to the max_iterations and
>>> continue from the previous model and repeat until the eval score stops
>>> improving, or gets worse, for a few iterations.
>>>
>>> You can use something like this for the split:
>>>
>>> cd train_folder/
>>> ls | shuf | head -NNN | parallel mv {} eval_folder/
>>>
>>>
>>> You can have a look here for a similar setup:
>>> https://github.com/OCR-D/ocrd-train
>>>
>>>
>>> Also, you do not strictly need append_index for simple fine-tuning;
>>> have a look at ocrd-train. If you are training for something unusual,
>>> it could help.
>>>
>>> I think (see
>>> <https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-tessdata_fast#version-string--40000alpha--network-specification>,
>>> also <https://github.com/tesseract-ocr/tesseract/issues/1404>) that the
>>> fast model uses 192 for the final LSTM layer, the default 384, and the
>>> best model 512.
>>>
>>>
>>>
>>> BTW, for anybody: is there a way to query a model or a checkpoint for
>>> the net_specs?
>>>
>>>
>>> Lorenzo
>>>
>>>
>>>
>>>
>>> On Wed, 17 Apr 2019 at 05:35, <yixinlucky...@gmail.com> wrote:
>>>
>>>> Hello everyone,
>>>>    I am training with the Tesseract 4.0 LSTM engine; here are my commands:
>>>>
>>>> rm -rf ~/tesstutorial/chi_sim_train
>>>>
>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts \
>>>>   --training_text ../training_data/chi_sim_layer_training_text \
>>>>   --langdata_dir ../langdata_lstm --tessdata_dir ./tessdata \
>>>>   --lang chi_sim --linedata_only --noextract_font_properties \
>>>>   --exposures "0" \
>>>>   --maxpages 0 \
>>>>   --workspace_dir ~/share/workspace/tmp \
>>>>   --save_box_tiff \
>>>>   --fontlist "NSimSun" \
>>>>     "Times New Roman" \
>>>>     "Arial Unicode MS" \
>>>>     "SimSun" \
>>>>     "Noto Sans CJK SC" \
>>>>     "Noto Sans Mono CJK SC" \
>>>>   --output_dir ~/tesstutorial/chi_sim_train \
>>>>   --overwrite
>>>>
>>>> rm -rf ~/tesstutorial/chi_sim_layer_from_chi_sim
>>>>
>>>> mkdir -p ~/tesstutorial/chi_sim_layer_from_chi_sim
>>>>
>>>> combine_tessdata -e ../tessdata_best/chi_sim.traineddata \
>>>>   ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim.lstm
>>>>
>>>> lstmtraining --model_output \
>>>>   ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim_layer \
>>>>   --continue_from ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim.lstm \
>>>>   --traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
>>>>   --append_index 5 --net_spec '[Lfx128 O1c1]' \
>>>>   --train_listfile \
>>>>   ~/tesstutorial/chi_sim_train/chi_sim.training_files.txt \
>>>>   --max_iterations 30000
>>>>
>>>> lstmtraining --stop_training --continue_from \
>>>>   ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim_layer_checkpoint \
>>>>   --traineddata \
>>>>   ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata --model_output \
>>>>   ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim_layer.traineddata
>>>>
>>>>
>>>>
>>>> My question is how to decide on the stop condition. I tried many
>>>> max_iterations values, but the results are not very good.
>>>>
>>>> Thank you in advance.
>>>>
>>>> Sorry for my poor English.
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to tesseract-ocr+unsubscr...@googlegroups.com.
>>>> To post to this group, send email to tesseract-ocr@googlegroups.com.
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/92c126cb-525e-4c2f-a1c8-bbd36db09e51%40googlegroups.com
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/92c126cb-525e-4c2f-a1c8-bbd36db09e51%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>
>>
>> --
>>
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>>
>

