Re: [tesseract-ocr] How to choose the stop condition of LSTM training

易鑫 Thu, 18 Apr 2019 02:01:37 -0700

Thank you. I see.

Lorenzo Bolzani <l.bolz...@gmail.com> wrote on Thu, 18 Apr 2019 at 3:00 PM:


> Yes, lstmeval is manual but easy to automate. I use a script like this:
>
> ./train.sh $NAME 100
> ./train.sh $NAME 300
> ./train.sh $NAME 400
> ./train.sh $NAME 500
> ./train.sh $NAME 750
> ./train.sh $NAME 1000
> ./train.sh $NAME 1200
> ...
>
> It runs short training sessions, saves the models into a folder and runs lstmeval.
> At the end I get a report like this:
>
> ext1-g_100: Eval Char error rate=1.4585826, Word error rate=13.347458
> ext1-g_300: Eval Char error rate=0.97829078, Word error rate=8.4745763
> ext1-g_400: Eval Char error rate=0.75069704, Word error rate=7.6271186
> ext1-g_500: Eval Char error rate=0.68842175, Word error rate=7.2033898
> ext1-g_750: Eval Char error rate=0.63577665, Word error rate=6.779661
> ext1-g_1000: Eval Char error rate=0.50223788, Word error rate=5.0847458
> ext1-g_1200: Eval Char error rate=0.47848338, Word error rate=5.5084746
> ext1-g_1400: Eval Char error rate=0.50223788, Word error rate=5.9322034
> ext1-g_1600: Eval Char error rate=0.47848338, Word error rate=5.0847458
> ext1-g_1800: Eval Char error rate=0.42583829, Word error rate=4.6610169
> ext1-g_2000: Eval Char error rate=0.4264803, Word error rate=4.2372881
> ext1-g_2250: Eval Char error rate=0.44124661, Word error rate=5.0847458
> ext1-g_2500: Eval Char error rate=0.42134419, Word error rate=4.2372881
> ext1-g_3000: Eval Char error rate=0.42583829, Word error rate=3.9548023
> ext1-g_3500: Eval Char error rate=0.3545748, Word error rate=2.9661017
> ext1-g_4000: Eval Char error rate=0.42070218, Word error rate=2.9661017
> ext1-g_4500: Eval Char error rate=0.38218138, Word error rate=2.9661017
> ext1-g_5000: Eval Char error rate=0.42070218, Word error rate=3.3898305
> ext1-g_5500: Eval Char error rate=0.37768728, Word error rate=2.1186441
> ext1-g_6000: Eval Char error rate=0.38731748, Word error rate=2.5423729
> ext1-g_6500: Eval Char error rate=0.34879668, Word error rate=2.1186441
> ext1-g_7000: Eval Char error rate=0.40529386, Word error rate=2.6836158
>
> and I can choose which model to use. Here I would pick the 3500 or the
> 6500 model: I usually prefer an early one so as not to risk overfitting. I
> could also train a little longer (8000, 9000, ...) to see if it improves
> further, but the score is already oscillating around a certain value.
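
Picking a model from a report like the one above can be scripted. The following is a minimal sketch, assuming the report has been saved to a text file with one `name: Eval Char error rate=X, Word error rate=Y` line per model; the function name `best_by_cer` is hypothetical and not part of the script described in this thread.

```shell
#!/usr/bin/env bash
# best_by_cer REPORT_FILE
# Prints the report line with the lowest character error rate (CER).
# On a tie the earliest line wins, i.e. the model trained for fewer iterations.
best_by_cer() {
  awk -F'Char error rate=' '
    NF > 1 {
      split($2, parts, ",")          # parts[1] holds the numeric CER
      cer = parts[1] + 0
      if (best == "" || cer < min) { min = cer; best = $0 }
    }
    END { if (best != "") print best }
  ' "$1"
}
```

Applied to the report above, this selects the ext1-g_6500 line, which has the lowest character error rate (0.34879668). It looks at the character error rate only, so it does not replace the judgment call about preferring an earlier model.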
>
> One note: the evaluation score is just a reference unless you have a lot of
> real-world data. If you are using synthetic data, it will likely differ
> from real-world data, so it is important not to overfit to it.
>
> You can improve the script with a loop that stops when the improvement
> over the best result stays below a threshold for a few epochs. I found no real
> advantage in doing this, as the training is quite fast and I have no problem
> letting it run while I do something else.
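
The stopping rule described here (no sufficient improvement over the best result for a few evaluations) could look roughly like the sketch below. It assumes the character error rates are fed in chronological order, one per line; the function name `should_stop` and its two parameters are hypothetical.

```shell
#!/usr/bin/env bash
# should_stop MIN_DELTA PATIENCE   (reads one error rate per line on stdin)
# Exit status 0 (stop) when the last PATIENCE evaluations each failed to beat
# the best score so far by at least MIN_DELTA; exit status 1 (keep training) otherwise.
should_stop() {
  awk -v delta="$1" -v patience="$2" '
    {
      cer = $1 + 0
      if (NR == 1 || best - cer >= delta) { best = cer; stale = 0 }
      else                                { stale++ }
    }
    END { exit (stale >= patience ? 0 : 1) }
  '
}
```

A training loop could call it after every evaluation, for example `if should_stop 0.01 3 < cer_history.txt; then break; fi`, where `cer_history.txt` is an assumed file holding the scores collected so far.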
>
>
>
> Lorenzo
>
>
> On Thu, 18 Apr 2019 at 05:55, 易鑫 <yixinlucky...@gmail.com> wrote:
>
>> Thank you very much.
>> >>"Train for a few epochs (100 or 1000 depending on how much data you
>> have), stop it and check with lstmeval if the *eval score* is improving.
>> Restart the training adding 100/1000 to the max_iterations and continue
>> from the previous model and repeat until the eval score stops improving,
>> or gets worse, for a few iterations."
>>
>> The eval step is manual: the user has to stop training, check the eval
>> data, and then resume training.
>> Is there a way to run the eval automatically, so that at each epoch we
>> can see both the training error and the eval error?
>>
>> Thanks.
>>
>>
>> Shree Devi Kumar <shreesh...@gmail.com> wrote on Thu, 18 Apr 2019 at 1:16 AM:
>>
>>> >BTW, for anybody: is there a way to query a model or a checkpoint for
>>> the net_specs?
>>>
>>> There is no existing utility to do that. However, Ray had dumped the
>>> info for tessdata_fast (and partly for tessdata_best) which has been posted
>>> in the wiki at
>>>
>>> https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-tessdata_fast
>>>
>>>
>>> On Wed, Apr 17, 2019 at 1:40 PM Lorenzo Bolzani <l.bolz...@gmail.com>
>>> wrote:
>>>
>>>>
>>>> Split the data set in two parts (80/20 for example), use the large one
>>>> for training and the other for evaluation.
>>>>
>>>> Train for a few epochs (100 or 1000 depending on how much data you
>>>> have), stop it and check with lstmeval whether the *eval score* is
>>>> improving. Restart the training, adding 100/1000 to max_iterations and
>>>> continuing from the previous model, and repeat until the eval score stops
>>>> improving, or gets worse, for a few iterations.
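
One round of this train-then-evaluate cycle could be wrapped in a function like the sketch below. It is not the script used in this thread: the function name `train_and_eval`, its arguments and the naming of the saved copies are assumptions, and it assumes that `PREFIX_checkpoint` already exists from an earlier `lstmtraining` run and that the error rates appear on a line containing "Eval Char error rate".

```shell
#!/usr/bin/env bash
# train_and_eval PREFIX ITERATIONS TRAINEDDATA TRAIN_LIST EVAL_LIST
# Continues training from PREFIX_checkpoint up to ITERATIONS total iterations,
# keeps a copy of the checkpoint, evaluates it and prints one report line.
train_and_eval() {
  local prefix=$1 iters=$2 traineddata=$3 train_list=$4 eval_list=$5

  lstmtraining \
    --continue_from "${prefix}_checkpoint" \
    --model_output "$prefix" \
    --traineddata "$traineddata" \
    --train_listfile "$train_list" \
    --max_iterations "$iters" >/dev/null 2>&1 || return 1

  # Keep a copy so that every evaluated stage can still be selected later.
  cp "${prefix}_checkpoint" "${prefix}_${iters}.checkpoint"

  # Capture both output streams and keep only the line with the error rates.
  local scores
  scores=$(lstmeval \
    --model "${prefix}_${iters}.checkpoint" \
    --traineddata "$traineddata" \
    --eval_listfile "$eval_list" 2>&1 | grep -o 'Eval Char error rate.*' | tail -n 1)
  printf '%s_%s: %s\n' "$(basename "$prefix")" "$iters" "$scores"
}
```

Calling it repeatedly with increasing iteration counts (100, 300, 400, ...) and appending the output to a file would accumulate one report line per stage.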
>>>>
>>>> You can use something like this for the split:
>>>>
>>>> cd train_folder/
>>>> ls | shuf | head -NNN | parallel mv {} eval_folder/
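
The same split can be written with a percentage instead of a fixed count and without GNU parallel. This is a minimal sketch assuming GNU coreutils and findutils (`shuf -z`, `xargs -r`, `mv -t`); the function name `split_eval` is hypothetical.

```shell
#!/usr/bin/env bash
# split_eval TRAIN_DIR EVAL_DIR PERCENT
# Moves a random PERCENT of the regular files in TRAIN_DIR into EVAL_DIR.
split_eval() {
  local train_dir=$1 eval_dir=$2 percent=$3
  local total count
  total=$(find "$train_dir" -maxdepth 1 -type f | wc -l)
  count=$(( total * percent / 100 ))
  mkdir -p "$eval_dir"
  # NUL-separated names keep file names with spaces intact.
  find "$train_dir" -maxdepth 1 -type f -print0 |
    shuf -z -n "$count" |
    xargs -0 -r mv -t "$eval_dir"
}
```

For an 80/20 split, `split_eval train_folder eval_folder 20` moves a fifth of the files. Like the one-liner above, it treats every file independently, so if one sample consists of several files (for example an image and its ground-truth text) they would have to be moved together, which this sketch does not handle.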
>>>>
>>>>
>>>> You can have a look here for a similar setup:
>>>> https://github.com/OCR-D/ocrd-train
>>>>
>>>>
>>>> Also, you do not strictly need to use append_index for simple fine
>>>> tuning; have a look at ocrd-train. If you are training for weird stuff it
>>>> could help.
>>>>
>>>> I think that the fast model uses 192 for the final LSTM layer, the default
>>>> 384 and the best model 512; see
>>>> <https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-tessdata_fast#version-string--40000alpha--network-specification>
>>>> (also <https://github.com/tesseract-ocr/tesseract/issues/1404>).
>>>>
>>>>
>>>>
>>>> BTW, for anybody: is there a way to query a model or a checkpoint for
>>>> the net_specs?
>>>>
>>>>
>>>> Lorenzo
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, 17 Apr 2019 at 05:35, <yixinlucky...@gmail.com> wrote:
>>>>
>>>>> Hello, everyone:
>>>>>    I am now training with the 4.0 LSTM engine; here are my commands:
>>>>>
>>>>> rm ~/tesstutorial/chi_sim_train -rf
>>>>>
>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts \
>>>>>   --training_text ../training_data/chi_sim_layer_training_text \
>>>>>   --langdata_dir ../langdata_lstm --tessdata_dir ./tessdata --lang chi_sim \
>>>>>   --linedata_only --noextract_font_properties --exposures "0" \
>>>>>   --maxpages 0 \
>>>>>   --workspace_dir ~/share/workspace/tmp \
>>>>>   --save_box_tiff \
>>>>>   --fontlist "NSimSun" \
>>>>>              "Times New Roman" \
>>>>>              "Arial Unicode MS" \
>>>>>              "SimSun" \
>>>>>              "Noto Sans CJK SC" \
>>>>>              "Noto Sans Mono CJK SC" \
>>>>>   --output_dir ~/tesstutorial/chi_sim_train \
>>>>>   --overwrite
>>>>>
>>>>> rm ~/tesstutorial/chi_sim_layer_from_chi_sim -rf
>>>>>
>>>>> mkdir -p ~/tesstutorial/chi_sim_layer_from_chi_sim
>>>>>
>>>>> combine_tessdata -e ../tessdata_best/chi_sim.traineddata \
>>>>>   ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim.lstm
>>>>>
>>>>> lstmtraining --model_output ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim_layer \
>>>>>   --continue_from ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim.lstm \
>>>>>   --traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
>>>>>   --append_index 5 --net_spec '[Lfx128 O1c1]' \
>>>>>   --train_listfile ~/tesstutorial/chi_sim_train/chi_sim.training_files.txt \
>>>>>   --max_iterations 30000
>>>>>
>>>>> lstmtraining --stop_training \
>>>>>   --continue_from ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim_layer_checkpoint \
>>>>>   --traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
>>>>>   --model_output ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim_layer.traineddata
>>>>>
>>>>>
>>>>>
>>>>> My question is how to decide the stop condition. I tried many
>>>>> max_iterations values, but the results are not very good.
>>>>>
>>>>> Thank you in advance.
>>>>>
>>>>> Sorry for my poor English.
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to tesseract-ocr+unsubscr...@googlegroups.com.
>>>>> To post to this group, send email to tesseract-ocr@googlegroups.com.
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/92c126cb-525e-4c2f-a1c8-bbd36db09e51%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/92c126cb-525e-4c2f-a1c8-bbd36db09e51%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>
>>>

