Yes, lstmeval is manual but easy to automate. I use a script like this:

./train.sh $NAME 100
./train.sh $NAME 300
./train.sh $NAME 400
./train.sh $NAME 500
./train.sh $NAME 750
./train.sh $NAME 1000
./train.sh $NAME 1200
...
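The repeated calls can also be generated from a list of iteration targets. This is only a minimal sketch: it assumes train.sh takes the model name and the iteration target as its two arguments, as in the calls above. The helper name run_schedule is hypothetical, and stopping at the first failing call is an assumption, not something the script is known to do.

```shell
#!/bin/sh
# run_schedule TRAIN_CMD NAME STEP...
# Hypothetical helper: calls TRAIN_CMD once per iteration target, in the
# order given, and stops at the first call that fails (an assumption).
run_schedule() {
    train_cmd="$1"
    name="$2"
    shift 2
    for steps in "$@"; do
        "$train_cmd" "$name" "$steps" || return 1
    done
}

# Usage with the targets listed above:
# run_schedule ./train.sh "$NAME" 100 300 400 500 750 1000 1200
```

Keeping the targets as plain arguments makes it easy to extend the schedule later without touching the loop.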
Each call runs a short training, saves the model into a folder and runs lstmeval on it. At the end I get a report like this:

ext1-g_100: Eval Char error rate=1.4585826, Word error rate=13.347458
ext1-g_300: Eval Char error rate=0.97829078, Word error rate=8.4745763
ext1-g_400: Eval Char error rate=0.75069704, Word error rate=7.6271186
ext1-g_500: Eval Char error rate=0.68842175, Word error rate=7.2033898
ext1-g_750: Eval Char error rate=0.63577665, Word error rate=6.779661
ext1-g_1000: Eval Char error rate=0.50223788, Word error rate=5.0847458
ext1-g_1200: Eval Char error rate=0.47848338, Word error rate=5.5084746
ext1-g_1400: Eval Char error rate=0.50223788, Word error rate=5.9322034
ext1-g_1600: Eval Char error rate=0.47848338, Word error rate=5.0847458
ext1-g_1800: Eval Char error rate=0.42583829, Word error rate=4.6610169
ext1-g_2000: Eval Char error rate=0.4264803, Word error rate=4.2372881
ext1-g_2250: Eval Char error rate=0.44124661, Word error rate=5.0847458
ext1-g_2500: Eval Char error rate=0.42134419, Word error rate=4.2372881
ext1-g_3000: Eval Char error rate=0.42583829, Word error rate=3.9548023
ext1-g_3500: Eval Char error rate=0.3545748, Word error rate=2.9661017
ext1-g_4000: Eval Char error rate=0.42070218, Word error rate=2.9661017
ext1-g_4500: Eval Char error rate=0.38218138, Word error rate=2.9661017
ext1-g_5000: Eval Char error rate=0.42070218, Word error rate=3.3898305
ext1-g_5500: Eval Char error rate=0.37768728, Word error rate=2.1186441
ext1-g_6000: Eval Char error rate=0.38731748, Word error rate=2.5423729
ext1-g_6500: Eval Char error rate=0.34879668, Word error rate=2.1186441
ext1-g_7000: Eval Char error rate=0.40529386, Word error rate=2.6836158

and I can choose which model to use. Here I would pick the 3500 or the 6500: I usually prefer an earlier one so as not to risk overfitting. I could also train a little more (8000, 9000, ...) to see if it improves further, but the score is already oscillating around a fixed value.
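The choice can be scripted too. What follows is a minimal sketch, assuming the report is available with one entry per line, in training order, in the format shown above. The function name pick_checkpoint and its tolerance argument are hypothetical: it prints the earliest entry whose char error rate is within the tolerance of the lowest one, which is one possible way to express "prefer an early model".

```shell
#!/bin/sh
# pick_checkpoint [TOL]
# Reads report lines such as
#   ext1-g_3500: Eval Char error rate=0.3545748, Word error rate=2.9661017
# from stdin and prints the earliest entry whose char error rate is within
# TOL (absolute, default 0.01) of the lowest one. TOL is a hypothetical knob.
pick_checkpoint() {
    tol="${1:-0.01}"
    awk -v tol="$tol" '
        # Keep only lines that carry a char error rate.
        match($0, /Char error rate=[0-9.]+/) {
            # Skip the 16 characters of the label "Char error rate=".
            text = substr($0, RSTART + 16, RLENGTH - 16)
            name = $1
            sub(/:$/, "", name)
            n++
            names[n] = name
            shown[n] = text        # original digits, for printing
            cers[n] = text + 0     # numeric value, for comparing
            if (n == 1 || cers[n] < best) best = cers[n]
        }
        END {
            # Entries are in training order, so the first hit is the earliest.
            for (i = 1; i <= n; i++)
                if (cers[i] <= best + tol) { print names[i], shown[i]; exit }
        }'
}

# Usage (report.txt is a hypothetical file holding the report):
# pick_checkpoint 0.01 < report.txt
```

On the report above, a tolerance of 0.01 selects ext1-g_3500 and a tolerance of 0 selects ext1-g_6500, the two candidates named in the message. The word error rate is ignored here; it could be added as a second criterion.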
One note: the evaluation score is only a reference unless you have a lot of real-world data. If you are using synthetic data, it will likely differ from the real-world data, so it is important not to overfit on it.

You can improve the script with a loop that stops when the improvement over the best result stays below a threshold for a few consecutive rounds. I found no real advantage in doing this, as the training is quite fast and I have no problem letting it run while I do something else.


Lorenzo

On Thu, 18 Apr 2019 at 05:55, 易鑫 <yixinlucky...@gmail.com> wrote:

> Thank you very much.
>
> >>"Train for a few epochs (100 or 1000 depending on how much data you
> have), stop it and check with lstmeval if the *eval score* is improving.
> Restart the training adding 100/1000 to the max_iterations and continue
> from the previous model and repeat until the eval score stops to improve,
> or gets worse, for a few iterations."
>
> The eval step is manual. The user should stop training and then check the
> eval data, then go on training ......
> Is there any method can do the eval automatically. I mean each epochs we
> can see the training error and eval error.
>
> Thanks.
>
> On Thu, 18 Apr 2019 at 1:16 AM, Shree Devi Kumar <shreesh...@gmail.com> wrote:
>
>> >BTW, for anybody: is there a way to query a model or a checkpoint for
>> the net_specs?
>>
>> There is no existing utility to do that. However, Ray had dumped the info
>> for tessdata_fast (and partly for tessdata_best) which has been posted in
>> the wiki at
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-tessdata_fast
>>
>> On Wed, Apr 17, 2019 at 1:40 PM Lorenzo Bolzani <l.bolz...@gmail.com>
>> wrote:
>>
>>> Split the data set in two parts (80/20 for example), use the large one
>>> for training and the other for evaluation.
>>>
>>> Train for a few epochs (100 or 1000 depending on how much data you
>>> have), stop it and check with lstmeval if the *eval score* is
>>> improving.
>>> Restart the training adding 100/1000 to the max_iterations and
>>> continue from the previous model and repeat until the eval score stops to
>>> improve, or gets worse, for a few iterations.
>>>
>>> You can use something like this for the split:
>>>
>>> cd train_folder/
>>> ls | shuf | head -NNN | parallel mv {} eval_folder/
>>>
>>> You can have a look here for a similar setup:
>>> https://github.com/OCR-D/ocrd-train
>>>
>>> Also you do not strictly need to use append_index for simple fine
>>> tuning, have a look at ocrd-train. If you are training for weird stuff it
>>> could help.
>>>
>>> I think
>>> <https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-tessdata_fast#version-string--40000alpha--network-specification>
>>> (also <https://github.com/tesseract-ocr/tesseract/issues/1404>) that
>>> fast model uses 192 for the final lstm layer, 384 for default, 512 for best
>>> model.
>>>
>>> BTW, for anybody: is there a way to query a model or a checkpoint for
>>> the net_specs?
>>>
>>> Lorenzo
>>>
>>> On Wed, 17 Apr 2019 at 05:35, <yixinlucky...@gmail.com> wrote:
>>>
>>>> Hello, everyone:
>>>> Now I am training use LSTM 4.0, here is my command:
>>>>
>>>> rm ~/tesstutorial/chi_sim_train -rf
>>>>
>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts \
>>>>   --training_text ../training_data/chi_sim_layer_training_text \
>>>>   --langdata_dir ../langdata_lstm --tessdata_dir ./tessdata --lang chi_sim \
>>>>   --linedata_only --noextract_font_properties --exposures "0" \
>>>>   --maxpages 0 \
>>>>   --workspace_dir ~/share/workspace/tmp \
>>>>   --save_box_tiff \
>>>>   --fontlist "NSimSun" \
>>>>   "Times New Roman" \
>>>>   "Arial Unicode MS" \
>>>>   "SimSun" \
>>>>   "Noto Sans CJK SC" \
>>>>   "Noto Sans Mono CJK SC" \
>>>>   --output_dir ~/tesstutorial/chi_sim_train \
>>>>   --overwrite
>>>>
>>>> rm ~/tesstutorial/chi_sim_layer_from_chi_sim -rf
>>>>
>>>> mkdir -p ~/tesstutorial/chi_sim_layer_from_chi_sim
>>>>
>>>> combine_tessdata -e ../tessdata_best/chi_sim.traineddata \
>>>>   ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim.lstm
>>>>
>>>> lstmtraining --model_output ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim_layer \
>>>>   --continue_from ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim.lstm \
>>>>   --traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
>>>>   --append_index 5 --net_spec '[Lfx128 O1c1]' \
>>>>   --train_listfile ~/tesstutorial/chi_sim_train/chi_sim.training_files.txt \
>>>>   *--max_iterations 30000*
>>>>
>>>> lstmtraining --stop_training \
>>>>   --continue_from ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim_layer_checkpoint \
>>>>   --traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
>>>>   --model_output ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim_layer.traineddata
>>>>
>>>> My question is how to decide the stop condition, I tried many
>>>> max_iterations values, but the results are not so good.
>>>>
>>>> Thank you in advance.
>>>>
>>>> Sorry for my poor English.
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to tesseract-ocr+unsubscr...@googlegroups.com.
>>>> To post to this group, send email to tesseract-ocr@googlegroups.com.
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/92c126cb-525e-4c2f-a1c8-bbd36db09e51%40googlegroups.com
>>>> For more options, visit https://groups.google.com/d/optout.