Thank you. I see.

On Thu, Apr 18, 2019 at 3:00 PM Lorenzo Bolzani <l.bolz...@gmail.com> wrote:
> Yes, lstmeval is manual but easy to automate. I use a script like this:
>
> ./train.sh $NAME 100
> ./train.sh $NAME 300
> ./train.sh $NAME 400
> ./train.sh $NAME 500
> ./train.sh $NAME 750
> ./train.sh $NAME 1000
> ./train.sh $NAME 1200
> ...
>
> It runs short trainings, saves the models into a folder and runs lstmeval.
> At the end I get a report like this:
>
> ext1-g_100: Eval Char error rate=1.4585826, Word error rate=13.347458
> ext1-g_300: Eval Char error rate=0.97829078, Word error rate=8.4745763
> ext1-g_400: Eval Char error rate=0.75069704, Word error rate=7.6271186
> ext1-g_500: Eval Char error rate=0.68842175, Word error rate=7.2033898
> ext1-g_750: Eval Char error rate=0.63577665, Word error rate=6.779661
> ext1-g_1000: Eval Char error rate=0.50223788, Word error rate=5.0847458
> ext1-g_1200: Eval Char error rate=0.47848338, Word error rate=5.5084746
> ext1-g_1400: Eval Char error rate=0.50223788, Word error rate=5.9322034
> ext1-g_1600: Eval Char error rate=0.47848338, Word error rate=5.0847458
> ext1-g_1800: Eval Char error rate=0.42583829, Word error rate=4.6610169
> ext1-g_2000: Eval Char error rate=0.4264803, Word error rate=4.2372881
> ext1-g_2250: Eval Char error rate=0.44124661, Word error rate=5.0847458
> ext1-g_2500: Eval Char error rate=0.42134419, Word error rate=4.2372881
> ext1-g_3000: Eval Char error rate=0.42583829, Word error rate=3.9548023
> ext1-g_3500: Eval Char error rate=0.3545748, Word error rate=2.9661017
> ext1-g_4000: Eval Char error rate=0.42070218, Word error rate=2.9661017
> ext1-g_4500: Eval Char error rate=0.38218138, Word error rate=2.9661017
> ext1-g_5000: Eval Char error rate=0.42070218, Word error rate=3.3898305
> ext1-g_5500: Eval Char error rate=0.37768728, Word error rate=2.1186441
> ext1-g_6000: Eval Char error rate=0.38731748, Word error rate=2.5423729
> ext1-g_6500: Eval Char error rate=0.34879668, Word error rate=2.1186441
> ext1-g_7000: Eval Char error rate=0.40529386, Word error rate=2.6836158
>
> and I can
> choose which model to use. Here I would pick the 3500 or the 6500; I
> usually prefer an early one so as not to risk overfitting. I could also
> train a little more (8000, 9000, ...) to see if it improves further, but
> it is already oscillating around a certain value.
>
> One note: the evaluation score is just a reference unless you have a lot of
> real-world data. If you are using synthetic data, it will likely differ
> from the real-world data, so it is important not to overfit to it.
>
> You can improve the script with a loop that stops when the improvement
> over the best result stays below a threshold for a few epochs. I found no
> real advantage in doing this, as the training is quite fast and I have no
> problem letting it run while I do something else.
>
> Lorenzo
>
> On Thu, Apr 18, 2019 at 05:55, 易鑫 <yixinlucky...@gmail.com> wrote:
>
>> Thank you very much.
>>
>> "Train for a few epochs (100 or 1000 depending on how much data you
>> have), stop it and check with lstmeval if the *eval score* is improving.
>> Restart the training adding 100/1000 to the max_iterations and continue
>> from the previous model and repeat until the eval score stops improving,
>> or gets worse, for a few iterations."
>>
>> The eval step is manual. The user has to stop training, check the eval
>> data, and then resume training.
>> Is there any way to run the eval automatically? I mean, so that at each
>> epoch we can see the training error and the eval error.
>>
>> Thanks.
>>
>> On Thu, Apr 18, 2019 at 1:16 AM Shree Devi Kumar <shreesh...@gmail.com> wrote:
>>
>>> >BTW, for anybody: is there a way to query a model or a checkpoint for
>>> the net_specs?
>>>
>>> There is no existing utility to do that.
>>> However, Ray dumped the info for tessdata_fast (and partly for
>>> tessdata_best), and it has been posted in the wiki at
>>>
>>> https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-tessdata_fast
>>>
>>> On Wed, Apr 17, 2019 at 1:40 PM Lorenzo Bolzani <l.bolz...@gmail.com>
>>> wrote:
>>>
>>>> Split the data set in two parts (80/20 for example), use the large one
>>>> for training and the other for evaluation.
>>>>
>>>> Train for a few epochs (100 or 1000 depending on how much data you
>>>> have), stop it and check with lstmeval whether the *eval score* is
>>>> improving. Restart the training, adding 100/1000 to max_iterations and
>>>> continuing from the previous model, and repeat until the eval score
>>>> stops improving, or gets worse, for a few iterations.
>>>>
>>>> You can use something like this for the split:
>>>>
>>>> cd train_folder/
>>>> ls | shuf | head -NNN | parallel mv {} eval_folder/
>>>>
>>>> You can have a look here for a similar setup:
>>>> https://github.com/OCR-D/ocrd-train
>>>>
>>>> Also, you do not strictly need to use append_index for simple fine
>>>> tuning; have a look at ocrd-train. If you are training for weird stuff
>>>> it could help.
>>>>
>>>> I think
>>>> <https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-tessdata_fast#version-string--40000alpha--network-specification>
>>>> (also <https://github.com/tesseract-ocr/tesseract/issues/1404>) that
>>>> the fast model uses 192 for the final lstm layer, 384 for default, 512
>>>> for the best model.
>>>>
>>>> BTW, for anybody: is there a way to query a model or a checkpoint for
>>>> the net_specs?
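The 80/20 split suggested in the quoted message could be scripted roughly as follows. This is only a sketch under stated assumptions: `split_eval` is a hypothetical helper name, it assumes GNU `find`, `shuf`, `xargs` and `mv`, and it assumes one file per training sample (if a sample is made of several files, such as an image plus its ground-truth text, they would have to be moved together, which this does not do).

```shell
#!/usr/bin/env bash
# Sketch only: split_eval is a hypothetical helper, not part of Tesseract
# or ocrd-train. Assumes GNU find/shuf/xargs/mv and one file per sample.

# split_eval TRAIN_DIR EVAL_DIR PERCENT
# Moves a random PERCENT of the regular files in TRAIN_DIR into EVAL_DIR.
split_eval() {
  local train_dir=$1 eval_dir=$2 percent=$3
  local total count
  mkdir -p "$eval_dir"
  # Count the candidate files (file names containing newlines would miscount).
  total=$(find "$train_dir" -maxdepth 1 -type f | wc -l)
  count=$(( total * percent / 100 ))
  # Shuffle NUL-separated names, keep the first $count, move them in one go.
  find "$train_dir" -maxdepth 1 -type f -print0 \
    | shuf -z -n "$count" \
    | xargs -0 -r mv -t "$eval_dir"
}

# Demo in a throwaway directory with ten empty placeholder files.
demo=$(mktemp -d)
mkdir "$demo/train_folder"
for i in 1 2 3 4 5 6 7 8 9 10; do : > "$demo/train_folder/sample$i"; done
split_eval "$demo/train_folder" "$demo/eval_folder" 20
echo "train: $(find "$demo/train_folder" -maxdepth 1 -type f | wc -l)"
echo "eval:  $(find "$demo/eval_folder" -maxdepth 1 -type f | wc -l)"
rm -rf "$demo"
```

Computing the count from a percentage avoids having to fill in the `-NNN` of the one-liner by hand; the percentage is rounded down.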
>>>>
>>>> Lorenzo
>>>>
>>>> On Wed, Apr 17, 2019 at 05:35, <yixinlucky...@gmail.com> wrote:
>>>>
>>>>> Hello, everyone:
>>>>> I am now training with LSTM 4.0. Here are my commands:
>>>>>
>>>>> rm -rf ~/tesstutorial/chi_sim_train
>>>>>
>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts \
>>>>>   --training_text ../training_data/chi_sim_layer_training_text \
>>>>>   --langdata_dir ../langdata_lstm --tessdata_dir ./tessdata \
>>>>>   --lang chi_sim --linedata_only --noextract_font_properties \
>>>>>   --exposures "0" \
>>>>>   --maxpages 0 \
>>>>>   --workspace_dir ~/share/workspace/tmp \
>>>>>   --save_box_tiff \
>>>>>   --fontlist "NSimSun" \
>>>>>   "Times New Roman" \
>>>>>   "Arial Unicode MS" \
>>>>>   "SimSun" \
>>>>>   "Noto Sans CJK SC" \
>>>>>   "Noto Sans Mono CJK SC" \
>>>>>   --output_dir ~/tesstutorial/chi_sim_train \
>>>>>   --overwrite
>>>>>
>>>>> rm -rf ~/tesstutorial/chi_sim_layer_from_chi_sim
>>>>>
>>>>> mkdir -p ~/tesstutorial/chi_sim_layer_from_chi_sim
>>>>>
>>>>> combine_tessdata -e ../tessdata_best/chi_sim.traineddata \
>>>>>   ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim.lstm
>>>>>
>>>>> lstmtraining \
>>>>>   --model_output ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim_layer \
>>>>>   --continue_from ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim.lstm \
>>>>>   --traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
>>>>>   --append_index 5 --net_spec '[Lfx128 O1c1]' \
>>>>>   --train_listfile ~/tesstutorial/chi_sim_train/chi_sim.training_files.txt \
>>>>>   --max_iterations 30000
>>>>>
>>>>> lstmtraining --stop_training \
>>>>>   --continue_from ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim_layer_checkpoint \
>>>>>   --traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
>>>>>   --model_output ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim_layer.traineddata
>>>>>
>>>>> My question is how to decide the stop condition. I tried many
>>>>> max_iterations values, but the results are
>>>>> not so good.
>>>>>
>>>>> Thank you in advance.
>>>>>
>>>>> Sorry for my poor English.
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to tesseract-ocr+unsubscr...@googlegroups.com.
>>>>> To post to this group, send email to tesseract-ocr@googlegroups.com.
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/92c126cb-525e-4c2f-a1c8-bbd36db09e51%40googlegroups.com.
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>> --
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
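For anyone who wants to automate the "pick a model" and "stop when it no longer improves" steps discussed in this thread, here is a rough sketch that works on a report in the format quoted above. It is not the `train.sh` from the thread, whose contents were not posted. `best_by_cer` and `should_stop` are hypothetical helper names, and the `min_delta` and `patience` parameters are assumptions for illustration, not Tesseract options.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and thresholds are illustrative assumptions,
# not part of train.sh or of any Tesseract tool.

# Read report lines on stdin and print the name and CER of the line with the
# lowest character error rate. Ties keep the earlier checkpoint.
best_by_cer() {
  awk -F'[=,:]' '
    /Eval Char error rate=/ {
      if (name == "" || $3 + 0 < best + 0) { name = $1; best = $3 }
    }
    END { if (name != "") print name, best }
  '
}

# should_stop MIN_DELTA PATIENCE   (report lines on stdin)
# Exit 0 ("stop") when the last PATIENCE evaluations each failed to beat the
# best CER so far by at least MIN_DELTA; exit 1 ("keep training") otherwise.
should_stop() {
  awk -F'[=,:]' -v min_delta="$1" -v patience="$2" '
    /Eval Char error rate=/ {
      cer = $3 + 0
      if (seen == 0 || best - cer >= min_delta) { best = cer; stale = 0 }
      else { stale++ }
      seen = 1
    }
    END { if (stale >= patience) exit 0; exit 1 }
  '
}

# Demo on three lines taken from the report quoted in this thread.
report='ext1-g_3000: Eval Char error rate=0.42583829, Word error rate=3.9548023
ext1-g_3500: Eval Char error rate=0.3545748, Word error rate=2.9661017
ext1-g_4000: Eval Char error rate=0.42070218, Word error rate=2.9661017'

printf '%s\n' "$report" | best_by_cer
if printf '%s\n' "$report" | should_stop 0.01 3; then
  echo "no recent improvement: stop"
else
  echo "still improving: keep training"
fi
```

A training loop could call `should_stop` on the accumulated report after each short run. As noted in the thread, the eval score is only a reference when the data is synthetic, so the lowest number is not automatically the best model to ship.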