>BTW, for anybody: is there a way to query a model or a checkpoint for the net_specs?
There is no existing utility to do that. However, Ray dumped the info for tessdata_fast (and partly for tessdata_best), and it has been posted in the wiki at
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-tessdata_fast

On Wed, Apr 17, 2019 at 1:40 PM Lorenzo Bolzani <l.bolz...@gmail.com> wrote:

>
> Split the data set in two parts (80/20 for example); use the larger one for
> training and the other for evaluation.
>
> Train for a number of iterations (100 or 1000, depending on how much data
> you have), stop, and check with lstmeval whether the *eval score* is
> improving. Restart the training, adding 100/1000 to max_iterations and
> continuing from the previous model, and repeat until the eval score stops
> improving, or gets worse, for a few rounds.
>
> You can use something like this for the split:
>
> cd train_folder/
> ls | shuf | head -NNN | parallel mv {} eval_folder/
>
> You can have a look here for a similar setup:
> https://github.com/OCR-D/ocrd-train
>
> Also, you do not strictly need to use append_index for simple fine tuning;
> have a look at ocrd-train. If you are training for unusual material it could
> help.
>
> I think
> <https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-tessdata_fast#version-string--40000alpha--network-specification>
> (also <https://github.com/tesseract-ocr/tesseract/issues/1404>) that the fast
> model uses 192 for the final LSTM layer, 384 for the default, and 512 for
> the best model.
>
> BTW, for anybody: is there a way to query a model or a checkpoint for the
> net_specs?
>
> Lorenzo
>
> On Wed, Apr 17, 2019 at 05:35, <yixinlucky...@gmail.com> wrote:
>
>> Hello, everyone:
>> I am now training with LSTM 4.0. Here are my commands:
>>
>> rm ~/tesstutorial/chi_sim_train -rf
>>
>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts \
>>   --training_text ../training_data/chi_sim_layer_training_text \
>>   --langdata_dir ../langdata_lstm --tessdata_dir ./tessdata --lang chi_sim \
>>   --linedata_only --noextract_font_properties --exposures "0" \
>>   --maxpages 0 \
>>   --workspace_dir ~/share/workspace/tmp \
>>   --save_box_tiff \
>>   --fontlist "NSimSun" \
>>   "Times New Roman" \
>>   "Arial Unicode MS" \
>>   "SimSun" \
>>   "Noto Sans CJK SC" \
>>   "Noto Sans Mono CJK SC" \
>>   --output_dir ~/tesstutorial/chi_sim_train \
>>   --overwrite
>>
>> rm ~/tesstutorial/chi_sim_layer_from_chi_sim -rf
>>
>> mkdir -p ~/tesstutorial/chi_sim_layer_from_chi_sim
>>
>> combine_tessdata -e ../tessdata_best/chi_sim.traineddata \
>>   ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim.lstm
>>
>> lstmtraining \
>>   --model_output ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim_layer \
>>   --continue_from ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim.lstm \
>>   --traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
>>   --append_index 5 --net_spec '[Lfx128 O1c1]' \
>>   --train_listfile ~/tesstutorial/chi_sim_train/chi_sim.training_files.txt \
>>   *--max_iterations 30000*
>>
>> lstmtraining --stop_training \
>>   --continue_from ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim_layer_checkpoint \
>>   --traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
>>   --model_output ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim_layer.traineddata
>>
>> My question is how to decide on the stop condition. I tried many
>> max_iterations values, but the results are not very good.
>>
>> Thank you in advance.
>>
>> Sorry for my poor English.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-ocr+unsubscr...@googlegroups.com.
>> To post to this group, send email to tesseract-ocr@googlegroups.com.
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/92c126cb-525e-4c2f-a1c8-bbd36db09e51%40googlegroups.com
>> For more options, visit https://groups.google.com/d/optout.

--
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
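
The 80/20 split Lorenzo describes can also be applied to the training list file rather than to a folder of files. The following is only a minimal sketch under some assumptions: split_list is a hypothetical helper (not part of Tesseract), the list file has one training file path per line, and shuf from GNU coreutils is available. It avoids the dependency on GNU parallel.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): shuffle a list file that has
# one training file path per line, then split it into an evaluation list
# and a training list.
# Usage: split_list LIST_FILE EVAL_PERCENT TRAIN_OUT EVAL_OUT
split_list() {
    list_file=$1
    eval_percent=$2
    train_out=$3
    eval_out=$4

    total=$(wc -l < "$list_file")
    # Integer arithmetic rounds down; keep at least one line for evaluation.
    eval_count=$(( total * eval_percent / 100 ))
    [ "$eval_count" -lt 1 ] && eval_count=1

    # Shuffle once into a temporary file so both outputs come from the
    # same random order and no line lands in both lists.
    shuffled=$(mktemp)
    shuf "$list_file" > "$shuffled"
    head -n "$eval_count" "$shuffled" > "$eval_out"
    tail -n +"$(( eval_count + 1 ))" "$shuffled" > "$train_out"
    rm -f "$shuffled"
}

# Example call (list path taken from the thread; 20 percent and the output
# names are arbitrary choices):
# split_list ~/tesstutorial/chi_sim_train/chi_sim.training_files.txt 20 \
#     train_files.txt eval_files.txt
```

The two resulting lists could then be given to lstmtraining through --train_listfile and --eval_listfile, and a checkpoint checked with lstmeval using --model, --traineddata and --eval_listfile. If the list has only a few entries, the split will be coarse.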