>BTW, for anybody: is there a way to query a model or a checkpoint for the
net_specs?

There is no existing utility to do that. However, Ray dumped that
information for tessdata_fast (and partly for tessdata_best), and it is
posted on the wiki at
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-tessdata_fast


On Wed, Apr 17, 2019 at 1:40 PM Lorenzo Bolzani <l.bolz...@gmail.com> wrote:

>
> Split the data set into two parts (80/20, for example); use the larger one
> for training and the other for evaluation.
>
> Train for a limited number of iterations (100 or 1000, depending on how
> much data you have), stop, and check with lstmeval whether the *eval score*
> is improving. Restart the training, adding 100/1000 to max_iterations and
> continuing from the previous model. Repeat until the eval score stops
> improving, or gets worse, for a few rounds in a row.
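
The stopping rule described above can be sketched as a small shell helper. This is only an illustration, not part of Tesseract: the name should_stop and the "patience" argument are made up here, and it assumes you have written down one error rate per round from the lstmeval output (lower is better), oldest first.

```shell
#!/bin/sh
# Hypothetical helper, not a Tesseract tool.
# Usage: should_stop PATIENCE RATE1 RATE2 ...
# Prints "stop" when none of the last PATIENCE error rates is lower than
# the best rate seen before them; otherwise prints "continue".
should_stop() {
  patience=$1
  shift
  printf '%s\n' "$@" | awk -v p="$patience" '
    { v[NR] = $1 + 0 }
    END {
      # Not enough rounds yet to judge.
      if (NR <= p) { print "continue"; exit }
      # Best (lowest) error rate among the earlier rounds.
      best = v[1]
      for (i = 2; i <= NR - p; i++) if (v[i] < best) best = v[i]
      # Any improvement in the last p rounds means keep training.
      for (i = NR - p + 1; i <= NR; i++) if (v[i] < best) { print "continue"; exit }
      print "stop"
    }'
}
```

For example, `should_stop 2 5.0 4.0 3.5 3.6 3.7` would report that the last two rounds did not beat 3.5, so you would keep the checkpoint from the 3.5 round.
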
>
> You can use something like this for the split:
>
> cd train_folder/
> ls | shuf | head -NNN | parallel mv {} eval_folder/
>
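
A slightly more defensive variant of the split above, as a sketch only. The function name split_eval and the folder names are placeholders. It assumes GNU shuf is installed and that each training sample is a single file (for example one .lstmf per line image); if your samples are image/ground-truth pairs you would need to move both files together.

```shell
#!/bin/sh
# Hypothetical helper: move COUNT randomly chosen files from SRC to DST.
# Usage: split_eval SRC DST COUNT
split_eval() {
  src=$1
  dst=$2
  count=$3
  mkdir -p "$dst"
  # List regular files directly inside SRC, pick COUNT at random, move them.
  find "$src" -maxdepth 1 -type f | shuf -n "$count" | while IFS= read -r f; do
    mv -- "$f" "$dst"/
  done
}
```

With 1000 samples, `split_eval train_folder eval_folder 200` would give the 80/20 split mentioned earlier.
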
>
> You can have a look here for a similar setup:
> https://github.com/OCR-D/ocrd-train
>
>
> Also, you do not strictly need to use append_index for simple fine tuning;
> have a look at ocrd-train. It could help if you are training for something
> unusual.
>
> I think the fast model uses 192 for the final LSTM layer, the default
> uses 384, and the best model uses 512. See
> <https://github.com/tesseract-ocr/tesseract/wiki/Data-Files-in-tessdata_fast#version-string--40000alpha--network-specification>
> (also <https://github.com/tesseract-ocr/tesseract/issues/1404>).
>
>
>
> BTW, for anybody: is there a way to query a model or a checkpoint for the
> net_specs?
>
>
> Lorenzo
>
>
>
>
> On Wed, Apr 17, 2019 at 05:35, <yixinlucky...@gmail.com> wrote:
>
>> Hello, everyone:
>>    I am now training with the Tesseract 4.0 LSTM engine. Here are my
>> commands:
>>
>> rm -rf ~/tesstutorial/chi_sim_train
>>
>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts \
>>   --training_text ../training_data/chi_sim_layer_training_text \
>>   --langdata_dir ../langdata_lstm --tessdata_dir ./tessdata --lang chi_sim \
>>   --linedata_only --noextract_font_properties --exposures "0" \
>>   --maxpages 0 \
>>   --workspace_dir ~/share/workspace/tmp \
>>   --save_box_tiff \
>>   --fontlist "NSimSun" \
>>     "Times New Roman" \
>>     "Arial Unicode MS" \
>>     "SimSun" \
>>     "Noto Sans CJK SC" \
>>     "Noto Sans Mono CJK SC" \
>>   --output_dir ~/tesstutorial/chi_sim_train \
>>   --overwrite
>>
>> rm -rf ~/tesstutorial/chi_sim_layer_from_chi_sim
>>
>> mkdir -p ~/tesstutorial/chi_sim_layer_from_chi_sim
>>
>> combine_tessdata -e ../tessdata_best/chi_sim.traineddata \
>>   ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim.lstm
>>
>> lstmtraining \
>>   --model_output ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim_layer \
>>   --continue_from ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim.lstm \
>>   --traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
>>   --append_index 5 --net_spec '[Lfx128 O1c1]' \
>>   --train_listfile ~/tesstutorial/chi_sim_train/chi_sim.training_files.txt \
>>   --max_iterations 30000
>>
>> lstmtraining --stop_training \
>>   --continue_from ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim_layer_checkpoint \
>>   --traineddata ~/tesstutorial/chi_sim_train/chi_sim/chi_sim.traineddata \
>>   --model_output ~/tesstutorial/chi_sim_layer_from_chi_sim/chi_sim_layer.traineddata
>>
>>
>>
>> My question is how to decide the stop condition. I tried many
>> max_iterations values, but the results are not very good.
>>
>> Thank you in advance.
>>
>> Sorry for my poor English.
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-ocr+unsubscr...@googlegroups.com.
>> To post to this group, send email to tesseract-ocr@googlegroups.com.
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/92c126cb-525e-4c2f-a1c8-bbd36db09e51%40googlegroups.com
>> <https://groups.google.com/d/msgid/tesseract-ocr/92c126cb-525e-4c2f-a1c8-bbd36db09e51%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
>


-- 

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
