Hi Reza,

Attached are two scripts and one log file. You will need to change the directories in the scripts.

finetune.sh and the finetune log file are for a sample finetuning run for eng. By changing the language code you can run it for fas; use that as a test.

plus-fas.sh is for the plusminus type of finetuning for fas. It merges the existing unicharset with the unicharset extracted from the training_text. You will need to update the training_text file in langdata/fas. Optionally, you can also review and update the wordlist, numbers and punc files.

The scripts should run if you give the correct directory names.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, May 19, 2018 at 9:24 AM, reza <reza6...@gmail.com> wrote:
> hi ShreeDevi
>
> Thanks.
>
> I tested the two models that you provided. The accuracy on samples
> without noise was about 98%, but on scanned samples or captured images it
> was about 80%, and it still didn't work on different fonts.
> Could you send all the files needed for training the models? I want to
> fine-tune the model with more fonts and diacritics.
>
> best regards
>
> On Friday, May 18, 2018 at 8:49:54 PM UTC+4:30, shree wrote:
>>
>> I have posted a couple of test models for Farsi at
>> https://github.com/Shreeshrii/tessdata_shreetest
>>
>> These have not been trained on text with diacritics, as the normalization
>> and training process was giving errors on the combining marks.
>>
>> Please give them a try and see if they provide better recognition for
>> numbers and text without combining marks.
>>
>> FYI, I do not know the Persian language, so it is difficult for me to
>> gauge whether the results are OK or not.
>>
>> ShreeDevi
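To adapt the attached finetune.sh for Farsi, only the language code and the directory variables at the top of the script need to change; a minimal sketch of the block to edit (the paths are placeholders for wherever you keep your checkouts):

  # variables to review at the top of finetune.sh before a fas run
  Lang=fas                                  # language to finetune
  Continue_from_lang=fas                    # 'best' traineddata to continue from
  bestdata_dir=../tessdata_best             # checkout of tessdata_best
  tessdata_dir=../tessdata                  # checkout of tessdata (configs, osd.traineddata, pdf.ttf)
  tesstrain_dir=../tesseract/src/training   # training scripts from the tesseract source tree
  langdata_dir=../langdata                  # checkout of langdata (fas/fas.training_text etc.)
  fonts_dir=../.fonts                       # fonts installed on this system

You will also want to replace 'FreeSerif' and 'Arial' in fonts_for_training and fonts_for_eval with Persian fonts, as plus-fas.sh does.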
ubuntu@tesseract-ocr:~/tess4training$ bash -x ./tesstrain_finetune.sh
+ MakeTraining=yes
+ MakeEval=yes
+ RunTraining=yes
+ Lang=eng
+ Continue_from_lang=eng
+ bestdata_dir=../tessdata_best
+ tessdata_dir=../tessdata
+ tesstrain_dir=../tesseract/src/training
+ langdata_dir=../langdata
+ fonts_dir=../.fonts
+ fonts_for_training=' '\''FreeSerif'\'' '
+ fonts_for_eval=' '\''Arial'\'' '
+ train_output_dir=./finetune_train_eng
+ eval_output_dir=./finetune_eval_eng
+ trained_output_dir=./finetune_trained_eng-from-eng
+ '[' yes = yes ']'
+ echo '###### MAKING TRAINING DATA ######'
###### MAKING TRAINING DATA ######
+ rm -rf ./finetune_train_eng
+ mkdir ./finetune_train_eng
+ echo '#### run tesstrain.sh ####'
#### run tesstrain.sh ####
+ eval bash ../tesseract/src/training/tesstrain.sh --lang eng --linedata_only --noextract_font_properties --exposures 0 --fonts_dir ../.fonts --fontlist ''\''FreeSerif'\''' --langdata_dir ../langdata --tessdata_dir ../tessdata --training_text ../langdata/eng/eng.training_text --output_dir ./finetune_train_eng
++ bash ../tesseract/src/training/tesstrain.sh --lang eng --linedata_only --noextract_font_properties --exposures 0 --fonts_dir ../.fonts --fontlist FreeSerif --langdata_dir ../langdata --tessdata_dir ../tessdata --training_text ../langdata/eng/eng.training_text --output_dir ./finetune_train_eng

=== Starting training for language 'eng'
[Sat May 19 04:20:00 UTC 2018] /usr/local/bin/text2image --fonts_dir=../.fonts --font=FreeSerif --outputbase=/tmp/font_tmp.rSFglUi6Dq/sample_text.txt --text=/tmp/font_tmp.rSFglUi6Dq/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.rSFglUi6Dq
Rendered page 0 to file /tmp/font_tmp.rSFglUi6Dq/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using FreeSerif
[Sat May 19 04:20:02 UTC 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.rSFglUi6Dq --fonts_dir=../.fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.RsxSMQxxED/eng/eng.FreeSerif.exp0 --max_pages=0 --ptsize=12 --font=FreeSerif --text=../langdata/eng/eng.training_text
Rendered page 0 to file /tmp/tmp.RsxSMQxxED/eng/eng.FreeSerif.exp0.tif
Rendered page 1 to file /tmp/tmp.RsxSMQxxED/eng/eng.FreeSerif.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[Sat May 19 04:20:04 UTC 2018] /usr/local/bin/unicharset_extractor --output_unicharset /tmp/tmp.RsxSMQxxED/eng/eng.unicharset --norm_mode 1 /tmp/tmp.RsxSMQxxED/eng/eng.FreeSerif.exp0.box
Extracting unicharset from box file /tmp/tmp.RsxSMQxxED/eng/eng.FreeSerif.exp0.box
Other case É of é is not in unicharset
Wrote unicharset file /tmp/tmp.RsxSMQxxED/eng/eng.unicharset
[Sat May 19 04:20:04 UTC 2018] /usr/local/bin/set_unicharset_properties -U /tmp/tmp.RsxSMQxxED/eng/eng.unicharset -O /tmp/tmp.RsxSMQxxED/eng/eng.unicharset -X /tmp/tmp.RsxSMQxxED/eng/eng.xheights --script_dir=../langdata
Loaded unicharset of size 111 from file /tmp/tmp.RsxSMQxxED/eng/eng.unicharset
Setting unichar properties
Other case É of é is not in unicharset
Setting script properties
Warning: properties incomplete for index 25 = ~
Writing unicharset to file /tmp/tmp.RsxSMQxxED/eng/eng.unicharset

=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=../tessdata
[Sat May 19 04:20:04 UTC 2018] /usr/local/bin/tesseract /tmp/tmp.RsxSMQxxED/eng/eng.FreeSerif.exp0.tif /tmp/tmp.RsxSMQxxED/eng/eng.FreeSerif.exp0 lstm.train
Tesseract Open Source OCR Engine v4.0.0-beta.1-232-g45a6 with Leptonica
Page 1
Page 2
Loaded 49/49 pages (1-49) of document /tmp/tmp.RsxSMQxxED/eng/eng.FreeSerif.exp0.lstmf

=== Constructing LSTM training data ===
[Sat May 19 04:20:07 UTC 2018] /usr/local/bin/combine_lang_model --input_unicharset /tmp/tmp.RsxSMQxxED/eng/eng.unicharset --script_dir ../langdata --words ../langdata/eng/eng.wordlist --numbers ../langdata/eng/eng.numbers --puncs ../langdata/eng/eng.punc --output_dir ./finetune_train_eng --lang eng
Loaded unicharset of size 111 from file /tmp/tmp.RsxSMQxxED/eng/eng.unicharset
Setting unichar properties
Other case É of é is not in unicharset
Setting script properties
Config file is optional, continuing...
Failed to read data from: ../langdata/eng/eng.config
Null char=2
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Moving /tmp/tmp.RsxSMQxxED/eng/eng.FreeSerif.exp0.box to ./finetune_train_eng
Moving /tmp/tmp.RsxSMQxxED/eng/eng.FreeSerif.exp0.tif to ./finetune_train_eng
Moving /tmp/tmp.RsxSMQxxED/eng/eng.FreeSerif.exp0.lstmf to ./finetune_train_eng
Created starter traineddata for language 'eng'
Run lstmtraining to do the LSTM training for language 'eng'
+ echo '#### combine_tessdata to extract lstm model from '\''tessdata_best'\'' for eng ####'
#### combine_tessdata to extract lstm model from 'tessdata_best' for eng ####
+ combine_tessdata -u ../tessdata_best/eng.traineddata ../tessdata_best/eng.
Extracting tessdata components from ../tessdata_best/eng.traineddata
Wrote ../tessdata_best/eng.lstm
Wrote ../tessdata_best/eng.lstm-punc-dawg
Wrote ../tessdata_best/eng.lstm-word-dawg
Wrote ../tessdata_best/eng.lstm-number-dawg
Wrote ../tessdata_best/eng.lstm-unicharset
Wrote ../tessdata_best/eng.lstm-recoder
Wrote ../tessdata_best/eng.version
Version string:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
17:lstm:size=11689099, offset=192
18:lstm-punc-dawg:size=4322, offset=11689291
19:lstm-word-dawg:size=3694794, offset=11693613
20:lstm-number-dawg:size=4738, offset=15388407
21:lstm-unicharset:size=6360, offset=15393145
22:lstm-recoder:size=1012, offset=15399505
23:version:size=80, offset=15400517
+ '[' yes = yes ']'
+ echo '###### MAKING EVAL DATA ######'
###### MAKING EVAL DATA ######
+ rm -rf ./finetune_eval_eng
+ mkdir ./finetune_eval_eng
+ eval bash ../tesseract/src/training/tesstrain.sh --fonts_dir ../.fonts --fontlist ''\''Arial'\''' --lang eng --linedata_only --noextract_font_properties --langdata_dir ../langdata --tessdata_dir ../tessdata --training_text ../langdata/eng/eng.training_text --output_dir ./finetune_eval_eng
++ bash ../tesseract/src/training/tesstrain.sh --fonts_dir ../.fonts --fontlist Arial --lang eng --linedata_only --noextract_font_properties --langdata_dir ../langdata --tessdata_dir ../tessdata --training_text ../langdata/eng/eng.training_text --output_dir ./finetune_eval_eng

=== Starting training for language 'eng'
[Sat May 19 04:20:17 UTC 2018] /usr/local/bin/text2image --fonts_dir=../.fonts --font=Arial --outputbase=/tmp/font_tmp.2U3WwAANTl/sample_text.txt --text=/tmp/font_tmp.2U3WwAANTl/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.2U3WwAANTl
Rendered page 0 to file /tmp/font_tmp.2U3WwAANTl/sample_text.txt.tif

=== Phase I: Generating training images ===
Rendering using Arial
[Sat May 19 04:20:19 UTC 2018] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.2U3WwAANTl --fonts_dir=../.fonts --strip_unrenderable_words --leading=32 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/tmp.nOUY5Wx7C3/eng/eng.Arial.exp0 --max_pages=0 --ptsize=12 --font=Arial --text=../langdata/eng/eng.training_text
Rendered page 0 to file /tmp/tmp.nOUY5Wx7C3/eng/eng.Arial.exp0.tif
Rendered page 1 to file /tmp/tmp.nOUY5Wx7C3/eng/eng.Arial.exp0.tif

=== Phase UP: Generating unicharset and unichar properties files ===
[Sat May 19 04:20:21 UTC 2018] /usr/local/bin/unicharset_extractor --output_unicharset /tmp/tmp.nOUY5Wx7C3/eng/eng.unicharset --norm_mode 1 /tmp/tmp.nOUY5Wx7C3/eng/eng.Arial.exp0.box
Extracting unicharset from box file /tmp/tmp.nOUY5Wx7C3/eng/eng.Arial.exp0.box
Other case É of é is not in unicharset
Wrote unicharset file /tmp/tmp.nOUY5Wx7C3/eng/eng.unicharset
[Sat May 19 04:20:21 UTC 2018] /usr/local/bin/set_unicharset_properties -U /tmp/tmp.nOUY5Wx7C3/eng/eng.unicharset -O /tmp/tmp.nOUY5Wx7C3/eng/eng.unicharset -X /tmp/tmp.nOUY5Wx7C3/eng/eng.xheights --script_dir=../langdata
Loaded unicharset of size 111 from file /tmp/tmp.nOUY5Wx7C3/eng/eng.unicharset
Setting unichar properties
Other case É of é is not in unicharset
Setting script properties
Warning: properties incomplete for index 25 = ~
Writing unicharset to file /tmp/tmp.nOUY5Wx7C3/eng/eng.unicharset

=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=../tessdata
[Sat May 19 04:20:21 UTC 2018] /usr/local/bin/tesseract /tmp/tmp.nOUY5Wx7C3/eng/eng.Arial.exp0.tif /tmp/tmp.nOUY5Wx7C3/eng/eng.Arial.exp0 lstm.train
Tesseract Open Source OCR Engine v4.0.0-beta.1-232-g45a6 with Leptonica
Page 1
Page 2
Loaded 52/52 pages (1-52) of document /tmp/tmp.nOUY5Wx7C3/eng/eng.Arial.exp0.lstmf

=== Constructing LSTM training data ===
[Sat May 19 04:20:24 UTC 2018] /usr/local/bin/combine_lang_model --input_unicharset /tmp/tmp.nOUY5Wx7C3/eng/eng.unicharset --script_dir ../langdata --words ../langdata/eng/eng.wordlist --numbers ../langdata/eng/eng.numbers --puncs ../langdata/eng/eng.punc --output_dir ./finetune_eval_eng --lang eng
Loaded unicharset of size 111 from file /tmp/tmp.nOUY5Wx7C3/eng/eng.unicharset
Setting unichar properties
Other case É of é is not in unicharset
Setting script properties
Config file is optional, continuing...
Failed to read data from: ../langdata/eng/eng.config
Null char=2
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Reducing Trie to SquishedDawg
Moving /tmp/tmp.nOUY5Wx7C3/eng/eng.Arial.exp0.box to ./finetune_eval_eng
Moving /tmp/tmp.nOUY5Wx7C3/eng/eng.Arial.exp0.tif to ./finetune_eval_eng
Moving /tmp/tmp.nOUY5Wx7C3/eng/eng.Arial.exp0.lstmf to ./finetune_eval_eng
Created starter traineddata for language 'eng'
Run lstmtraining to do the LSTM training for language 'eng'
+ '[' yes = yes ']'
+ echo '#### finetune training from ../tessdata_best/eng.traineddata #####'
#### finetune training from ../tessdata_best/eng.traineddata #####
+ rm -rf ./finetune_trained_eng-from-eng
+ mkdir -p ./finetune_trained_eng-from-eng
+ lstmtraining --continue_from ../tessdata_best/eng.lstm --traineddata ../tessdata_best/eng.traineddata --max_iterations 400 --debug_interval 0 --train_listfile ./finetune_train_eng/eng.training_files.txt --model_output ./finetune_trained_eng-from-eng/finetune
Loaded file ../tessdata_best/eng.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from ../tessdata_best/eng.lstm
Loaded 72/72 pages (1-72) of document ./finetune_train_eng/eng.FreeSerif.exp0.lstmf
2 Percent improvement time=5, best error was 100 @ 0
At iteration 5/100/100, Mean rms=0.198%, delta=0.04%, char train=0.109%, word train=0.211%, skip ratio=0%, New best char error = 0.109
Transitioned to stage 1
wrote best model:./finetune_trained_eng-from-eng/finetune0.109_5.checkpoint
wrote checkpoint.
2 Percent improvement time=5, best error was 100 @ 0
At iteration 5/200/200, Mean rms=0.17%, delta=0.02%, char train=0.055%, word train=0.105%, skip ratio=0%, New best char error = 0.055
wrote best model:./finetune_trained_eng-from-eng/finetune0.055_5.checkpoint
wrote checkpoint.
2 Percent improvement time=5, best error was 100 @ 0
At iteration 5/300/300, Mean rms=0.153%, delta=0.013%, char train=0.036%, word train=0.07%, skip ratio=0%, New best char error = 0.036
wrote best model:./finetune_trained_eng-from-eng/finetune0.036_5.checkpoint
wrote checkpoint.
2 Percent improvement time=5, best error was 100 @ 0
At iteration 5/400/400, Mean rms=0.142%, delta=0.01%, char train=0.027%, word train=0.053%, skip ratio=0%, New best char error = 0.027
wrote best model:./finetune_trained_eng-from-eng/finetune0.027_5.checkpoint
wrote checkpoint.
Finished! Error rate = 0.027
+ echo '#### Building final trained file ####'
#### Building final trained file ####
+ echo '#### stop training ####'
#### stop training ####
+ lstmtraining --stop_training --continue_from ./finetune_trained_eng-from-eng/finetune_checkpoint --traineddata ../tessdata_best/eng.traineddata --model_output ./finetune_trained_eng-from-eng/eng-finetune.traineddata
Loaded file ./finetune_trained_eng-from-eng/finetune_checkpoint, unpacking...
+ echo '#### eval files with./finetune_train_eng/finetune.traineddata ####'
#### eval files with./finetune_train_eng/finetune.traineddata ####
+ lstmeval --verbosity 0 --model ./finetune_trained_eng-from-eng/eng-finetune.traineddata --eval_listfile ./finetune_eval_eng/eng.training_files.txt
Loaded 72/72 pages (1-72) of document ./finetune_eval_eng/eng.Arial.exp0.lstmf
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 0, stage 0, Eval Char error rate=0.26994052, Word error rate=0.5713608
ubuntu@tesseract-ocr:~/tess4training$
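If you want to try the traineddata produced by a run like the one above outside of lstmeval, copy it into a tessdata directory and select it with -l; a minimal sketch (sample.png and sample_out are placeholder names, the other paths follow the log):

  cp ./finetune_trained_eng-from-eng/eng-finetune.traineddata ../tessdata/
  tesseract sample.png sample_out --tessdata-dir ../tessdata -l eng-finetune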
#!/bin/bash
# original script by J Klein <jetm...@gmail.com> - https://pastebin.com/gNLvXkiM

################################################################
# variables to set tasks performed
MakeTraining=yes
MakeEval=yes
RunTraining=yes

################################################################
# Language
Lang=eng
Continue_from_lang=eng

# directory with the old 'best' script training set to continue from, eg. Arabic, Latin, Devanagari
#bestdata_dir=../tessdata_best/script

# directory with the old 'best' language training set to continue from, eg. ara, eng, san
bestdata_dir=../tessdata_best

# tessdata dir which has osd.traineddata, eng.traineddata, the configs and tessconfigs folders and pdf.ttf
tessdata_dir=../tessdata

# directory with training scripts - tesstrain.sh etc.
tesstrain_dir=../tesseract/src/training

# downloaded directory with language data
langdata_dir=../langdata

# fonts directory for this system
fonts_dir=../.fonts

# fonts to use for training - a minimal set for testing
fonts_for_training=" \
  'FreeSerif' \
"

# fonts for computing evals of best fit model
fonts_for_eval=" \
  'Arial' \
"

# output directories for this run
train_output_dir=./finetune_train_$Continue_from_lang
eval_output_dir=./finetune_eval_$Continue_from_lang
trained_output_dir=./finetune_trained_$Lang-from-$Continue_from_lang

# fatal bug workaround for pango
#export PANGOCAIRO_BACKEND=fc

if [ $MakeTraining = "yes" ]; then
    echo "###### MAKING TRAINING DATA ######"
    rm -rf $train_output_dir
    mkdir $train_output_dir

    echo "#### run tesstrain.sh ####"
    # the EVAL handles the quotes in the font list
    eval bash $tesstrain_dir/tesstrain.sh \
        --lang $Lang \
        --linedata_only \
        --noextract_font_properties \
        --exposures "0" \
        --fonts_dir $fonts_dir \
        --fontlist $fonts_for_training \
        --langdata_dir $langdata_dir \
        --tessdata_dir $tessdata_dir \
        --training_text $langdata_dir/$Lang/$Lang.training_text \
        --output_dir $train_output_dir

    echo "#### combine_tessdata to extract lstm model from 'tessdata_best' for $Continue_from_lang ####"
    combine_tessdata -u $bestdata_dir/$Continue_from_lang.traineddata \
        $bestdata_dir/$Continue_from_lang.
fi

# at this point, $train_output_dir should have $Lang.FontX.exp0.lstmf
# and $Lang.training_files.txt

# eval data
if [ $MakeEval = "yes" ]; then
    echo "###### MAKING EVAL DATA ######"
    rm -rf $eval_output_dir
    mkdir $eval_output_dir

    eval bash $tesstrain_dir/tesstrain.sh \
        --fonts_dir $fonts_dir \
        --fontlist $fonts_for_eval \
        --lang $Lang \
        --linedata_only \
        --noextract_font_properties \
        --langdata_dir $langdata_dir \
        --tessdata_dir $tessdata_dir \
        --training_text $langdata_dir/$Lang/$Lang.training_text \
        --output_dir $eval_output_dir
fi

# at this point, $eval_output_dir should have similar files as
# $train_output_dir but for a different font set

if [ $RunTraining = "yes" ]; then
    echo "#### finetune training from $bestdata_dir/$Continue_from_lang.traineddata #####"
    rm -rf $trained_output_dir
    mkdir -p $trained_output_dir

    lstmtraining \
        --continue_from $bestdata_dir/$Continue_from_lang.lstm \
        --traineddata $bestdata_dir/$Continue_from_lang.traineddata \
        --max_iterations 400 \
        --debug_interval 0 \
        --train_listfile $train_output_dir/$Lang.training_files.txt \
        --model_output $trained_output_dir/finetune

    echo "#### Building final trained file ####"
    echo "#### stop training ####"
    lstmtraining \
        --stop_training \
        --continue_from $trained_output_dir/finetune_checkpoint \
        --traineddata $bestdata_dir/$Continue_from_lang.traineddata \
        --model_output $trained_output_dir/$Lang-finetune.traineddata

    echo "#### eval files with $trained_output_dir/$Lang-finetune.traineddata ####"
    lstmeval \
        --verbosity 0 \
        --model $trained_output_dir/$Lang-finetune.traineddata \
        --eval_listfile $eval_output_dir/$Lang.training_files.txt
fi

# now $trained_output_dir/$Lang-finetune.traineddata can be substituted for the installed traineddata
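The two comment blocks inside the script describe what should exist after the data-generation phases. A quick way to sanity-check those intermediate outputs before the training phase runs, using the default eng paths from the script (these ls/cat commands are just illustrative checks, not part of the script):

  ls ./finetune_train_eng/eng.FreeSerif.exp0.lstmf     # one .lstmf file per training font
  cat ./finetune_train_eng/eng.training_files.txt      # the list file handed to lstmtraining
  ls ../tessdata_best/eng.lstm                         # the lstm model unpacked by combine_tessdata -u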
#!/bin/bash
# based on bash-script by J Klein <jetm...@gmail.com> - https://pastebin.com/gNLvXkiM

################################################################
# variables to set tasks performed
MakeTraining=yes
MakeEval=yes
RunTraining=yes

################################################################
# Language
Lang=fas
Continue_from_lang=fas

# directory with the old 'best' training set
#bestdata_dir=../tessdata_best/script
bestdata_dir=../tessdata_best

# tessdata directory for config files
tessdata_dir=../tessdata

# directory with training scripts - tesstrain.sh etc.
# this is not the usual place, because they are not installed by default
tesstrain_dir=../tesseract/src/training

# downloaded directory with language data
langdata_dir=../langdata

# fonts directory for this system
fonts_dir=../.fonts

# fonts to use for training - a minimal set for fast tests
fonts_for_training=" \
  'Iranian Sans' \
  'Sahel' \
  'IranNastaliq-Web' \
  'Nesf2' \
  'B Koodak Bold' \
  'B Lotus' \
  'B Lotus Bold' \
  'B Nazanin' \
  'B Nazanin Bold' \
  'B Titr Bold' \
  'B Yagut' \
  'B Yagut Bold' \
  'B Yekan' \
  'B Zar' \
  'B Zar Bold' \
  'Arial Unicode MS' \
  'Tahoma' \
"

# fonts for computing evals of best fit model
fonts_for_eval=" \
  'B Nazanin' \
  'B Yagut' \
  'B Zar' \
"

# output directories for this run
train_output_dir=./plus_train_$Lang
eval_output_dir=./plus_eval_$Lang
trained_output_dir=./plus_trained_$Lang-from-$Continue_from_lang

# fatal bug workaround for pango
#export PANGOCAIRO_BACKEND=fc

if [ $MakeTraining = "yes" ]; then
    echo "###### MAKING TRAINING DATA ######"
    rm -rf $train_output_dir
    mkdir $train_output_dir

    echo "#### run tesstrain.sh ####"
    # the EVAL handles the quotes in the font list
    eval bash $tesstrain_dir/tesstrain.sh \
        --lang $Lang \
        --linedata_only \
        --noextract_font_properties \
        --exposures "0" \
        --fonts_dir $fonts_dir \
        --fontlist $fonts_for_training \
        --langdata_dir $langdata_dir \
        --tessdata_dir $tessdata_dir \
        --training_text $langdata_dir/$Lang/$Lang.training_text \
        --output_dir $train_output_dir

    echo "#### combine_tessdata to extract lstm model from 'tessdata_best' for $Continue_from_lang ####"
    combine_tessdata -u $bestdata_dir/$Continue_from_lang.traineddata \
        $bestdata_dir/$Continue_from_lang.
    combine_tessdata -u $tessdata_dir/$Lang.traineddata $tessdata_dir/$Lang.

    echo "#### build version string ####"
    Version_Str="$Lang:plus`date +%Y%m%d`:from:"
    sed -e "s/^/$Version_Str/" $bestdata_dir/$Continue_from_lang.version > $train_output_dir/$Lang.new.version

    echo "#### merge unicharsets to ensure all existing chars are included ####"
    merge_unicharsets \
        $bestdata_dir/$Continue_from_lang.lstm-unicharset \
        $train_output_dir/$Lang/$Lang.unicharset \
        $train_output_dir/$Lang.merged.unicharset
fi

# at this point, $train_output_dir should have $Lang.FontX.exp0.lstmf
# and $Lang.training_files.txt

# eval data
if [ $MakeEval = "yes" ]; then
    echo "###### MAKING EVAL DATA ######"
    rm -rf $eval_output_dir
    mkdir $eval_output_dir

    eval bash $tesstrain_dir/tesstrain.sh \
        --fonts_dir $fonts_dir \
        --fontlist $fonts_for_eval \
        --lang $Lang \
        --linedata_only \
        --noextract_font_properties \
        --langdata_dir $langdata_dir \
        --tessdata_dir $tessdata_dir \
        --training_text $langdata_dir/$Lang/$Lang.training_text \
        --output_dir $eval_output_dir
fi

# at this point, $eval_output_dir should have similar files as
# $train_output_dir but for a different font set

if [ $RunTraining = "yes" ]; then
    echo "#### rebuild starter traineddata ####"
    # change these flags based on language
    # --lang_is_rtl True \
    # --pass_through_recoder True \
    combine_lang_model \
        --input_unicharset $train_output_dir/$Lang.merged.unicharset \
        --script_dir $langdata_dir \
        --words $langdata_dir/$Lang/$Lang.wordlist \
        --numbers $langdata_dir/$Lang/$Lang.numbers \
        --puncs $langdata_dir/$Lang/$Lang.punc \
        --output_dir $train_output_dir \
        --pass_through_recoder \
        --lang_is_rtl \
        --lang $Lang \
        --version_str `cat $train_output_dir/$Lang.new.version`

    echo "#### SHREE plus training from $bestdata_dir/$Continue_from_lang.traineddata #####"
    rm -rf $trained_output_dir
    mkdir -p $trained_output_dir

    lstmtraining \
        --continue_from $bestdata_dir/$Continue_from_lang.lstm \
        --old_traineddata $bestdata_dir/$Continue_from_lang.traineddata \
        --traineddata $train_output_dir/$Lang/$Lang.traineddata \
        --max_iterations 7000 \
        --debug_interval 0 \
        --train_listfile $train_output_dir/$Lang.training_files.txt \
        --model_output $trained_output_dir/plus

    echo "#### Building final trained file ####"
    echo "#### stop training ####"
    lstmtraining \
        --stop_training \
        --continue_from $trained_output_dir/plus_checkpoint \
        --old_traineddata $bestdata_dir/$Continue_from_lang.traineddata \
        --traineddata $train_output_dir/$Lang/$Lang.traineddata \
        --model_output $trained_output_dir/$Lang-plus-float.traineddata

    cp $trained_output_dir/$Lang-plus-float.traineddata ../tessdata_best/

    echo -e "\n#### eval files with $trained_output_dir/$Lang-plus-float.traineddata ####"
    lstmeval \
        --verbosity 0 \
        --model $trained_output_dir/$Lang-plus-float.traineddata \
        --eval_listfile $eval_output_dir/$Lang.training_files.txt
fi

# now $trained_output_dir/$Lang-plus-float.traineddata can be substituted for the installed traineddata