Hi Tesseract Group,

I am trying to train Tesseract to recognize handwritten characters and have prepared several thousand lstmf files (from tif/box pairs) so I can fine-tune the tessdata_best eng.traineddata. I read elsewhere on this forum that a low number of iterations (say 300 - 400) is recommended when fine-tuning to avoid overfitting. In my case, though, it appears that if I choose a low number of iterations, only (approximately) that number of lstmf files get loaded by the training process. I had assumed that each iteration would be a training pass over all the lstmf files.

Below is my script (which assumes my lstmf files are already in trained_output_dir). How should I amend it so that it loads all my lstmf files? Should the number of iterations be greater than the number of lstmf files? Or is there a maximum number of lstmf files that can be used for training at once?
Any help would be much appreciated. Thanks.

#!/bin/bash
#####################################################
# Script to fine-tune a language traineddata file from a set of
# pre-built lstmf files and a starter traineddata,
# for tesseract 4.0.0-beta.
# Modify directory paths and filenames as required for your setup.
#####################################################
Lang=eng
bestdata_dir=~/tesseract-ocr/tessdata_best
tesstrain_dir=~/tesseract-ocr/src/training
trained_output_dir=~/tesseract-ocr/src/training/eng-finetune-impact

echo "###### EXTRACT BEST LSTM MODEL ######"
combine_tessdata -e $bestdata_dir/$Lang.traineddata $bestdata_dir/$Lang.lstm

echo "###### LSTM TRAINING ######"
echo "#### running lstmtraining for finetuning from $bestdata_dir/$Lang.traineddata #####"
lstmtraining \
  --continue_from $bestdata_dir/$Lang.lstm \
  --net_spec '[1,49,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c78]' \
  --old_traineddata $bestdata_dir/$Lang.traineddata \
  --traineddata $trained_output_dir/$Lang/$Lang.traineddata \
  --max_iterations 400 \
  --debug_interval 0 \
  --train_listfile $trained_output_dir/$Lang.training_files.txt \
  --model_output $trained_output_dir/finetune

echo "###### BUILD FINETUNED MODEL ######"
echo "#### Building final trained file $Lang-finetune-$Lang.traineddata ####"
lstmtraining \
  --stop_training \
  --continue_from $trained_output_dir/finetune_checkpoint \
  --old_traineddata $bestdata_dir/$Lang.traineddata \
  --traineddata $trained_output_dir/$Lang/$Lang.traineddata \
  --model_output "$trained_output_dir/$Lang-finetune-$Lang.traineddata"
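For reference, this is roughly how I generate the $Lang.training_files.txt list that the script reads via --train_listfile, together with the sanity check I was using while thinking about --max_iterations. It is just a sketch of my own setup; the idea that one iteration consumes a single sample (rather than a full pass over the list) is my guess based on the behaviour described above, not something I found in the docs:

#!/bin/bash
# Sketch: build the list of lstmf samples for --train_listfile.
# Assumes every .lstmf file sits directly in trained_output_dir.
Lang=eng
trained_output_dir=~/tesseract-ocr/src/training/eng-finetune-impact

ls -1 "$trained_output_dir"/*.lstmf > "$trained_output_dir/$Lang.training_files.txt"

# If one iteration really does consume a single sample, this is the minimum
# number of iterations needed just to see every file once.
num_samples=$(wc -l < "$trained_output_dir/$Lang.training_files.txt")
echo "lstmf samples listed: $num_samples"
echo "iterations needed for one full pass (if 1 iteration = 1 sample): $num_samples"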