[tesseract-ocr] Training with a large number of LSTMF files

ProgressNotPerfection Tue, 11 Sep 2018 05:57:46 -0700

Hi Tesseract Group,
I am trying to train Tesseract to recognize handwritten characters and have 
prepared several thousand lstmf files (from tif/box sets) so I can finetune 
the best trained eng.traineddata. I read elsewhere on this forum that a low 
number of iterations (say 300 - 400) is recommended when finetuning to 
avoid overfitting. In my case, though, it appears that if I choose a low 
number of iterations, only (approximately) that number of lstmf files get 
loaded by the training process. I assumed that each iteration would be a 
training pass over all the lstmf files. Below is my script (which assumes 
my lstmf files are ready in trained_output_dir). How should I amend this so 
that it loads all my lstmf files? Should the number of iterations be 
greater than the number of lstmf files? Or is there a maximum number of 
lstmf files that can be used for training at once?


Any help would be much appreciated
Thanks

#! /bin/bash
#####################################################
# Script to finetune a language traineddata file for a set of
# pre built lstmf files and a starter traineddata
# for tesseract4.0.0-beta
# Modify directory paths and filenames as required for your setup.
#####################################################

Lang=eng
bestdata_dir=~/tesseract-ocr/tessdata_best
tesstrain_dir=~/tesseract-ocr/src/training
trained_output_dir=~/tesseract-ocr/src/training/eng-finetune-impact

echo "###### EXTRACT BEST LSTM MODEL ######"
combine_tessdata -e $bestdata_dir/$Lang.traineddata $bestdata_dir/$Lang.lstm

echo "###### LSTM TRAINING ######"
echo "#### running lstmtraining for finetuning from $bestdata_dir/$Lang.traineddata #####"

lstmtraining \
--continue_from  $bestdata_dir/$Lang.lstm \
--net_spec '[1,49,0,1 Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c78]' \
--old_traineddata  $bestdata_dir/$Lang.traineddata \
--traineddata    $trained_output_dir/$Lang/$Lang.traineddata \
--max_iterations 400 \
--debug_interval 0 \
--train_listfile $trained_output_dir/$Lang.training_files.txt \
--model_output  $trained_output_dir/finetune

echo "###### BUILD FINETUNED MODEL ######"
echo "#### Building final trained file $Lang-finetune-$Lang.traineddata ####"
lstmtraining \
--stop_training \
--continue_from $trained_output_dir/finetune_checkpoint \
--old_traineddata  $bestdata_dir/$Lang.traineddata \
--traineddata    $trained_output_dir/$Lang/$Lang.traineddata \
--model_output "$trained_output_dir/$Lang-finetune-$Lang.traineddata"
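
For reference, here is a rough sketch of how an iteration budget could be derived from the training list instead of hard-coding 400. It rests on an assumption that matches the behaviour described above but is not confirmed in this post: that one iteration consumes one training sample rather than a full pass over the data, and that each listed lstmf file holds a single sample. The helper name iterations_for_passes is hypothetical and not part of Tesseract.

```shell
#!/bin/bash
# Hypothetical helper (not part of Tesseract): derive an iteration budget
# from the number of entries in a training list file.
# Assumption: one iteration consumes one training sample, and each lstmf
# file in the list holds one sample (one text line).
iterations_for_passes() {
  local listfile=$1   # file with one lstmf path per line
  local passes=$2     # how many times each file should be seen
  local count
  # Count non-empty lines, i.e. the listed lstmf files.
  count=$(grep -c . "$listfile")
  echo $(( count * passes ))
}

# Example usage (variables as defined in the script above):
# max_iter=$(iterations_for_passes "$trained_output_dir/$Lang.training_files.txt" 1)
# ...then pass: --max_iterations "$max_iter"
```

Under that assumption, a list of 5000 files and a single pass would give 5000 iterations. Whether that many iterations risks overfitting when finetuning is a separate question that this sketch does not answer.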



