I am working with Tesseract OCR and want to experiment with different 
binarization methods, such as Otsu's thresholding and other custom filters, 
to improve text recognition accuracy.

However, I am concerned that training with these different preprocessing 
techniques might modify or overwrite eng.traineddata, which I want to keep 
intact.

*My questions are:*
1. Does training a new model affect the existing eng.traineddata file?
2. How can I safely train Tesseract with new filters without modifying
   the default English model?
3. Is there a recommended approach to train Tesseract on preprocessed
   images while keeping eng.traineddata unchanged?
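One way to check the first question empirically is to hash the model file before and after a training run. The following is only a minimal sketch: the helper names `sha256_of` and `unchanged_after` are hypothetical, and `action` stands for whatever callable launches the training step.

```python
import hashlib


def sha256_of(path):
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large model files are not loaded at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unchanged_after(path, action):
    """Run `action()` and report whether the file at `path` kept its hash.

    `action` is a hypothetical zero-argument callable, for example a
    function that starts the training command.
    """
    before = sha256_of(path)
    action()
    return sha256_of(path) == before
```

Calling `unchanged_after` on /home/j/trainingCurrentEng/data/eng.traineddata around a training run would return `True` if the file is byte-for-byte identical afterwards. Keeping a backup copy of the file outside the training directory is an additional, independent safeguard.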

*What I've tried:*

I updated my current eng_new.traineddata with three samples, each with one 
filter applied: Otsu, Otsu_Tresh_Binary, and Otsu_Tresh_Binary_Inv. After the 
first 1000 iterations there was a difference between the initial and the 
target traineddata, but the target traineddata gave slightly worse results.
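For reference, Otsu binarization and its inverted variant can be sketched in plain Python as follows. This is an illustration only, not the preprocessing actually used for the samples above: the function names `otsu_threshold` and `binarize` are hypothetical, and the mapping of the filter names to a normal and an inverted binary output is an assumption based on the names alone.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold t for 8-bit grayscale values.

    Values <= t form one class and values > t the other; t is chosen to
    maximize the between-class variance.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of pixel values <= t
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, invert=False):
    """Map pixels above the Otsu threshold to 255 (or to 0 if inverted)."""
    t = otsu_threshold(pixels)
    high, low = (0, 255) if invert else (255, 0)
    return [high if p > t else low for p in pixels]
```

For the values `[10, 12, 11, 200, 205, 198]` the threshold comes out as 12, so `binarize` yields `[0, 0, 0, 255, 255, 255]` and `binarize(..., invert=True)` yields the opposite polarity. Polarity matters here because the normal and inverted outputs differ in whether the text ends up dark on light or light on dark.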

Training command:

    lstmtraining \
      --continue_from /home/j/trainingCurrentEng/data/checkpoints/eng_trained \
      --traineddata /home/j/trainingCurrentEng/data/eng.traineddata \
      --train_listfile /home/j/trainingCurrentEng/data/list.train \
      --eval_listfile /home/j/trainingCurrentEng/data/list.eval \
      --model_output /home/j/trainingCurrentEng/data/checkpoints/eng_trained \
      --learning_rate 0.0001 \
      --debug_interval 10 \
      --max_iterations 600

Recognition with -l eng:

    tesseract otsu_tresh_binary_inv.tiff output_text -l eng \
      --tessdata-dir /home/j/trainingCurrentEng/data --psm 7
    cat output_text.txt
    Abcd123

Recognition with -l eng_trained:

    tesseract otsu_tresh_binary_inv.tiff output_text_1 -l eng_trained \
      --tessdata-dir /home/j/trainingCurrentEng/data --psm 7
    cat output_text_1.txt
    Abc

I would appreciate any guidance or best practices for training custom 
models without interfering with existing ones.
