[tesseract-ocr] Traineddata distorted and provides bad read, last trained sample is as usual

Mitya Sat, 05 Apr 2025 07:03:55 -0700

I've trained on 8000 samples with the set of commands below:


echo "~/source/source.lstmf" > /home/j/trainingCurrentEng/data/list.eval
echo "~/source/source.lstmf" > /home/j/trainingCurrentEng/data/list.train
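
Two shell details in the `echo` lines above may be worth double-checking: a `~` inside double quotes is not expanded, so the list file receives a literal `~/source/source.lstmf`, and `>` overwrites the file, so each list holds a single entry. A minimal sketch of writing a list with absolute paths and one line per sample follows. `build_list` and the throwaway directory with `sample1.lstmf` / `sample2.lstmf` are hypothetical names used only for illustration, not files from this setup.

```shell
#!/bin/sh
# A quoted tilde stays literal: this prints ~/source/source.lstmf unchanged.
echo "~/source/source.lstmf"

# build_list DIR (hypothetical helper): print the absolute path of every
# .lstmf file under DIR, one per line, sorted.
build_list() {
    find "$(cd "$1" && pwd)" -type f -name '*.lstmf' | sort
}

# Demo on a throwaway directory with made-up sample names.
demo_dir=$(mktemp -d)
touch "$demo_dir/sample1.lstmf" "$demo_dir/sample2.lstmf"
build_list "$demo_dir"
rm -r "$demo_dir"
```

With real data the output would be redirected once, for example `build_list "$HOME/source" > /home/j/trainingCurrentEng/data/list.train`. Whether this alone explains the bad reads is not established here.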


lstmtraining \
  --continue_from /home/j/trainingCurrentEng/data/checkpoints/eng_trained_checkpoint \
  --traineddata /home/j/trainingCurrentEng/data/eng.traineddata \
  --train_listfile /home/j/trainingCurrentEng/data/list.train \
  --eval_listfile /home/j/trainingCurrentEng/data/list.eval \
  --model_output /home/j/trainingCurrentEng/data/checkpoints/eng_trained \
  --learning_rate 0.001 --debug_interval 10 --max_iterations 8000000





lstmtraining --stop_training \
  --continue_from /home/j/trainingCurrentEng/data/checkpoints/eng_trained_checkpoint \
  --traineddata /home/j/trainingCurrentEng/data/eng.traineddata \
  --model_output /home/j/trainingCurrentEng/data/eng_trained.traineddata

*IMPORTANT/RESULT:*
 tesseract source.tiff output_text -l eng \
   --tessdata-dir /home/j/trainingCurrentEng/data --psm 7
 cat output_text.txt

*abcdef*

 tesseract source.tiff output_text_1 -l eng_trained \
   --tessdata-dir /home/j/trainingCurrentEng/data --psm 7
 cat output_text_1.txt

*laldlfk*


*Question:*
The first command (stock eng model) gives the better result. After training on 
8000 samples, the eng_trained model is distorted and reads the image completely 
wrong.
However, if I read THE LAST sample that eng_trained was trained/updated on, it 
reads that exact data flawlessly.

What am I doing wrong? How do I fix the current commands?




*IMPORTANT*: I use images in the same color palette: black background, 
white (close to gray) font, without any masks applied.



j@j-Aspire-A515-58M:~/source$ ls
source.box  source.lstmf  source.tiff  source.txt  unicharset


Source Box:

a 251 52 355 178 0
b 356 51 444 176 0
c 446 22 530 175 0
d 534 22 622 173 0
e 626 60 766 174 0
f 768 59 870 173 0

source.txt
*abcdef*

unicharset
9
NULL 0 Common 0
Joined 7 0,255,0,255,0,0,0,0,0,0 Latin 1 0 1 Joined # Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 15 0,255,0,255,0,0,0,0,0,0 Common 2 10 2 |Broken|0|1    # Broken
a 3 0,255,0,255,0,0,0,0,0,0 Latin 3 0 3 a   # a [6f ]a
b 3 0,255,0,255,0,0,0,0,0,0 Latin 4 0 4 b   # b [65 ]a
c 3 0,255,0,255,0,0,0,0,0,0 Latin 5 0 5 c   # c [64 ]a
d 3 0,255,0,255,0,0,0,0,0,0 Latin 6 0 6 d   # d [6b ]a
e 3 0,255,0,255,0,0,0,0,0,0 Latin 7 0 7 e   # e [6d ]a
f 3 0,255,0,255,0,0,0,0,0,0 Latin 8 0 8 f   # f [63 ]a




