[tesseract-ocr] How to generate .lstmf file with non-randomized lines

Ben Bongalon Mon, 04 Jan 2021 23:44:41 -0800

Hello and Happy New Year,

I am training Tesseract 4 to recognize special characters in a Philippine 
bilingual dictionary (specifically Hanunoo -> English). Following the "Fine 
Tuning" tutorial but using Spanish as the starting model, I am getting good 
recognition accuracy on some characters, such as eng ("ŋ"), but not on others.


To improve this, I plan to experiment with feeding it various combinations 
of input training data sampled from the source dictionary. However, I 
noticed that when *tesseract* generates an .lstmf file, it randomly picks 
lines from the training file. That is, the following command 

$ tesseract <my-TIF-file> <lstmf-name> --psm 6 lstm.train

produces a different .lstmf file when called again with the same input TIF 
file. This makes it harder to tell whether a performance difference is due 
to the quality of the training data itself or is simply statistical 
variation resulting from how tesseract happened to choose the data for the 
.lstmf file.
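
For reference, here is a minimal sketch of how the difference can be checked. 
It runs the same command twice on one TIF file and compares the two outputs 
byte for byte. The function names, the temporary directory, and the "run1" 
and "run2" output names are placeholders made up for illustration, and the 
sketch assumes tesseract is on the PATH and that the .box file sits next to 
the TIF. Note that a byte-level difference only shows that the two files are 
not identical; it does not by itself show which lines were picked.

```shell
# Succeed (exit status 0) only when the two files are byte-for-byte identical.
files_identical() {
    cmp -s "$1" "$2"
}

# Generate the .lstmf twice from the same TIF and report whether they match.
# The body runs in a subshell, so "exit" only leaves this function.
check_lstmf_reproducible() (
    tif="$1"
    workdir="$(mktemp -d)" || exit 2

    # Same invocation as in this post, with two different output base names.
    tesseract "$tif" "$workdir/run1" --psm 6 lstm.train || exit 2
    tesseract "$tif" "$workdir/run2" --psm 6 lstm.train || exit 2

    if files_identical "$workdir/run1.lstmf" "$workdir/run2.lstmf"; then
        echo "identical"
    else
        echo "different"
    fi
)
```

Calling check_lstmf_reproducible <my-TIF-file> prints "identical" or 
"different" for that pair of runs.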

My questions:
1. How can I force tesseract not to randomize the data when generating an 
.lstmf file?
2. Is there anything I can do to minimize the effect of the randomized 
.lstmf data?

Thanks in advance.

