Hi everyone,

I have been experimenting with Tesseract for the Farsi language for a while. 
The performance of the default LSTM model is good, but I would like to know 
whether I can improve it further. I tried to train a model from scratch 
because I ran into some unichar errors.

Before I describe my training problem, I have some general questions that 
are not explained in the Tesseract paper or documentation, or at least I 
couldn't find the answers there.

*First question:*
I want to know more about the characteristics of the data that Tesseract 
was trained on. Are there any differences between the training data for 
Tesseract 5 and Tesseract 4? Is it just single text lines? Does it contain 
noise? Is there any connection or dependency between the words in each line?

*Second question:*
From what I found, the default batch size is 1. Does this mean that the 
Tesseract 5 models were trained with a batch size of 1? How can I change it?

*Third question:*
Since I didn't get high accuracy, I decided to fine-tune the fas model 
using the *START_MODEL* variable. But when I checked *lstmtraining --help*, 
I found a *continue_from* option, and now I am confused about which one I 
should use for fine-tuning.

*Fourth question:*
I am training Tesseract version 5.2.0 from scratch with about 40000 samples 
and 3 new fonts for the Farsi language. Although every step seems to be 
correct, I get a very high error rate: the BCER only goes from 99.76 to 
91.93 after 10000 iterations.
What is the reason behind this poor CER?

!export OMP_THREAD_LIMIT=16
!make training \
  START_MODEL=fas \
  MODEL_NAME=dori \
  LANG_TYPE=RTL \
  LANG_CODE=fas \
  TESSDATA=/usr/share/tesseract-ocr/5/tessdata \
  DATA_DIR=../data \
  MAX_ITERATIONS=10000

  
