Hi everyone, I have been playing with Tesseract for the Farsi language for a while. The performance of the default LSTM model is good, but I would like to know if I can improve it further. So I tried to train it from scratch, since I ran into some unichar errors.
Before I get to my problem with training, I have some general questions that are not explained in the Tesseract paper or documentation, or at least I couldn't find them.

*First question:* I would like to know more about the features of the data Tesseract was trained on. Are there any differences between the training data for Tesseract 5 and 4? Is it just single lines? Does it contain noise? Is there any connection or dependency between the words of each line?

*Second question:* From what I found, the default batch size is 1. Does that mean Tesseract 5 was trained with a batch size of 1? How can I change it?

*Third question:* Since I didn't get high accuracy, I decided to fine-tune the fas model using the *START_MODEL* variable. But when I checked *lstmtraining --help*, I found the *--continue_from* option, and now I am confused about which one I should use for fine-tuning (my current understanding is sketched after the command below).

*Fourth question:* I am training Tesseract 5.2.0 from scratch with about 40000 training samples and 3 new fonts for the Farsi language. Although every step seems correct, I get a high error rate: the BCER only drops from 99.76 to 91.93 after 10000 iterations. I would like to know the reason behind this poor CER. This is the command I run:

```
!export OMP_THREAD_LIMIT=16
!make training \
    START_MODEL=fas \
    MODEL_NAME=dori \
    LANG_TYPE=RTL \
    LANG_CODE=fas \
    TESSDATA=/usr/share/tesseract-ocr/5/tessdata \
    DATA_DIR=../data \
    MAX_ITERATIONS=10000
```
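Regarding the third question, here is my current understanding of how the two relate, in case someone can confirm it: tesstrain's `make training` with *START_MODEL* apparently extracts the LSTM component from the existing fas.traineddata and then calls lstmtraining with *--continue_from* behind the scenes. A rough sketch of doing the same by hand; the output paths, list files, and checkpoint name are just placeholders based on my setup above, not taken from the docs:

```
# Extract the LSTM model from the stock fas traineddata
combine_tessdata -e /usr/share/tesseract-ocr/5/tessdata/fas.traineddata fas.lstm

# Fine-tune from that extracted model (paths and list file are placeholders)
lstmtraining \
    --continue_from fas.lstm \
    --old_traineddata /usr/share/tesseract-ocr/5/tessdata/fas.traineddata \
    --traineddata ../data/dori/dori.traineddata \
    --model_output ../data/dori/checkpoints/dori \
    --train_listfile ../data/dori/list.train \
    --max_iterations 10000

# Measure the character error rate on a held-out list
lstmeval \
    --model ../data/dori/checkpoints/dori_checkpoint \
    --traineddata ../data/dori/dori.traineddata \
    --eval_listfile ../data/dori/list.eval
```

Is this equivalent to what *START_MODEL* does, or is *--continue_from* meant for something else, such as resuming an interrupted run from my own checkpoint?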