Hi, I am trying to improve the Arabic language model. Unfortunately, the results are not good enough, and I have probably been going in the wrong direction, so I would like to hear ideas from the community.
*1. Training text:*
The training text <https://github.com/tesseract-ocr/langdata_lstm/blob/main/ara/ara.training_text> used for training Arabic is very small: 80 lines vs 193k for English (see langdata_lstm/issues/6 <https://github.com/tesseract-ocr/langdata_lstm/issues/6>), and it does not contain all characters. So I tried to prepare a larger dataset.
- Concerning Arabic diacritics <https://en.wikipedia.org/wiki/Arabic_diacritics>: should the training text contain all combinations of letters and diacritics? For example, if the training text doesn't have the combination بَ (letter ب + *fatha*), can the model recognise it after training?
- How can I regenerate the files ara.punc, ara.numbers, ara.wordlist, ara.config, ara.unicharset...?
- By the way, most of the files in https://github.com/tesseract-ocr/langdata_lstm have not changed in 5 years. Is the repository open to contributions? Is it possible to retrain some languages?

*2. Ground truth:*
- I used text2image to generate 1k text lines multiplied by ~20 fonts, for a total of 20k images. Is that enough?

*3. Box files:* (ref https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html)
- Should I use the box files generated by text2image, or the WordStr format, since it's a right-to-left language?

*4. Training:*
I started from the existing model (START_MODEL=ara). The number of iterations is 20k. Is that enough?

If you have any other suggestions or remarks, please share.

Thanks!
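
On the diacritics question, here is a minimal sketch of how coverage could be measured before training: it counts which letter + diacritic combinations actually occur in a training text and lists the ones that never appear. The code point ranges (basic letters U+0621 to U+063A and U+0641 to U+064A, marks U+064B to U+0652) and all function names are assumptions made for illustration; none of this comes from the Tesseract tooling, and extended letters or other marks are ignored.

```python
from collections import Counter

# Assumption: only the basic Arabic letters and the eight common
# tashkil marks (fathatan .. sukun) are considered.
LETTERS = [chr(cp) for cp in list(range(0x0621, 0x063B)) + list(range(0x0641, 0x064B))]
DIACRITICS = [chr(cp) for cp in range(0x064B, 0x0653)]
_LETTER_SET = set(LETTERS)
_DIACRITIC_SET = set(DIACRITICS)


def combo_counts(text):
    """Count (letter, diacritic) pairs found in text.

    Every mark that follows a letter is attributed to that letter,
    so a letter carrying shadda + fatha yields two pairs.
    """
    counts = Counter()
    base = None
    for ch in text:
        if ch in _LETTER_SET:
            base = ch
        elif ch in _DIACRITIC_SET:
            if base is not None:
                counts[(base, ch)] += 1
        else:
            base = None  # any other character ends the current cluster
    return counts


def missing_combos(text):
    """Return the (letter, diacritic) pairs that never occur in text."""
    seen = combo_counts(text)
    return [(l, d) for l in LETTERS for d in DIACRITICS if (l, d) not in seen]


def coverage_report(path):
    """Print a one-line coverage summary for a UTF-8 training text file."""
    with open(path, encoding="utf-8") as f:
        missing = missing_combos(f.read())
    total = len(LETTERS) * len(DIACRITICS)
    print(f"{total - len(missing)} of {total} combinations present")
    return missing
```

Calling `coverage_report("ara.training_text")` on a local copy of the file would show how sparse the coverage is. Whether the model can generalise to combinations it has never seen is exactly the open question above; this only quantifies the gap.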