Hi,
I am trying to improve the model for the Arabic language. Unfortunately, the 
results are not good enough. I have probably been going in the wrong direction, 
so I would like to hear ideas from the community.

*1. Training text:*
The training text 
<https://github.com/tesseract-ocr/langdata_lstm/blob/main/ara/ara.training_text> 
used for training Arabic is very small: 80 lines vs. 193k for English (see 
langdata_lstm/issues/6 
<https://github.com/tesseract-ocr/langdata_lstm/issues/6>), and it does not 
contain all characters. So I tried to prepare a larger dataset.
- Concerning Arabic diacritics 
<https://en.wikipedia.org/wiki/Arabic_diacritics>, should the training text 
contain all combinations of letters and diacritics? For example, if the 
training text doesn't have the combination بَ (letter ب + *fatha*), can the 
model recognise it after training?
- How can I regenerate the files ara.punc, ara.numbers, ara.wordlist, 
ara.config, ara.unicharset, etc.?
- By the way, most of the files in 
https://github.com/tesseract-ocr/langdata_lstm have not changed in 5 years. 
Is the repository open to contributions? Is it possible to retrain some 
languages?
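
To check which letter and diacritic combinations a candidate training text actually covers, a minimal Python sketch could look like the following. It assumes the eight marks U+064B to U+0652 (fathatan through sukun) are the diacritics of interest; the helper names `diacritic_pairs` and `missing_pairs` are made up for illustration and are not part of Tesseract.

```python
import unicodedata
from collections import Counter

# Assumption: the diacritics of interest are fathatan (U+064B) through sukun (U+0652).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def diacritic_pairs(text):
    """Count (base letter, diacritic) pairs that occur in text."""
    pairs = Counter()
    base = None
    for ch in unicodedata.normalize("NFC", text):
        if ch in DIACRITICS:
            # A mark attaches to the most recent letter, if there is one.
            if base is not None:
                pairs[(base, ch)] += 1
        elif unicodedata.category(ch).startswith("L"):
            base = ch
        else:
            # Spaces, digits and punctuation end the current cluster.
            base = None
    return pairs


def missing_pairs(text, letters):
    """Return the (letter, diacritic) pairs over `letters` that never occur in text."""
    seen = diacritic_pairs(text)
    return sorted((l, d) for l in letters for d in DIACRITICS if (l, d) not in seen)
```

Calling `missing_pairs(open("ara.training_text", encoding="utf-8").read(), letters)` with a list of the base letters would list the combinations the text lacks. Whether the model needs to see every combination is exactly the open question above; the script only measures coverage.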

*2. Ground truth:*
- I used text2image to generate 1k text lines, multiplied by ~20 fonts, for a 
total of 20k images. Is that enough?
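
For context, the per-font generation step can be sketched in Python as below. The font names, directory paths and output naming are placeholders, not my exact setup, and the flags used (`--text`, `--outputbase`, `--font`, `--fonts_dir`, `--strip_unrenderable_words`) should be checked against `text2image --help` for the installed version.

```python
import subprocess
from pathlib import Path


def text2image_cmd(text_file, font, out_dir, fonts_dir="/usr/share/fonts"):
    """Build the text2image command line for one font (nothing is executed here)."""
    # Placeholder naming scheme: ara.<Font_Name>.exp0
    outputbase = Path(out_dir) / f"ara.{font.replace(' ', '_')}.exp0"
    return [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={outputbase}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
        "--strip_unrenderable_words",
    ]


def generate(text_file, fonts, out_dir, run=False):
    """Return one command per font; execute them only when run=True."""
    cmds = [text2image_cmd(text_file, font, out_dir) for font in fonts]
    if run:
        Path(out_dir).mkdir(parents=True, exist_ok=True)
        for cmd in cmds:
            subprocess.run(cmd, check=True)
    return cmds
```

With `run=False` the function only returns the command lists, which makes it easy to review them before rendering thousands of images.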

*3. Box files:* (ref. 
https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html)
- Should I use the box files generated by text2image, or the WordStr 
format, since Arabic is a right-to-left language?
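
As far as I understand the page linked above, a WordStr entry describes a whole text line with one bounding box, followed by a tab entry marking the end of the line. The sketch below writes such an entry for a single line image; the coordinates used for the tab entry and the function name `wordstr_box` are assumptions, and any bidirectional reordering of the text is deliberately left out.

```python
def wordstr_box(text, width, height, page=0):
    """Return the two box-file lines for one line image in WordStr form.

    Assumption: the box spans the whole image (0, 0, width, height) and the
    end-of-line tab entry reuses the same coordinates.
    """
    text = text.strip()
    if "\n" in text:
        raise ValueError("a WordStr entry describes exactly one text line")
    return [
        f"WordStr 0 0 {width} {height} {page} #{text}",
        f"\t 0 0 {width} {height} {page}",
    ]
```

Joining the two returned lines with a newline and writing them next to the line image as a `.box` file would give one ground-truth pair per image.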

*4. Training:*
I started from the existing model (START_MODEL=ara). The number of iterations 
is 20k. Is that enough?
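
For reference, the fine-tuning run can be written down as a small Python wrapper, assuming the tesstrain Makefile is what drives the training. The model name `ara_custom` and the tessdata path are placeholders; `MODEL_NAME`, `START_MODEL`, `TESSDATA` and `MAX_ITERATIONS` are the Makefile variables as I understand them.

```python
import subprocess


def training_cmd(model_name, start_model="ara", max_iterations=20000,
                 tessdata="/usr/share/tessdata"):
    """Build the `make training` invocation (nothing is executed here)."""
    return [
        "make",
        "training",
        f"MODEL_NAME={model_name}",          # placeholder name for the new model
        f"START_MODEL={start_model}",        # fine-tune from the existing Arabic model
        f"TESSDATA={tessdata}",              # placeholder path to the traineddata files
        f"MAX_ITERATIONS={max_iterations}",  # 20k, as in my run
    ]


def run_training(model_name, cwd=".", **kwargs):
    """Run the training from a tesstrain checkout located at `cwd`."""
    return subprocess.run(training_cmd(model_name, **kwargs), cwd=cwd, check=True)
```

Changing `max_iterations` in one place makes it easier to compare runs of different lengths.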

If you have any other suggestions or remarks, please share them.
Thanks!
