[tesseract-ocr] General guidelines for training Arabic

Wael TELLAT Tue, 15 Aug 2023 21:17:43 -0700

Hi,
I am trying to improve the Arabic model. Unfortunately, the results are
not good enough, probably because I have been going in the wrong direction.
So I would like to get ideas from the community.

*1. Training text:*
The training text
<https://github.com/tesseract-ocr/langdata_lstm/blob/main/ara/ara.training_text>
used for training Arabic is very small: 80 lines vs 193k for English (see
langdata_lstm/issues/6
<https://github.com/tesseract-ocr/langdata_lstm/issues/6>), and it does not
contain all characters. So I tried to prepare a larger dataset.
- Concerning Arabic diacritics
<https://en.wikipedia.org/wiki/Arabic_diacritics>, should the training text
contain all combinations of letters and diacritics? For example, if the
training text doesn't have the combination بَ (letter ب + *fatha*), can the
model recognise it after training?
- How can I regenerate the files ara.punc, ara.numbers, ara.wordlist,
ara.config, ara.unicharset...?
- By the way, most of the files in
https://github.com/tesseract-ocr/langdata_lstm haven't changed in 5 years.
Is the repository open to contributions? Is it possible to retrain some
languages?
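
To make the diacritics question concrete, here is a minimal sketch of how
the letter + diacritic coverage of a training text could be measured. It
only uses Python's standard unicodedata module. The helper names
(is_arabic_letter, letter_diacritic_pairs, missing_pairs) are made up for
this example, and limiting the diacritics to U+064B..U+0652 is an
assumption; it says nothing about what the model can generalise to.

```python
import unicodedata
from collections import Counter

# Assumption: only the common tashkil marks U+064B..U+0652
# (tanwin, fatha, damma, kasra, shadda, sukun) are of interest.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def is_arabic_letter(ch):
    """True for letters (category Lo) in the basic Arabic block."""
    return "\u0600" <= ch <= "\u06FF" and unicodedata.category(ch) == "Lo"


def letter_diacritic_pairs(text):
    """Count (letter, diacritic) pairs found in the text.

    Every diacritic attached to a letter is counted against that letter,
    so a letter carrying shadda + fatha yields two pairs.
    """
    pairs = Counter()
    base = None
    for ch in unicodedata.normalize("NFC", text):
        if is_arabic_letter(ch):
            base = ch
        elif ch in DIACRITICS and base is not None:
            pairs[(base, ch)] += 1
        else:
            base = None  # any other character ends the current cluster
    return pairs


def missing_pairs(text, letters, diacritics=None):
    """List the (letter, diacritic) pairs that never occur in the text."""
    if diacritics is None:
        diacritics = sorted(DIACRITICS)
    seen = letter_diacritic_pairs(text)
    return [(l, d) for l in letters for d in diacritics if (l, d) not in seen]
```

Reading the candidate training text into a string and passing it to
missing_pairs together with the letters of interest gives the list of
combinations that still need example lines.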

*2. Ground truth:*
- I used text2image to generate 1k text lines, each rendered in ~20 fonts,
for a total of ~20k images. Is that enough?
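
For reference, the generation step can be sketched roughly as below. This
is not my exact script: it assumes one text file per line, and the helper
names, the output naming scheme and the default fonts directory are
placeholders. Only the text2image options --text, --outputbase, --font and
--fonts_dir are used.

```python
from pathlib import Path


def text2image_command(text_file, font, out_dir, fonts_dir="/usr/share/fonts"):
    """Build one text2image invocation for one text file and one font."""
    # Hypothetical naming scheme: <text stem>.<font name with underscores>
    font_tag = font.replace(" ", "_")
    outputbase = Path(out_dir) / f"{Path(text_file).stem}.{font_tag}"
    return [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={outputbase}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
    ]


def plan_commands(text_files, fonts, out_dir):
    """One command per (text file, font) pair: 1k lines x 20 fonts = 20k."""
    return [
        text2image_command(text_file, font, out_dir)
        for text_file in text_files
        for font in fonts
    ]
```

Each command in the returned list can then be run with
subprocess.run(cmd, check=True).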

*3. Box files:* (ref
https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html)
- Should I use the box files generated by text2image, or the WordStr
format, since Arabic is a right-to-left language?

*4. Training:*
I started from the existing model (START_MODEL=ara) and trained for 20k
iterations. Is that enough?

If you have any other suggestions or remarks, please share.
Thanks!

