[tesseract-ocr] Re: General guidelines for training Arabic

Des Bw Fri, 08 Sep 2023 02:46:25 -0700

I am also just starting out with Tesseract, and am by no means an expert. 
But from what I have learned from reading in various places, it might be good 
for you to increase the number of training lines to get better results. The 
number of iterations (20k) is sufficient for a first round. You can increase 
it step by step.
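
To get a feel for how large a training text is and which letter + diacritic 
combinations it actually contains, a rough Python sketch like the one below 
might help. It is only an illustration, not part of the Tesseract tooling: 
the function names are made up, and it assumes the basic Arabic letters sit in 
U+0621 to U+064A and the common diacritics in U+064B to U+0652. The file name 
in the commented usage lines is the ara.training_text from the link in the 
quoted message.

```python
from collections import Counter

# Assumption: basic Arabic letters are U+0621..U+064A (this range also holds a
# few non-letters such as tatweel, U+0640) and the common diacritics
# (fathatan .. sukun) are U+064B..U+0652.
ARABIC_LETTERS = {chr(c) for c in range(0x0621, 0x064B)}
ARABIC_DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def coverage_report(lines):
    """Return (number of non-empty lines, Counter of (letter, diacritic) pairs).

    Only a diacritic that directly follows a letter is counted, so in a
    sequence like letter + shadda + fatha only (letter, shadda) is recorded.
    """
    n_lines = 0
    pairs = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        n_lines += 1
        for prev, cur in zip(line, line[1:]):
            if prev in ARABIC_LETTERS and cur in ARABIC_DIACRITICS:
                pairs[(prev, cur)] += 1
    return n_lines, pairs


def missing_pairs(pairs):
    """List every (letter, diacritic) combination that never occurs."""
    return sorted(
        (letter, mark)
        for letter in ARABIC_LETTERS
        for mark in ARABIC_DIACRITICS
        if (letter, mark) not in pairs
    )


# Hypothetical usage on a local copy of the training text:
# with open("ara.training_text", encoding="utf-8") as f:
#     n_lines, pairs = coverage_report(f)
# print(n_lines, "lines,", len(missing_pairs(pairs)), "combinations missing")
```

This only reports what is present in the text. Whether the model can 
recognise a combination it never saw in training is a separate question that 
this check does not answer.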



On Wednesday, August 16, 2023 at 7:17:49 AM UTC+3 [email protected] wrote:

> Hi,
> I am trying to improve the model for the Arabic language. Unfortunately, the 
> results are not good enough. I have probably been going in the wrong 
> direction, so I would like to hear ideas from the community.
>
> *1. Training text:*
> The training text 
> <https://github.com/tesseract-ocr/langdata_lstm/blob/main/ara/ara.training_text>
> used for training Arabic is very small: 80 lines vs 193k for English (see 
> langdata_lstm/issues/6 
> <https://github.com/tesseract-ocr/langdata_lstm/issues/6>), and does not 
> contain all characters. So I tried to prepare a larger dataset.
> - Concerning Arabic diacritics 
> <https://en.wikipedia.org/wiki/Arabic_diacritics>, should the training 
> text contain all combinations of letters and diacritics? For example, if 
> the training text doesn't have the combination بَ (letter ب + *fatha*), 
> can the model recognise it after training?
> - How do I regenerate the files ara.punc, ara.numbers, ara.wordlist, 
> ara.config, ara.unicharset...?
> - By the way, most of the files in 
> https://github.com/tesseract-ocr/langdata_lstm haven't changed in 5 
> years. Is it open to contributions? Is it possible to retrain some 
> languages?
>
> *2. Ground truth:*
> - I used text2image to generate 1k text lines multiplied by ~20 fonts = 
> a total of 20k images. Is that enough?
>
> *3. Box files:* (ref 
> https://tesseract-ocr.github.io/tessdoc/tess4/Make-Box-Files.html)
> - Should I use the box files generated by text2image, or the WordStr 
> format, since it's a right-to-left language?
>
> *4. Training:*
> I started from the existing model (START_MODEL=ara). The number of 
> iterations is 20k. Is that enough?
>
> If you have any other suggestions or remarks, please share.
> Thanks!
>
