@Eliyaz I do not know Arabic or any other RTL language. I suggest you try running training with the latest tesseract code and tesstrain. You may have to experiment to get the best result.
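For the numbers-and-dates experiment, ground-truth text lines could be generated rather than typed by hand. The following is only a minimal sketch: the function name `random_date_lines`, the `dd/mm/yyyy` layout, the year range, and the use of Arabic-Indic digits are all assumptions, since the actual format in date.jpg is not stated in this thread.

```python
import random
from datetime import date, timedelta

# Map ASCII digits 0-9 to Arabic-Indic digits U+0660..U+0669.
ARABIC_INDIC = str.maketrans(
    "0123456789", "".join(chr(0x0660 + i) for i in range(10))
)


def random_date_lines(count, seed=0, fmt="%d/%m/%Y"):
    """Return `count` date strings written with Arabic-Indic digits.

    Hypothetical helper: the dd/mm/yyyy layout and the 1990-2029 range
    are assumptions, not something taken from the thread.
    """
    rng = random.Random(seed)  # seeded so a run is reproducible
    start = date(1990, 1, 1)
    lines = []
    for _ in range(count):
        day = start + timedelta(days=rng.randrange(365 * 40))
        lines.append(day.strftime(fmt).translate(ARABIC_INDIC))
    return lines


if __name__ == "__main__":
    for line in random_date_lines(5):
        print(line)
```

As far as I understand the tesstrain README, each line image needs a matching single-line `.gt.txt` transcription with the same base name, so these strings would still have to be rendered to images or paired with real scans.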
I will try to do a test run with the data you provided. Does it include numbers and dates?

On Tue, Jul 14, 2020, 13:18 Eliyaz L <write2eli...@gmail.com> wrote:

> Hi, sorry to bother, just a follow-up.
>
> I tried the latest tesseract. It's working fine with the Arabic text and
> numbers, but the only issue is with Arabic dates.
> So if the issue is still open, can I prepare a dataset and train a separate
> custom model for only numbers and dates?
>
> If possible, please help me with a sample dataset. Can I use this
> <https://github.com/tesseract-ocr/tesstrain> repo to train? If an approximate
> dataset size and iteration count can be provided, that will be helpful.
>
> On Monday, July 13, 2020 at 11:55:41 AM UTC+3, Eliyaz L wrote:
>>
>> Thanks for the support, it saves a lot of time and effort.
>>
>> On Monday, July 13, 2020 at 7:24:57 AM UTC+3, shree wrote:
>>>
>>> If I recall correctly, ara_number.traineddata has been trained for the
>>> legacy engine. You cannot use two traineddata files each using a different
>>> engine.
>>>
>>> Regarding training of Arabic numbers and punctuation, it is currently an
>>> open issue. If you use the latest code from the tesstrain repo, it should
>>> automatically apply the bidi algorithm to handle Arabic text as well as
>>> numbers correctly. I am not so sure about punctuation such as ( ) etc. and
>>> whether they need to be reversed or not.
>>>
>>> I suggest that you use the latest code from the tesseract and tesstrain
>>> repos with the latest traineddata and try.
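To make the bidi remark above more concrete, the Unicode properties that the bidirectional algorithm relies on can be inspected with Python's standard `unicodedata` module. This is only an illustrative sketch and does not show what tesstrain itself does; the sample string and the helper name `bidi_report` are made up for the example.

```python
import unicodedata


def bidi_report(text):
    """Return (code point, bidi class, mirrored flag) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.bidirectional(ch), unicodedata.mirrored(ch))
        for ch in text
    ]


# Hypothetical sample: Arabic-Indic digits around a slash, a parenthesis,
# an Arabic letter, and a European digit for comparison.
SAMPLE = "\u0661\u0662/\u0660\u0667(\u06271"

if __name__ == "__main__":
    for code, bidi_class, mirrored in bidi_report(SAMPLE):
        print(code, bidi_class, "mirrored" if mirrored else "not mirrored")
```

Arabic-Indic digits report class AN, the slash reports CS (a separator whose direction depends on its neighbours), Arabic letters report AL, and parentheses carry the mirrored flag. That may be part of why dates with slashes and bracketed text are harder to get right than plain words or numbers.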
>>>
>>> On Sun, Jul 12, 2020, 20:52 Eliyaz L <write2...@gmail.com> wrote:
>>>
>>>> Hi Shree,
>>>>
>>>> I was using the version below. I guess you are right, it's a 2016 file.
>>>> Let me test with the latest traineddata.
>>>> https://tesseract-ocr.github.io/tessdoc/Data-Files
>>>> https://github.com/tesseract-ocr/tessdata/raw/4.00/ara.traineddata
>>>>
>>>> Meanwhile, can you please help me with Arabic numbers?
>>>> I tried ara_number.traineddata from here
>>>> <https://github.com/ahmed-tea/tessdata_Arabic_Numbers/blob/master/ara_number.traineddata>.
>>>> It is working for numbers, but I am unable to get the date format with a
>>>> slash. I also searched for similar issues here
>>>> <https://github.com/tesseract-ocr/tesseract/issues/1193> and here
>>>> <https://github.com/Shreeshrii/tessdata_arabic>.
>>>>
>>>> The main problem is with dates: I am trying to predict an Arabic date in
>>>> the format below.
>>>>
>>>> Input image:
>>>>
>>>> [image: date.jpg]
>>>>
>>>> On Sunday, July 12, 2020 at 4:27:07 PM UTC+3, shree wrote:
>>>>>
>>>>> See https://github.com/tesseract-ocr/tesseract/issues/758 and other
>>>>> similar issues
>>>>>
>>>>> On Sun, Jul 12, 2020 at 6:52 PM Shree Devi Kumar <shree...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> @Eliyaz What version of tesseract are you using? Which traineddata?
>>>>>>
>>>>>> >Always the letter "لا" is predicted as "ال" .
>>>>>>
>>>>>> I think this was fixed by Ray Smith in 2017 and should be OK in the
>>>>>> traineddata files in the tessdata_fast and tessdata_best repos.
>>>>>>
>>>>>> On Sun, Jul 12, 2020 at 6:45 PM Rainer Verteidiger <
>>>>>> materialde...@gmail.com> wrote:
>>>>>>
>>>>>>> Always the letter "لا" is predicted as "ال" .
>>>>>>>
>>>>>>> Not sure how much relevance that bears in the context of training
>>>>>>> models, but لا is no letter!
>>>>>>> It's a ligature ("Arabic Ligature Lam with Alef") formed by combining
>>>>>>> ل ("Arabic Letter Lam") with ا ("Arabic Letter Alef"), whereas ال is ا
>>>>>>> followed by ل (so, the exact opposite way around; no ligature). Both
>>>>>>> are incredibly common in Arabic texts, and although I have no clue
>>>>>>> about machine learning, I'm surprised that the training could miss
>>>>>>> the difference between them.
>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to tesser...@googlegroups.com.
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/de95d94b-9dcd-432c-a06c-3180d6c741afo%40googlegroups.com
>>>>>>
>>>>>> --
>>>>>>
>>>>>> ____________________________________________________________
>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
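The lam-alef point raised earlier in the thread can be checked at the code point level with Python's standard `unicodedata` module. This is a small sketch that assumes the strings in the thread are stored as two separate letters that the font merely renders as a ligature; the names `LAM_ALEF`, `ALEF_LAM`, `LIGATURE`, and `codepoints` are only illustrative.

```python
import unicodedata


def codepoints(text):
    """List each character as 'U+XXXX NAME' in logical (typing) order."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


LAM_ALEF = "\u0644\u0627"  # lam then alef: usually drawn as the ligature
ALEF_LAM = "\u0627\u0644"  # alef then lam: no ligature is formed
LIGATURE = "\uFEFB"        # presentation-form ligature code point

if __name__ == "__main__":
    print(codepoints(LAM_ALEF))
    print(codepoints(ALEF_LAM))
    print(codepoints(LIGATURE))
    # NFKC normalization decomposes the presentation form into lam + alef.
    print(unicodedata.normalize("NFKC", LIGATURE) == LAM_ALEF)
```

Comparing OCR output with ground truth this way shows whether two strings differ only in the order of the same two letters, which is the swap reported here.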