Thanks for the support; it saves a lot of time and effort. I tried the latest Tesseract, and it works fine with Arabic text and numbers. The only remaining issue is with Arabic dates. If that issue is still open, can I prepare a dataset and train a separate custom model for numbers and dates only?
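As a rough illustration of the "prepare a dataset" step, here is a minimal Python sketch that writes only the transcription side of a date dataset. It assumes the layout described in the tesstrain README, where each line image sits next to a single-line transcription with the same name and a `.gt.txt` extension. The day/month/year order, the Arabic-Indic digits, the file naming and the `ara_date` model name are all assumptions made for illustration; the format in date.jpg is not visible in this thread. Rendering the matching line images is left out.

```python
import random
from datetime import date, timedelta
from pathlib import Path

# Map ASCII digits 0-9 to Arabic-Indic digits U+0660..U+0669.
ARABIC_INDIC = {ord(str(i)): chr(0x0660 + i) for i in range(10)}


def random_date(rng, start=date(1990, 1, 1), end=date(2030, 12, 31)):
    """Pick a random calendar date in [start, end] (range is an assumption)."""
    span_days = (end - start).days
    return start + timedelta(days=rng.randrange(span_days + 1))


def date_line(d, sep="/"):
    """Format a date as DD/MM/YYYY in Arabic-Indic digits (assumed format)."""
    ascii_text = f"{d.day:02d}{sep}{d.month:02d}{sep}{d.year:04d}"
    return ascii_text.translate(ARABIC_INDIC)


def write_ground_truth(out_dir, count, seed=0):
    """Write `count` single-line .gt.txt transcriptions and return their texts.

    The matching line images (same base name, .tif or .png) still have to be
    produced separately, for example by rendering each line in a target font.
    """
    rng = random.Random(seed)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    lines = []
    for i in range(count):
        text = date_line(random_date(rng))
        # Hypothetical naming scheme: date_00000.gt.txt, date_00001.gt.txt, ...
        (out / f"date_{i:05d}.gt.txt").write_text(text + "\n", encoding="utf-8")
        lines.append(text)
    return lines
```

For example, `write_ground_truth("data/ara_date-ground-truth", 1000)` would fill a ground-truth folder for a hypothetical model named `ara_date`. How many lines and iterations are actually needed is not something this sketch can answer.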
If that is possible, please help me with a sample dataset. Can I use the <https://github.com/tesseract-ocr/tesstrain> repo for training? An approximate dataset size and iteration count would also be helpful.

On Monday, July 13, 2020 at 7:24:57 AM UTC+3, shree wrote:
>
> If I recall correctly, ara_number.traineddata has been trained for the legacy
> engine. You cannot use two traineddata files each using a different engine.
>
> Regarding training of Arabic numbers and punctuation, it is currently an
> open issue. If you use the latest code from the tesstrain repo, it should
> automatically apply the bidi algorithm to handle Arabic text as well as numbers
> correctly. I am not so sure about punctuation such as ( ) etc. and whether
> it needs to be reversed or not.
>
> I suggest that you use the latest code from the tesseract and tesstrain repos
> with the latest traineddata and try.
>
> On Sun, Jul 12, 2020, 20:52 Eliyaz L <write2...@gmail.com> wrote:
>
>> Hi Shree,
>>
>> I was using the version below. I guess you are right; it's a 2016 file. Let
>> me test with the latest traineddata.
>> https://tesseract-ocr.github.io/tessdoc/Data-Files
>> https://github.com/tesseract-ocr/tessdata/raw/4.00/ara.traineddata
>>
>> Meanwhile, can you please help me with Arabic numbers?
>> I tried ara_number.traineddata from here
>> <https://github.com/ahmed-tea/tessdata_Arabic_Numbers/blob/master/ara_number.traineddata>.
>> It works for numbers, but I am unable to get the date format with slashes.
>> I also searched for similar issues here
>> <https://github.com/tesseract-ocr/tesseract/issues/1193> and here
>> <https://github.com/Shreeshrii/tessdata_arabic>.
>>
>> The main problem is with dates: I am trying to recognize an Arabic date in the
>> format below.
>>
>> Input image:
>>
>> [image: date.jpg]
>>
>> On Sunday, July 12, 2020 at 4:27:07 PM UTC+3, shree wrote:
>>>
>>> See https://github.com/tesseract-ocr/tesseract/issues/758 and other
>>> similar issues.
>>>
>>> On Sun, Jul 12, 2020 at 6:52 PM Shree Devi Kumar <shree...@gmail.com>
>>> wrote:
>>>
>>>> @Eliyaz What version of tesseract are you using? Which traineddata?
>>>>
>>>> > Always the letter "لا" is predicted as "ال".
>>>>
>>>> I think this was fixed by Ray Smith in 2017 and should be OK in the
>>>> traineddata files in the tessdata_fast and tessdata_best repos.
>>>>
>>>> On Sun, Jul 12, 2020 at 6:45 PM Rainer Verteidiger <
>>>> materialde...@gmail.com> wrote:
>>>>
>>>>> Always the letter "لا" is predicted as "ال".
>>>>>
>>>>> Not sure how much relevance that has in the context of training
>>>>> models, but لا is not a letter! It's a ligature ("Arabic Ligature Lam with
>>>>> Alef") formed by combining ل ("Arabic Letter Lam") with ا ("Arabic Letter
>>>>> Alef"), whereas ال is ا followed by ل (so, exactly the other way around;
>>>>> no ligature). Both are incredibly common in Arabic texts, and although I
>>>>> have no clue about machine learning, I'm surprised that the training could
>>>>> miss the difference between them.
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to tesser...@googlegroups.com.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/de95d94b-9dcd-432c-a06c-3180d6c741afo%40googlegroups.com
>>>>> .
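The لا / ال distinction raised in the quoted messages can be checked at the code-point level. The Python sketch below uses only the standard `unicodedata` module. It shows that the two strings are the same two letters in opposite logical order, and that the presentation-form ligature code point U+FEFB decomposes to Lam followed by Alef under NFKC. The helper names are illustrative. Reading the reported swap as an ordering problem rather than a shape confusion is only one plausible interpretation, not something confirmed in this thread.

```python
import unicodedata

LAM_ALEF = "\u0644\u0627"  # Lam then Alef; fonts render this pair as the ligature
ALEF_LAM = "\u0627\u0644"  # Alef then Lam; no ligature is formed
LIGATURE = "\uFEFB"        # single presentation-form code point for the ligature


def describe(text):
    """List each code point of `text` with its official Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


def normalize_ocr_text(text):
    """Fold presentation forms into base letters so comparisons are stable."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    for label, value in (("lam+alef", LAM_ALEF),
                         ("alef+lam", ALEF_LAM),
                         ("ligature", LIGATURE)):
        print(label, describe(value))
    # The two sequences are exact reversals of each other in logical order.
    print(LAM_ALEF[::-1] == ALEF_LAM)
    # After NFKC, the ligature code point equals the Lam + Alef sequence.
    print(normalize_ocr_text(LIGATURE) == LAM_ALEF)
```

Applying the same normalization to both OCR output and ground truth before comparing them avoids counting a ligature code point and its two-letter spelling as different results.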