@Eliyaz I do not know Arabic or any other RTL language.
I suggest you try running training with the latest code and tesstrain. You
may have to experiment to get the best result.

I will try to do a test run with the data you provided. Does it include
numbers and dates?
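
If you do go the route of a dates-only model, here is a rough sketch of how
the transcription side of the ground truth could be generated. As far as I
recall from the tesstrain README, it expects line images plus matching
single-line `.gt.txt` files under data/MODEL_NAME-ground-truth. Everything
else here is an assumption for illustration: the dd/mm/yyyy layout, the year
range, the use of Arabic-Indic digits, and the file naming. The matching line
images still have to be rendered or cropped separately, and the digit order
should be checked against your real images.

```python
import random
from pathlib import Path

# Map ASCII digits 0-9 to Arabic-Indic digits U+0660..U+0669.
TO_ARABIC_INDIC = {ord(str(d)): chr(0x0660 + d) for d in range(10)}


def random_date_line(rng):
    """Return one dd/mm/yyyy date written with Arabic-Indic digits.

    The layout and the year range are assumptions, not something
    tesseract or tesstrain prescribes.
    """
    day = rng.randint(1, 28)        # 1..28 is valid for every month
    month = rng.randint(1, 12)
    year = rng.randint(1950, 2030)  # arbitrary illustrative range
    return f"{day:02d}/{month:02d}/{year}".translate(TO_ARABIC_INDIC)


def write_ground_truth(out_dir, count, seed=0):
    """Write `count` single-line .gt.txt transcriptions into out_dir.

    Only the text files are produced; each one still needs a line image
    with the same base name before training can use it.
    """
    rng = random.Random(seed)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(count):
        path = out / f"date_{i:05d}.gt.txt"  # hypothetical naming scheme
        path.write_text(random_date_line(rng) + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

For example, write_ground_truth("data/aradate-ground-truth", 1000) would
create 1000 transcription files, where "aradate" is a made-up model name.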

On Tue, Jul 14, 2020, 13:18 Eliyaz L <write2eli...@gmail.com> wrote:

> Hi, sorry to bother you, just a follow-up.
>
> I tried the latest tesseract. It works fine with Arabic text and
> numbers; the only issue is with Arabic dates.
> So if the issue is still open, can I prepare a dataset and train a separate
> custom model for only numbers and dates?
>
> If possible, please help me with a sample dataset. Can I use this
> <https://github.com/tesseract-ocr/tesstrain> repo to train? An approximate
> dataset size and iteration count would also be helpful.
>
> On Monday, July 13, 2020 at 11:55:41 AM UTC+3, Eliyaz L wrote:
>>
>> Thanks for the support, it saves a lot of time and effort.
>>
>> I tried the latest tesseract. It works fine with Arabic text and
>> numbers; the only issue is with Arabic dates.
>> So if the issue is still open, can I prepare a dataset and train a separate
>> custom model for only numbers and dates?
>>
>> If possible, please help me with a sample dataset. Can I use this
>> <https://github.com/tesseract-ocr/tesstrain> repo to train? An approximate
>> dataset size and iteration count would also be helpful.
>>
>> On Monday, July 13, 2020 at 7:24:57 AM UTC+3, shree wrote:
>>>
>>> If I recall correctly, ara_number.traineddata was trained for the
>>> legacy engine. You cannot use two traineddata files together when each
>>> uses a different engine.
>>>
>>> Training of Arabic numbers and punctuation is currently an
>>> open issue. If you use the latest code from the tesstrain repo, it should
>>> automatically apply the bidi algorithm to handle Arabic text as well as
>>> numbers correctly. I am not so sure about punctuation such as ( ) and
>>> whether it needs to be reversed or not.
>>>
>>> I suggest that you try the latest code from the tesseract and tesstrain
>>> repos with the latest traineddata.
>>>
>>> On Sun, Jul 12, 2020, 20:52 Eliyaz L <write2...@gmail.com> wrote:
>>>
>>>> Hi Shree,
>>>>
>>>> I was using the version below. I guess you are right, it is a 2016 file.
>>>> Let me test with the latest traineddata.
>>>> https://tesseract-ocr.github.io/tessdoc/Data-Files
>>>> https://github.com/tesseract-ocr/tessdata/raw/4.00/ara.traineddata
>>>>
>>>>
>>>> Meanwhile, can you please help me with Arabic numbers?
>>>> I tried ara_number.traineddata from here
>>>> <https://github.com/ahmed-tea/tessdata_Arabic_Numbers/blob/master/ara_number.traineddata>.
>>>> It works for numbers, but I am unable to get the date format with slashes.
>>>> I also searched for similar issues here
>>>> <https://github.com/tesseract-ocr/tesseract/issues/1193> and here
>>>> <https://github.com/Shreeshrii/tessdata_arabic>.
>>>>
>>>> The main problem is with dates: I am trying to recognize an Arabic date
>>>> in the format below.
>>>>
>>>> Input image:
>>>>
>>>> [image: date.jpg]
>>>>
>>>>
>>>>
>>>>
>>>> On Sunday, July 12, 2020 at 4:27:07 PM UTC+3, shree wrote:
>>>>>
>>>>> See https://github.com/tesseract-ocr/tesseract/issues/758 and other
>>>>> similar issues
>>>>>
>>>>> On Sun, Jul 12, 2020 at 6:52 PM Shree Devi Kumar <shree...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> @Eliyaz What version of tesseract are you using? Which traineddata?
>>>>>>
>>>>>> >Always the letter "لا" is predicted as "ال" .
>>>>>>
>>>>>> I think this was fixed by Ray Smith in 2017 and should be OK in the
>>>>>> traineddata files in the tessdata_fast and tessdata_best repos.
>>>>>>
>>>>>> On Sun, Jul 12, 2020 at 6:45 PM Rainer Verteidiger <
>>>>>> materialde...@gmail.com> wrote:
>>>>>>
>>>>>>>
>>>>>>> Always the letter "لا" is predicted as "ال" .
>>>>>>>
>>>>>>> Not sure how much relevancy that bears in the context of training
>>>>>>> models, but لا is no letter! It's a ligature ("Arabic Ligature Lam with
>>>>>>> Alef") formed by combining ل ("Arabic Letter Lam") with ا ("Arabic
>>>>>>> Letter Alef") whereas ال is ا followed by ل (so, the exact opposite
>>>>>>> way around; no ligature). Both are incredibly common in Arabic texts
>>>>>>> and although I have no clue about machine learning, I'm surprised how
>>>>>>> the training could miss the difference between them.
>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to tesser...@googlegroups.com.
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/de95d94b-9dcd-432c-a06c-3180d6c741afo%40googlegroups.com
>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/de95d94b-9dcd-432c-a06c-3180d6c741afo%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>> .
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> ____________________________________________________________
>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>
