Hi, sorry to bother you, this is just a follow-up.

I tried the latest Tesseract. It works fine with Arabic text and 
numbers; the only issue is with Arabic dates.
So if the issue is still open, can I prepare a dataset and train a separate 
custom model for numbers and dates only?

If that is possible, please help me with a sample dataset. Can I use the 
<https://github.com/tesseract-ocr/tesstrain> repo to train? An approximate 
dataset size and iteration count would also be helpful.
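
To make the question concrete, here is a minimal sketch of how the transcription lines for such a numbers-and-dates dataset could be generated. It assumes the dates use Arabic-Indic digits (U+0660 to U+0669) in a day/month/year layout separated by slashes, which is only a guess at the format in question. The function names, the year range, and the sample count are hypothetical placeholders, not values taken from the tesstrain documentation.

```python
import random

# Map ASCII digits 0-9 to Arabic-Indic digits U+0660..U+0669.
ARABIC_INDIC = {ord(str(d)): chr(0x0660 + d) for d in range(10)}


def to_arabic_indic(text):
    """Replace ASCII digits with Arabic-Indic digits; keep everything else."""
    return text.translate(ARABIC_INDIC)


def random_date_line(rng):
    """Build one date string in an assumed DD/MM/YYYY layout."""
    day = rng.randint(1, 28)        # 28 keeps every month valid
    month = rng.randint(1, 12)
    year = rng.randint(1950, 2030)  # hypothetical range, adjust as needed
    return to_arabic_indic(f"{day:02d}/{month:02d}/{year:04d}")


def make_lines(count, seed=0):
    """Return `count` date lines; a fixed seed makes the output repeatable."""
    rng = random.Random(seed)
    return [random_date_line(rng) for _ in range(count)]


if __name__ == "__main__":
    for line in make_lines(5):
        print(line)
```

Each generated line would still need a matching line image, since tesstrain pairs every line image with a single-line .gt.txt transcription of the same base name in its ground-truth folder.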

On Monday, July 13, 2020 at 11:55:41 AM UTC+3, Eliyaz L wrote:
>
> Thanks for the support, it saves a lot of time and effort.
>
> On Monday, July 13, 2020 at 7:24:57 AM UTC+3, shree wrote:
>>
>> If I recall correctly, ara_number.traineddata was trained for the 
>> legacy engine. You cannot combine two traineddata files that each use a 
>> different engine.
>>
>> Training of Arabic numbers and punctuation is currently an 
>> open issue. If you use the latest code from the tesstrain repo, it should 
>> automatically apply the bidi algorithm to handle Arabic text as well as 
>> numbers correctly. I am not so sure about punctuation such as ( ) and 
>> whether it needs to be reversed or not.
>>
>> I suggest that you try the latest code from the tesseract and tesstrain 
>> repos with the latest traineddata.
>>
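
On the punctuation question above, a small sketch with Python's standard unicodedata module can at least show which Unicode bidirectional class each character in a date belongs to, and which ones are flagged as mirrored. It only inspects character properties. It does not run the full bidi algorithm and says nothing about what tesstrain does internally. The sample characters and the helper name bidi_info are my own choices.

```python
import unicodedata


def bidi_info(ch):
    """Return (bidirectional class, mirrored flag) for one character."""
    return unicodedata.bidirectional(ch), bool(unicodedata.mirrored(ch))


# Arabic letter alef, Arabic-Indic digit three, ASCII digit, slash, parentheses.
SAMPLES = ["\u0627", "\u0663", "3", "/", "(", ")"]

for ch in SAMPLES:
    cls, mirrored = bidi_info(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: bidi={cls}, mirrored={mirrored}")
```

Characters flagged as mirrored, such as ( and ), are the ones a bidi-aware renderer swaps visually inside right-to-left runs, so they are the likely candidates for the reversal question.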
>> On Sun, Jul 12, 2020, 20:52 Eliyaz L <write2...@gmail.com> wrote:
>>
>>> Hi Shree,
>>>
>>> I was using the version below. I guess you are right, it is a 2016 file. 
>>> Let me test with the latest traineddata. 
>>> https://tesseract-ocr.github.io/tessdoc/Data-Files
>>> https://github.com/tesseract-ocr/tessdata/raw/4.00/ara.traineddata
>>>
>>>
>>> Meanwhile, can you please help me with Arabic numbers?
>>> I tried ara_number.traineddata from here 
>>> <https://github.com/ahmed-tea/tessdata_Arabic_Numbers/blob/master/ara_number.traineddata>.
>>> It works for numbers, but I am unable to get the date format with slashes.
>>> I also searched for similar issues here 
>>> <https://github.com/tesseract-ocr/tesseract/issues/1193> and here 
>>> <https://github.com/Shreeshrii/tessdata_arabic>.
>>>
>>> The main problem is with dates: I am trying to recognize Arabic dates in 
>>> the format below.
>>>
>>> Input image: 
>>>
>>> [image: date.jpg]
>>>
>>>
>>>
>>>
>>> On Sunday, July 12, 2020 at 4:27:07 PM UTC+3, shree wrote:
>>>>
>>>> See https://github.com/tesseract-ocr/tesseract/issues/758 and other 
>>>> similar issues
>>>>
>>>> On Sun, Jul 12, 2020 at 6:52 PM Shree Devi Kumar <shree...@gmail.com> 
>>>> wrote:
>>>>
>>>>> @Eliyaz What version of tesseract are you using? Which traineddata?
>>>>>
>>>>> >Always the letter "لا" is predicted as "ال" .
>>>>>
>>>>> I think this was fixed by Ray Smith in 2017 and should be OK in the 
>>>>> traineddata files in the tessdata_fast and tessdata_best repos.
>>>>>
>>>>> On Sun, Jul 12, 2020 at 6:45 PM Rainer Verteidiger <
>>>>> materialde...@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> Always the letter "لا" is predicted as "ال" .
>>>>>>
>>>>>> Not sure how much relevancy that bears in the context of training 
>>>>>> models, but لا is no letter! It's a ligature ("Arabic Ligature Lam with 
>>>>>> Alef") formed by combining ل ("Arabic Letter Lam") with ا ("Arabic 
>>>>>> Letter Alef") whereas ال is ا followed by ل (so, the exact opposite way 
>>>>>> around; no ligature). Both are incredibly common in Arabic texts and 
>>>>>> although I have no clue about machine learning, I'm surprised how the 
>>>>>> training could miss the difference between them.
>>>>>>
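
The distinction described above can be checked at the code point level. The sketch below, using only Python's standard unicodedata module, lists the code points of both sequences and shows that the precomposed presentation form U+FEFB normalizes (NFKC) to lam followed by alef, not the reverse. The helper names are hypothetical, and this says nothing about how Tesseract encodes these characters internally.

```python
import unicodedata

LAM_ALEF = "\u0644\u0627"  # lam then alef: rendered as the ligature
ALEF_LAM = "\u0627\u0644"  # alef then lam: no ligature
LIGATURE = "\uFEFB"        # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM


def codepoints(text):
    """List each character as 'U+XXXX NAME' in logical (stored) order."""
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in text]


def same_after_nfkc(a, b):
    """Compare two strings after NFKC compatibility normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


if __name__ == "__main__":
    print(codepoints(LAM_ALEF))
    print(codepoints(ALEF_LAM))
    print(same_after_nfkc(LIGATURE, LAM_ALEF))
    print(same_after_nfkc(LIGATURE, ALEF_LAM))
```

Comparing OCR output against ground truth in this logical order, rather than by eye, makes it easier to tell whether the two characters were really swapped or only displayed differently.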
>>>>>> -- 
>>>>>> You received this message because you are subscribed to the Google 
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>> send an email to tesser...@googlegroups.com.
>>>>>> To view this discussion on the web visit 
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/de95d94b-9dcd-432c-a06c-3180d6c741afo%40googlegroups.com
>>>>>>  
>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/de95d94b-9dcd-432c-a06c-3180d6c741afo%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>> .
>>>>>>
>>>>>
>>>>>
>>>>> -- 
>>>>>
>>>>> ____________________________________________________________
>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>
>>>>
>>

