[tesseract-ocr] Tesseract-OCR Training Arabic text & numbers

2020-07-12 Thread Eliyaz L
Hi, my use case involves Arabic documents. The pretrained ara.traineddata is good but not perfect, so I wish to fine-tune ara.traineddata; if the results are not satisfying, I will have to train on my own custom data. Please advise on the following: 1. for my use case in Arabic text, proble

Re: [tesseract-ocr] Tesseract-OCR Training Arabic text & numbers

2020-07-12 Thread Shree Devi Kumar
What character are you trying to add? Please share the training data to try and replicate the issue.

Re: [tesseract-ocr] Tesseract-OCR Training Arabic text & numbers

2020-07-12 Thread Eliyaz L
The letter "لا" is always predicted as "ال". My training data is here. My prediction documents will be in the Traditional Arabic font, here. Below shell command u

Re: [tesseract-ocr] Tesseract-OCR Training Arabic text & numbers

2020-07-12 Thread Rainer Verteidiger
> Always the letter "لا" is predicted as "ال" .
Not sure how much relevance this has in the context of training models, but لا is not a letter! It's a ligature ("Arabic Ligature Lam with Alef") formed by combining ل ("Arabic Letter Lam") with ا ("Arabic Letter Alef") whereas ال is ا followed by
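The difference between the two strings can be checked at the code-point level. The following is a minimal Python sketch, not part of the original message; it assumes the text is stored as a plain Lam + Alef sequence rather than the precomposed presentation form, and the helper name `describe` is made up for illustration.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

lam_alef = "\u0644\u0627"  # Lam then Alef, usually rendered as the ligature
alef_lam = "\u0627\u0644"  # Alef then Lam, the reversed order
ligature = "\uFEFB"        # precomposed presentation form (isolated)

for label, s in [("lam+alef", lam_alef), ("alef+lam", alef_lam), ("ligature", ligature)]:
    print(label, describe(s))

# NFKC normalization maps the presentation form back to the two base letters.
print(unicodedata.normalize("NFKC", ligature) == lam_alef)
```

Comparing code points this way can help tell whether a model output really has the two letters in the wrong order or merely uses a different encoding of the same ligature.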

Re: [tesseract-ocr] Tesseract-OCR Training Arabic text & numbers

2020-07-12 Thread Shree Devi Kumar
@Eliyaz What version of tesseract are you using? Which traineddata?
> Always the letter "لا" is predicted as "ال" .
I think this was fixed by Ray Smith in 2017 and should be OK in the traineddata files in the tessdata_fast and tessdata_best repos.
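One way to answer the version question is to ask the installed binary directly. This is a hedged Python sketch, assuming a `tesseract` executable may or may not be on the PATH; the function name `tesseract_version` is hypothetical and it only wraps the CLI's `--version` flag.

```python
import shutil
import subprocess

def tesseract_version(binary="tesseract"):
    """Return the first line of `<binary> --version`, or None if it is unavailable."""
    path = shutil.which(binary)
    if path is None:
        return None  # binary not installed or not on PATH
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Some builds print the version banner to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None

print(tesseract_version())
```

The version string alone does not say how old a given traineddata file is, so the file's origin (tessdata, tessdata_fast, or tessdata_best) still has to be checked separately.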

Re: [tesseract-ocr] Tesseract-OCR Training Arabic text & numbers

2020-07-12 Thread Shree Devi Kumar
See https://github.com/tesseract-ocr/tesseract/issues/758 and other similar issues.

Re: [tesseract-ocr] Tesseract-OCR Training Arabic text & numbers

2020-07-12 Thread Eliyaz L
Hi Shree, I was using the version below. I guess you are right, it's a 2016 file. Let me test with the latest traineddata. https://tesseract-ocr.github.io/tessdoc/Data-Files https://github.com/tesseract-ocr/tessdata/raw/4.00/ara.traineddata Meanwhile, can you please help me with Arabic numbers? I tried ara_

Re: [tesseract-ocr] Tesseract-OCR Training Arabic text & numbers

2020-07-12 Thread Shree Devi Kumar
If I recall correctly, ara_number.traineddata was trained for the legacy engine. You cannot combine two traineddata files that each use a different engine. Training of Arabic numbers and punctuation is currently an open issue. If you use the latest code from tesstrain repo it should automa
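The engine constraint can be illustrated with a small Python sketch that only builds a command line and does not run it. It assumes the standard `-l` and `--oem` options of the tesseract CLI (0 for the legacy engine, 1 for LSTM); the function name `build_tesseract_command` and the file name `page.png` are hypothetical.

```python
def build_tesseract_command(image_path, lang="ara", oem=1):
    """Build an argv list for the tesseract CLI using a single engine mode.

    oem=1 selects the LSTM engine, oem=0 the legacy engine. Every language
    listed in `lang` must have a model for the chosen engine.
    """
    return ["tesseract", image_path, "stdout", "-l", lang, "--oem", str(oem)]

# LSTM-only run with the ara model; an ara_number file trained for the
# legacy engine would not be usable in the same call.
print(build_tesseract_command("page.png"))
```

The resulting list could be passed to `subprocess.run` once the right traineddata files are in place.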