Hello, thanks a lot for the information, but how can I use it in Flutter? Please reply to my question.

Sorosh Shiwa
On Tue, Oct 27, 2020 at 2:36 PM write2...@gmail.com <write2eli...@gmail.com> wrote:
> Not able to extract this. Can anyone extract this?
>
> On Thursday, August 13, 2020 at 3:31:19 PM UTC+3 Mahmoud Mabrouk wrote:
>
>> For numbers I used this, and it works fine with Arabic (AEN) numbers:
>> https://github.com/ahmed-tea/tessdata_Arabic_Numbers
>>
>> On Thursday, 13 August 2020 13:41:12 UTC+2, Anuradha B wrote:
>>>
>>> I am trying to extract the Arabic dates and numbers from a national ID
>>> card, using the following code in an Anaconda Jupyter notebook. I have
>>> also attached the image I used and the outputs of the grayscale,
>>> threshold, canny, etc. steps, but none of the extracted text contains
>>> the dates and numerals. (I have also installed the Tesseract 4.0 alpha.)
>>> Please suggest.
>>>
>>> import cv2
>>> import numpy as np
>>> import pytesseract
>>>
>>> pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
>>>
>>> # get grayscale image
>>> def get_grayscale(image):
>>>     return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
>>>
>>> # noise removal
>>> def remove_noise(image):
>>>     return cv2.medianBlur(image, 5)
>>>
>>> # thresholding (Otsu)
>>> def thresholding(image):
>>>     return cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
>>>
>>> # dilation
>>> def dilate(image):
>>>     kernel = np.ones((5, 5), np.uint8)
>>>     return cv2.dilate(image, kernel, iterations=1)
>>>
>>> # erosion
>>> def erode(image):
>>>     kernel = np.ones((5, 5), np.uint8)
>>>     return cv2.erode(image, kernel, iterations=1)
>>>
>>> # opening - erosion followed by dilation
>>> def opening(image):
>>>     kernel = np.ones((5, 5), np.uint8)
>>>     return cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel)
>>>
>>> # canny edge detection
>>> def canny(image):
>>>     return cv2.Canny(image, 100, 200)
>>>
>>> # skew correction
>>> def deskew(image):
>>>     coords = np.column_stack(np.where(image > 0))
>>>     angle = cv2.minAreaRect(coords)[-1]
>>>     if angle < -45:
>>>         angle = -(90 + angle)
>>>     else:
>>>         angle = -angle
>>>     (h, w) = image.shape[:2]
>>>     center = (w // 2, h // 2)
>>>     M = cv2.getRotationMatrix2D(center, angle, 1.0)
>>>     rotated = cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_CUBIC,
>>>                              borderMode=cv2.BORDER_REPLICATE)
>>>     return rotated
>>>
>>> # template matching
>>> def match_template(image, template):
>>>     return cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)
>>>
>>> image = cv2.imread('image2.jpg')
>>>
>>> gray = get_grayscale(image)
>>> thresh = thresholding(gray)
>>> opened = opening(gray)   # renamed so the opening() function is not shadowed
>>> edges = canny(gray)      # renamed so the canny() function is not shadowed
>>>
>>> for variant in (image, gray, thresh, opened, edges):
>>>     print(pytesseract.image_to_string(variant, lang='eng+ara'))
>>>     print('-' * 64)
>>>
>>> On Sunday, 12 July, 2020 at 4:30:40 pm UTC+5:30 shree wrote:
>>>
>>>> What character are you trying to add?
>>>> Please share the training data to try and replicate the issue.
>>>>
>>>> On Sun, Jul 12, 2020, 15:35 Eliyaz L <write2...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> My use case is Arabic documents. The pretrained ara.traineddata is
>>>>> good but not perfect, so I wish to fine-tune ara.traineddata; if the
>>>>> results are still not satisfying, I will train on my own custom data.
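[Editor's note on the preprocessing-and-OCR code quoted above: rather than eyeballing image_to_string output for each preprocessing variant, pytesseract's image_to_data returns per-word bounding boxes and confidences, which shows whether the dates and digits are detected at all or merely misrecognized. A minimal sketch; the --psm 6 setting and the helper function are my additions, not from the thread.]

```python
import os

def words_with_conf(data, min_conf=0):
    """Pair non-empty words with their confidences from the dict that
    pytesseract.image_to_data(..., output_type=Output.DICT) returns.
    Entries with conf == -1 (layout boxes, not words) are dropped by the
    default min_conf of 0."""
    return [(t, float(c)) for t, c in zip(data['text'], data['conf'])
            if t.strip() and float(c) >= min_conf]

# Demo on the image from the thread (runs only if the file is present).
if os.path.exists('image2.jpg'):
    import cv2
    import pytesseract
    from pytesseract import Output

    gray = cv2.cvtColor(cv2.imread('image2.jpg'), cv2.COLOR_BGR2GRAY)
    # --psm 6 ("assume a single uniform block of text") is worth trying
    # on ID-card crops, where automatic page segmentation often fails.
    data = pytesseract.image_to_data(gray, lang='eng+ara', config='--psm 6',
                                     output_type=Output.DICT)
    for word, conf in words_with_conf(data):
        print(conf, word)
```

Low confidences on the date fields would point at the traineddata rather than at the preprocessing.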
>>>>>
>>>>> Please suggest the following:
>>>>>
>>>>> 1. For my Arabic-text use case, the problem is one character that is
>>>>>    always predicted wrong. Do I need to add the document font
>>>>>    (Traditional Arabic) and train? If so, please provide the procedure
>>>>>    or a link for adding one font when fine-tuning ara.traineddata.
>>>>> 2. Whether fine-tuning or training from scratch, how many gt.txt files
>>>>>    do I need, and how many characters should be in each file? And
>>>>>    approximately how many iterations, if you know?
>>>>> 3. For numbers, the prediction is totally wrong on Arabic numbers. Do
>>>>>    I need to start from scratch or fine-tune? Either way, how do I
>>>>>    prepare datasets for this?
>>>>> 4. How do I decide max_iterations? Is there a ratio of dataset size
>>>>>    to iterations?
>>>>>
>>>>> *Below are my trials:*
>>>>>
>>>>> *For Arabic numbers:*
>>>>>
>>>>> -> I tried to custom-train only Arabic numbers.
>>>>> -> I wrote a script to write 100,000 numbers into multiple gt.txt
>>>>>    files, with hundreds of characters in each gt.txt file.
>>>>> -> Then a script to convert text to images (text2image), which should
>>>>>    look more like a scanned image.
>>>>> -> Parameters used, in this order:
>>>>>
>>>>> text2image --text test.gt.txt --outputbase /home/user/output
>>>>>   --fonts_dir /usr/share/fonts/truetype/msttcorefonts/ --font 'Arial'
>>>>>   --degrade_image false --rotate_image --exposure 2 --resolution 300
>>>>>
>>>>> 1. How large a dataset do I need to prepare for Arabic numbers? For
>>>>>    now it is required only for 2 specific fonts, which I already have.
>>>>> 2. Will the dataset contain duplicates if I follow this procedure,
>>>>>    and if so, is there a way to avoid them?
>>>>> 3. Is it better to create more gt.txt files with fewer characters
>>>>>    each (e.g. 50,000 gt files with 10 numbers in each) or fewer
>>>>>    gt.txt files with more characters (e.g. 1,000 gt files with 500
>>>>>    numbers in each)?
>>>>>
>>>>> If possible, please guide me through the dataset-preparation
>>>>> procedure.
>>>>>
>>>>> For testing, I tried 50,000 English numbers, with each number in its
>>>>> own gt.txt file (e.g. the text "2500" in 2500.gt.txt), with 20,000
>>>>> iterations, but it fails.
>>>>>
>>>>> *For Arabic text:*
>>>>>
>>>>> -> Prepared around 23k gt.txt files, each containing one sentence.
>>>>> -> Generated .box files and small .tif files for all gt.txt files
>>>>>    using one font (Traditional Arabic).
>>>>> -> Used the tesstrain repo and trained for 20,000 iterations.
>>>>> -> After training, generated foo.traineddata with a 0.03 error rate.
>>>>> -> Ran prediction on the real data. It works perfectly for the
>>>>>    particular character on which the pretrained ara.traineddata
>>>>>    fails, but in overall accuracy the pretrained ara.traineddata
>>>>>    performs better, except for that one character.
>>>>>
>>>>> *Summary:*
>>>>>
>>>>> - How can I fix one character in the pretrained (ara.traineddata)
>>>>>   model? If that is not possible, how do I custom-train from scratch,
>>>>>   or is there a way to annotate real images and prepare a dataset?
>>>>>   Please suggest the best practice.
>>>>> - How do I prepare an Arabic number dataset and train on it? If
>>>>>   custom training on numbers is not possible, can Arabic numbers be
>>>>>   added to the pretrained model (ara.traineddata)?
>>>>>
>>>>> GitHub repo used for custom training Arabic text and numbers:
>>>>> https://github.com/tesseract-ocr/tesstrain
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>> send an email to tesseract-oc...@googlegroups.com.
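[Editor's note: the "script to write 100,000 numbers into multiple gt.txt files" mentioned in the trials above is not shown in the thread. A minimal sketch of such a generator follows, under the assumptions that the target glyphs are Arabic-Indic digits (U+0660..U+0669) and that each .gt.txt file holds a single line of ground truth, as tesstrain expects; the file naming and digit range are placeholders.]

```python
import random
from pathlib import Path

# Western digits -> Arabic-Indic digits (U+0660..U+0669).
ARABIC_DIGITS = str.maketrans('0123456789', '٠١٢٣٤٥٦٧٨٩')

def write_number_gt(out_dir, files=100, per_file=1, seed=0):
    """Write `files` ground-truth files, each a single line containing
    `per_file` random numbers in Arabic-Indic digits. A fixed seed keeps
    the dataset reproducible; per_file lets you compare the "many small
    files" vs. "fewer larger files" layouts asked about in the thread."""
    rng = random.Random(seed)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i in range(files):
        numbers = [str(rng.randint(0, 99_999_999)) for _ in range(per_file)]
        line = ' '.join(numbers).translate(ARABIC_DIGITS)
        (out / f'num_{i:06d}.gt.txt').write_text(line + '\n', encoding='utf-8')
    return files
```

Each resulting .gt.txt file can then be rendered with text2image for each target font.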
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/09cff705-838f-4ccb-b6e9-06326fea1cdbo%40googlegroups.com
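
[Editor's note: the text2image invocation quoted in the thread renders one gt.txt file at a time. Batching it over a directory of ground-truth files and a list of fonts can be done with a small driver script; the sketch below uses exactly the flags from the thread, while the font list and directories are placeholders.]

```python
import subprocess
from pathlib import Path

FONTS_DIR = '/usr/share/fonts/truetype/msttcorefonts/'  # placeholder path
FONTS = ['Traditional Arabic']                          # placeholder fonts

def render_gt_files(gt_dir, out_dir, dry_run=False):
    """Run text2image once per (gt.txt, font) pair, mirroring the flags
    used in the thread. Returns the commands that were (or, with
    dry_run=True, would have been) executed."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    commands = []
    for gt in sorted(Path(gt_dir).glob('*.gt.txt')):
        base = gt.name[:-len('.gt.txt')]
        for font in FONTS:
            cmd = ['text2image',
                   '--text', str(gt),
                   '--outputbase', str(out / f'{base}.{font.replace(" ", "_")}'),
                   '--fonts_dir', FONTS_DIR,
                   '--font', font,
                   '--degrade_image', 'false',
                   '--rotate_image',
                   '--exposure', '2',
                   '--resolution', '300']
            commands.append(cmd)
            if not dry_run:
                subprocess.run(cmd, check=True)
    return commands
```

Encoding the font name into the outputbase keeps the .tif/.box pairs for different fonts from overwriting each other.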