I am not able to extract this. Is anyone able to extract it?

On Thursday, August 13, 2020 at 3:31:19 PM UTC+3 Mahmoud Mabrouk wrote:

> For numbers I used this, and it works fine with Arabic (Eastern) numerals: 
> https://github.com/ahmed-tea/tessdata_Arabic_Numbers
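To use a custom numbers model like that with pytesseract, the usual approach is to drop its .traineddata file into the tessdata directory and pass its name as `lang`, optionally with a digit whitelist. A minimal sketch; `ara_number` is a placeholder for whatever the repo's traineddata file is actually called, and note that `tessedit_char_whitelist` is only honored by the LSTM engine from Tesseract 4.1 onwards:

```python
# Sketch: compose the lang/config arguments for pytesseract.
# 'ara_number' is a placeholder model name -- use the real filename of
# the .traineddata you copied into your tessdata directory.

def build_ocr_args(model_name='ara_number', psm=7):
    """Return (lang, config) for pytesseract.image_to_string."""
    # --psm 7: treat the crop as a single text line (suits ID-card fields)
    whitelist = '٠١٢٣٤٥٦٧٨٩0123456789/'
    return model_name, f'--psm {psm} -c tessedit_char_whitelist={whitelist}'

lang, config = build_ocr_args()
# text = pytesseract.image_to_string(cropped_field, lang=lang, config=config)
```

Cropping the image down to the number field before calling Tesseract usually helps more than any config flag.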
>
>
> On Thursday, 13 August 2020 13:41:12 UTC+2, Anuradha B wrote:
>>
>> I am trying to extract the Arabic dates and numbers from a national ID 
>> card. I am using the following code in an Anaconda Jupyter notebook. I 
>> have also attached the image I used, together with the outputs of the 
>> grayscale, threshold, Canny, etc. functions, but the extracted text never 
>> includes the dates or numerals. (I have also installed the Tesseract 4.0 
>> alpha.) Please suggest a fix.
>>
>> import cv2
>> import numpy as np
>> import pytesseract
>> import matplotlib.pyplot as plt
>> from PIL import Image
>>
>> # point pytesseract at the installed Tesseract binary
>> pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
>>
>> # get grayscale image
>> def get_grayscale(image):
>>     return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
>>
>> # noise removal
>> def remove_noise(image):
>>     return cv2.medianBlur(image,5)
>>  
>> #thresholding
>> def thresholding(image):
>>     return cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
>>
>> #dilation
>> def dilate(image):
>>     kernel = np.ones((5,5),np.uint8)
>>     return cv2.dilate(image, kernel, iterations = 1)
>>     
>> #erosion
>> def erode(image):
>>     kernel = np.ones((5,5),np.uint8)
>>     return cv2.erode(image, kernel, iterations = 1)
>>
>> #opening - erosion followed by dilation
>> def opening(image):
>>     kernel = np.ones((5,5),np.uint8)
>>     return cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel)
>>
>> #canny edge detection
>> def canny(image):
>>     return cv2.Canny(image, 100, 200)
>>
>> #skew correction
>> def deskew(image):
>>     coords = np.column_stack(np.where(image > 0))
>>     angle = cv2.minAreaRect(coords)[-1]
>>     if angle < -45:
>>         angle = -(90 + angle)
>>     else:
>>         angle = -angle
>>     (h, w) = image.shape[:2]
>>     center = (w // 2, h // 2)
>>     M = cv2.getRotationMatrix2D(center, angle, 1.0)
>>     rotated = cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
>>     return rotated
>>
>> #template matching
>> def match_template(image, template):
>>     return cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED) 
>>
>> image = cv2.imread('image2.jpg')
>>
>> gray = get_grayscale(image)
>> thresh = thresholding(gray)
>> opened = opening(gray)  # renamed so the opening() function is not shadowed
>> edges = canny(gray)     # likewise for canny()
>>
>> for variant in (image, gray, thresh, opened, edges):
>>     text = pytesseract.image_to_string(variant, lang='eng+ara')
>>     print(text)
>>     print('----------------------------------------------------------------')
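A side note on the digits themselves: Arabic models often emit Eastern Arabic digit glyphs (٠١٢٣٤٥٦٧٨٩) rather than ASCII digits, so the numbers may be present in the output but easy to miss. A small stdlib-only post-processing step, independent of the OCR code above, maps them to 0-9:

```python
# Map Eastern Arabic digits (U+0660..U+0669) in OCR output to ASCII digits.
ARABIC_TO_ASCII = str.maketrans('٠١٢٣٤٥٦٧٨٩', '0123456789')

def normalize_digits(text):
    """Return text with Eastern Arabic numerals replaced by 0-9."""
    return text.translate(ARABIC_TO_ASCII)

print(normalize_digits('٢٥/١٢/١٩٩٠'))  # -> 25/12/1990
```

Running the OCR output through `normalize_digits` before any date parsing keeps the rest of the pipeline ASCII-only.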
>> On Sunday, 12 July, 2020 at 4:30:40 pm UTC+5:30 shree wrote:
>>
>>> What character are you trying to add?
>>> Please share the training data to try and replicate the issue.
>>>
>>>
>>> On Sun, Jul 12, 2020, 15:35 Eliyaz L <write2...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>> My use case is Arabic documents. The pretrained ara.traineddata is 
>>>> good but not perfect, so I wish to fine-tune ara.traineddata; if the 
>>>> results are still not satisfying, I will train on my own custom data.
>>>>
>>>>
>>>> Please advise on the following:
>>>>
>>>>    1. For my use case on Arabic text, one character is always 
>>>>    predicted wrongly. Do I need to add the document font (Traditional 
>>>>    Arabic) and train? If so, please provide the procedure, or a link, 
>>>>    for adding one font on top of the pretrained ara.traineddata.
>>>>    2. Whether fine-tuning or training from scratch, how many gt.txt 
>>>>    files do I need, and how many characters should each file contain? 
>>>>    Also, approximately how many iterations, if you know?
>>>>    3. For numbers, the prediction is totally wrong on Arabic 
>>>>    numerals. Do I need to start from scratch, or fine-tune? Either 
>>>>    way, how should I prepare the datasets?
>>>>    4. How do I decide max_iterations? Is there a ratio of dataset 
>>>>    size to iterations?
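On questions 1 and 2: for fixing a model that is already mostly good, the usual tesstrain route is fine-tuning from the existing `ara` model via `START_MODEL` rather than training from scratch. A hedged sketch of the `make` invocation, composed here as an argument list; the model name, tessdata path, and iteration count are placeholders, and the exact variables may differ between tesstrain versions, so check the tesstrain README:

```python
# Sketch only: compose the tesstrain `make training` command line for
# fine-tuning from the pretrained Arabic model. All values below are
# placeholders for illustration.
def tesstrain_cmd(model_name='ara_finetuned', start_model='ara',
                  tessdata='/usr/share/tessdata', max_iterations=20000):
    """Return the argv list you would run from the tesstrain checkout."""
    return ['make', 'training',
            f'MODEL_NAME={model_name}',
            f'START_MODEL={start_model}',
            f'TESSDATA={tessdata}',
            f'MAX_ITERATIONS={max_iterations}']

print(' '.join(tesstrain_cmd()))
```

The key point is `START_MODEL=ara`: it initializes training from the pretrained weights, so a comparatively small dataset focused on the problem character can be enough.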
>>>>
>>>>
>>>> *Below are my **trials**:*
>>>>
>>>>
>>>> *For Arabic Numbers:*
>>>>
>>>>
>>>> -> I tried to custom-train only Arabic numbers.
>>>> -> I wrote a script to write 100,000 numbers across multiple gt.txt 
>>>> files, with a few hundred characters in each gt.txt file.
>>>> -> Then another script converts the text to images (text2image), so 
>>>> that they look more like scanned images.
>>>> -> The parameters were used in the order below:
>>>>
>>>> text2image --text test.gt.txt --outputbase /home/user/output 
>>>> --fonts_dir /usr/share/fonts/truetype/msttcorefonts/ --font 'Arial' 
>>>> --degrade_image false --rotate_image --exposure 2 --resolution 300
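The number-generation step described above might look roughly like this. A stdlib-only sketch: the output directory and file-naming scheme are made up for illustration, and text2image still has to be run on each resulting gt.txt afterwards:

```python
import os
import random

# Eastern Arabic digit glyphs used to build the ground-truth lines.
ARABIC_DIGITS = '٠١٢٣٤٥٦٧٨٩'

def write_gt_files(out_dir, n_files=10, numbers_per_file=10, seed=0):
    """Write gt.txt files, each holding one line of random Arabic numbers."""
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    os.makedirs(out_dir, exist_ok=True)
    for i in range(n_files):
        numbers = [''.join(rng.choice(ARABIC_DIGITS)
                           for _ in range(rng.randint(1, 6)))
                   for _ in range(numbers_per_file)]
        path = os.path.join(out_dir, f'num_{i:05d}.gt.txt')
        with open(path, 'w', encoding='utf-8') as f:
            f.write(' '.join(numbers) + '\n')

write_gt_files('gt_out', n_files=3)
```

Seeding the generator also answers the duplicates question in part: with random multi-digit numbers, exact duplicate lines are rare, and deduplicating the finished files with a set is cheap.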
>>>>
>>>>    1. How much data do I need to prepare for Arabic numbers? For now 
>>>>    it is required only for 2 specific fonts, which I already have.
>>>>    2. Will the dataset contain duplicates if I follow this procedure? 
>>>>    If yes, is there any way to avoid that?
>>>>    3. Is it better to create more gt.txt files with fewer characters 
>>>>    in each (e.g. 50,000 gt files with 10 numbers per file), or fewer 
>>>>    gt.txt files with more characters (e.g. 1,000 gt files with 500 
>>>>    numbers per file)?
>>>>
>>>> If possible, please guide me through the dataset preparation procedure.
>>>>
>>>> For testing I tried 50,000 English numbers, with each number in its 
>>>> own gt.txt file (e.g. the data "2500" written to 2500.gt.txt), for 
>>>> 20,000 iterations, but it failed.
>>>>
>>>>
>>>> *For Arabic Text:*
>>>>
>>>>
>>>> -> prepared around 23k gt.txt files, each containing one sentence
>>>>
>>>> -> generated .box files and small .tif files for all gt.txt files 
>>>> using one font (Traditional Arabic)
>>>>
>>>> -> used the tesstrain git repo and trained for 20,000 iterations
>>>>
>>>> -> after training, generated foo.traineddata with a 0.03 error rate
>>>>
>>>> -> ran prediction on the real data; it works perfectly on the 
>>>> particular character where the pretrained ara.traineddata fails, but 
>>>> in overall accuracy the pretrained ara.traineddata performs better, 
>>>> except for that one character.
>>>>
>>>>
>>>>
>>>> *Summary:*
>>>>
>>>>
>>>>
>>>>    - How can I fix one character in the pretrained model 
>>>>    (ara.traineddata)? If that is not possible, how do I custom-train 
>>>>    from scratch, or is there a way to annotate real images and 
>>>>    prepare a dataset? Please suggest the best practice.
>>>>    - How do I prepare an Arabic number dataset and train on it? If 
>>>>    custom training on numbers is not possible, can Arabic numerals be 
>>>>    added to the pretrained model (ara.traineddata)?
>>>>
>>>>  
>>>>
>>>> GitHub link used for custom training Arabic text and numbers: 
>>>> https://github.com/tesseract-ocr/tesstrain
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to tesseract-oc...@googlegroups.com.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/09cff705-838f-4ccb-b6e9-06326fea1cdbo%40googlegroups.com.
>>>>
>>>
