I am not able to extract this. Can anyone extract it?

On Thursday, August 13, 2020 at 3:31:19 PM UTC+3 Mahmoud Mabrouk wrote:
> For numbers I used this, and it works fine with AEN numbers:
> https://github.com/ahmed-tea/tessdata_Arabic_Numbers
>
> On Thursday, 13 August 2020 13:41:12 UTC+2, Anuradha B wrote:
>>
>> I am trying to extract the Arabic dates and numbers from a national ID
>> card. I am using the following code in an Anaconda Jupyter Notebook. I
>> have also attached the image I used and the outputs from the grayscale,
>> threshold, Canny, etc. functions, but none of the extracted text contains
>> the dates and numerals. (I have also installed Tesseract 4.0 alpha.)
>> Please suggest.
>>
>> import cv2
>> import numpy as np
>> import pytesseract
>> from matplotlib import pyplot as plt
>> from PIL import Image
>>
>> pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
>>
>> # get grayscale image
>> def get_grayscale(image):
>>     return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
>>
>> # noise removal
>> def remove_noise(image):
>>     return cv2.medianBlur(image, 5)
>>
>> # thresholding (Otsu)
>> def thresholding(image):
>>     return cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
>>
>> # dilation
>> def dilate(image):
>>     kernel = np.ones((5, 5), np.uint8)
>>     return cv2.dilate(image, kernel, iterations=1)
>>
>> # erosion
>> def erode(image):
>>     kernel = np.ones((5, 5), np.uint8)
>>     return cv2.erode(image, kernel, iterations=1)
>>
>> # opening - erosion followed by dilation
>> def opening(image):
>>     kernel = np.ones((5, 5), np.uint8)
>>     return cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel)
>>
>> # Canny edge detection
>> def canny(image):
>>     return cv2.Canny(image, 100, 200)
>>
>> # skew correction
>> def deskew(image):
>>     coords = np.column_stack(np.where(image > 0))
>>     angle = cv2.minAreaRect(coords)[-1]
>>     if angle < -45:
>>         angle = -(90 + angle)
>>     else:
>>         angle = -angle
>>     (h, w) = image.shape[:2]
>>     center = (w // 2, h // 2)
>>     M = cv2.getRotationMatrix2D(center, angle, 1.0)
>>     rotated = cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_CUBIC,
>>                              borderMode=cv2.BORDER_REPLICATE)
>>     return rotated
>>
>> # template matching
>> def match_template(image, template):
>>     return cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)
>>
>> image = cv2.imread('image2.jpg')
>>
>> gray = get_grayscale(image)
>> thresh = thresholding(gray)
>> opened = opening(gray)      # don't shadow the opening() function
>> edges = canny(gray)         # don't shadow the canny() function
>>
>> for variant in (image, gray, thresh, opened, edges):
>>     text = pytesseract.image_to_string(variant, lang='eng+ara')
>>     print(text)
>>     print('----------------------------------------------------------------')
>>
>> On Sunday, 12 July, 2020 at 4:30:40 pm UTC+5:30 shree wrote:
>>
>>> What character are you trying to add?
>>> Please share the training data so we can try to replicate the issue.
>>>
>>> On Sun, Jul 12, 2020, 15:35 Eliyaz L <write2...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> My use case is Arabic documents. The pre-trained ara.traineddata is
>>>> good but not perfect, so I wish to fine-tune ara.traineddata; if the
>>>> results are still not satisfying, I will train my own custom data.
>>>>
>>>> Please advise on the following:
>>>>
>>>> 1. For my use case in Arabic text, the problem is one character that
>>>> is always predicted wrong. Do I need to add the document font
>>>> (Traditional Arabic) and train? If so, please provide the procedure,
>>>> or a link, for adding one font when fine-tuning ara.traineddata.
>>>> 2. Whether fine-tuning or training from scratch, how many gt.txt
>>>> files do I need, and how many characters should each file contain?
>>>> Roughly how many iterations, if you know?
>>>> 3. For numbers, the prediction is totally wrong on Arabic numerals.
>>>> Do I need to start from scratch, or fine-tune? Either way, how do I
>>>> prepare the datasets?
>>>> 4. How do I decide max_iterations? Is there a ratio between dataset
>>>> size and iteration count?
>>>>
>>>> *Below are my trials:*
>>>>
>>>> *For Arabic numbers:*
>>>>
>>>> -> I tried to custom-train only Arabic numerals.
>>>> -> I wrote a script that writes 100,000 numbers into multiple gt.txt
>>>> files, with hundreds of characters in each gt.txt file.
>>>> -> Then another script converts the text to images (text2image),
>>>> which should look more like scanned images.
>>>> -> Parameters used, in this order:
>>>>
>>>> text2image --text test.gt.txt --outputbase /home/user/output
>>>> --fonts_dir /usr/share/fonts/truetype/msttcorefonts/ --font 'Arial'
>>>> --degrade_image false --rotate_image --exposure 2 --resolution 300
>>>>
>>>> 1. How large a dataset do I need to prepare for Arabic numerals? For
>>>> now I only need the 2 specific fonts, which I already have.
>>>> 2. Will the dataset contain duplicates if I follow this procedure?
>>>> If yes, is there a way to avoid them?
>>>> 3. Is it better to create more gt.txt files with fewer characters
>>>> each (e.g. 50,000 gt files with 10 numerals in each file) or fewer
>>>> gt.txt files with more characters (e.g. 1,000 gt files with 500
>>>> numerals in each file)?
>>>>
>>>> If possible, please walk me through the dataset-preparation procedure.
>>>>
>>>> As a test I tried 50,000 English numbers, with each number in its own
>>>> gt.txt file (e.g. the data "2500" written to 2500.gt.txt), for 20,000
>>>> iterations, but it failed.
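[Editor's note] The "script to write numbers into multiple gt.txt files" step above might be sketched as below. The directory layout, file naming, and the `make_gt_files` helper are assumptions for illustration; the only convention taken from tesstrain is the one-line-per-`.gt.txt` ground-truth format.

```python
import random
from pathlib import Path

# Map Western digits 0-9 to Arabic-Indic digits U+0660..U+0669.
ARABIC_INDIC = str.maketrans("0123456789", "٠١٢٣٤٥٦٧٨٩")

def make_gt_files(out_dir, n_files=1000, digits_per_line=12, seed=0):
    """Write n_files ground-truth files, each holding one line of numerals."""
    rng = random.Random(seed)  # fixed seed makes the dataset reproducible
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    seen = set()  # avoid emitting duplicate lines (question 2 above)
    i = 0
    while i < n_files:
        western = "".join(rng.choice("0123456789") for _ in range(digits_per_line))
        if western in seen:
            continue
        seen.add(western)
        line = western.translate(ARABIC_INDIC)
        (out / f"num_{i:05d}.gt.txt").write_text(line + "\n", encoding="utf-8")
        i += 1

make_gt_files("ground-truth", n_files=5)
```

Each resulting `.gt.txt` could then be fed to text2image as in the command above. The `seen` set addresses the duplicate question directly: with purely random draws, repeated lines become likely as the file count grows.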
>>>>
>>>> *For Arabic text:*
>>>>
>>>> -> Prepared around 23k gt.txt files, each containing one sentence.
>>>> -> Generated .box files and small .tif files for all gt.txt files,
>>>> using one font (Traditional Arabic).
>>>> -> Used the tesstrain git repo and trained for 20,000 iterations.
>>>> -> After training, generated foo.traineddata with a 0.03 error rate.
>>>> -> Ran prediction on the real data: it works perfectly for the
>>>> particular character on which the pre-trained ara.traineddata fails,
>>>> but in overall accuracy the pre-trained ara.traineddata performs
>>>> better, except for that one character.
>>>>
>>>> *Summary:*
>>>>
>>>> - How can I fix one character in the pre-trained ara.traineddata
>>>> model? If that is not possible, how do I custom-train from scratch?
>>>> Or is there a way to annotate real images and prepare a dataset?
>>>> Please suggest the best practice.
>>>> - How do I prepare an Arabic number dataset and train on it? If
>>>> custom training on numbers is not possible, can Arabic numerals be
>>>> added to the pre-trained ara.traineddata model?
>>>>
>>>> GitHub repo used for custom training Arabic text and numbers:
>>>> https://github.com/tesseract-ocr/tesstrain
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to tesseract-oc...@googlegroups.com.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/09cff705-838f-4ccb-b6e9-06326fea1cdbo%40googlegroups.com.
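[Editor's note] On the summary question of fixing one character without losing the pre-trained model's overall accuracy, tesstrain supports fine-tuning from an existing model via its `START_MODEL` Makefile variable. A sketch, assuming a tesstrain checkout with the Tesseract training tools installed; the model name, tessdata path, and iteration count are placeholder assumptions:

```shell
# Fine-tune from the existing ara model instead of training from scratch.
# ara.traineddata (the "best", float variant) must be present in TESSDATA.
# MODEL_NAME, TESSDATA, GROUND_TRUTH_DIR and MAX_ITERATIONS are placeholders.
make training \
    MODEL_NAME=ara_ft \
    START_MODEL=ara \
    TESSDATA=/usr/local/share/tessdata \
    GROUND_TRUTH_DIR=data/ara_ft-ground-truth \
    MAX_ITERATIONS=10000
```

Starting from `ara` keeps the pre-trained weights and nudges them toward the problem character, which matches the observation above that the from-scratch foo.traineddata fixed one character but lost overall accuracy.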