Hello, thanks a lot for the information, but how can I use it in Flutter? Please reply to my question.

Sorosh Shiwa
On Tue, Oct 27, 2020 at 2:36 PM write2...@gmail.com <write2eli...@gmail.com> wrote:
> Not able to extract this. Can anyone extract this?
>
> On Thursday, August 13, 2020 at 3:31:19 PM UTC+3 Mahmoud Mabrouk wrote:
>
>> For numbers I used this, and it works fine with Arabic (AEN) numbers:
>> https://github.com/ahmed-tea/tessdata_Arabic_Numbers
>>
>> On Thursday, 13 August 2020 13:41:12 UTC+2, Anuradha B wrote:
>>>
>>> I am trying to extract the Arabic dates and numbers from a national ID
>>> card, using the following code in an Anaconda Jupyter notebook. I have
>>> also attached the image I used and the outputs of the grayscale,
>>> threshold, canny, etc. steps, but none of the extracted text contains
>>> the dates and numerals. (I have also installed the Tesseract 4.0 alpha.)
>>> Please suggest.
>>>
>>> import cv2
>>> import numpy as np
>>> import pytesseract
>>>
>>> pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
>>>
>>> # get grayscale image
>>> def get_grayscale(image):
>>>     return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
>>>
>>> # noise removal
>>> def remove_noise(image):
>>>     return cv2.medianBlur(image, 5)
>>>
>>> # thresholding (Otsu)
>>> def thresholding(image):
>>>     return cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
>>>
>>> # dilation
>>> def dilate(image):
>>>     kernel = np.ones((5, 5), np.uint8)
>>>     return cv2.dilate(image, kernel, iterations=1)
>>>
>>> # erosion
>>> def erode(image):
>>>     kernel = np.ones((5, 5), np.uint8)
>>>     return cv2.erode(image, kernel, iterations=1)
>>>
>>> # opening - erosion followed by dilation
>>> def opening(image):
>>>     kernel = np.ones((5, 5), np.uint8)
>>>     return cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel)
>>>
>>> # canny edge detection
>>> def canny(image):
>>>     return cv2.Canny(image, 100, 200)
>>>
>>> # skew correction
>>> def deskew(image):
>>>     coords = np.column_stack(np.where(image > 0))
>>>     angle = cv2.minAreaRect(coords)[-1]
>>>     if angle < -45:
>>>         angle = -(90 + angle)
>>>     else:
>>>         angle = -angle
>>>     (h, w) = image.shape[:2]
>>>     center = (w // 2, h // 2)
>>>     M = cv2.getRotationMatrix2D(center, angle, 1.0)
>>>     rotated = cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_CUBIC,
>>>                              borderMode=cv2.BORDER_REPLICATE)
>>>     return rotated
>>>
>>> # template matching
>>> def match_template(image, template):
>>>     return cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)
>>>
>>> image = cv2.imread('image2.jpg')
>>>
>>> gray = get_grayscale(image)
>>> thresh = thresholding(gray)
>>> opened = opening(gray)   # renamed so the opening() function is not shadowed
>>> edges = canny(gray)      # renamed so the canny() function is not shadowed
>>>
>>> for variant in (image, gray, thresh, opened, edges):
>>>     print(pytesseract.image_to_string(variant, lang='eng+ara'))
>>>     print('-' * 64)
>>>
>>> On Sunday, 12 July, 2020 at 4:30:40 pm UTC+5:30 shree wrote:
>>>
>>>> What character are you trying to add?
>>>> Please share the training data to try and replicate the issue.
>>>>
>>>> On Sun, Jul 12, 2020, 15:35 Eliyaz L <write2...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> My use case is Arabic documents. The pretrained ara.traineddata is
>>>>> good but not perfect, so I wish to fine-tune ara.traineddata; if the
>>>>> results are still not satisfying, I will train on my own custom data.
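[Editor's note on the preprocessing-and-OCR code quoted above: rather than eyeballing image_to_string output for each preprocessing variant, pytesseract's image_to_data returns per-word bounding boxes and confidences, which shows whether the dates and digits are detected at all or merely misrecognized. A minimal sketch; the --psm 6 setting and the helper function are my additions, not from the thread.]

```python
import os

def words_with_conf(data, min_conf=0):
    """Pair non-empty words with their confidences from the dict that
    pytesseract.image_to_data(..., output_type=Output.DICT) returns.
    Entries with conf == -1 (layout boxes, not words) are dropped by the
    default min_conf of 0."""
    return [(t, float(c)) for t, c in zip(data['text'], data['conf'])
            if t.strip() and float(c) >= min_conf]

# Demo on the image from the thread (runs only if the file is present).
if os.path.exists('image2.jpg'):
    import cv2
    import pytesseract
    from pytesseract import Output

    gray = cv2.cvtColor(cv2.imread('image2.jpg'), cv2.COLOR_BGR2GRAY)
    # --psm 6 ("assume a single uniform block of text") is worth trying
    # on ID-card crops, where automatic page segmentation often fails.
    data = pytesseract.image_to_data(gray, lang='eng+ara', config='--psm 6',
                                     output_type=Output.DICT)
    for word, conf in words_with_conf(data):
        print(conf, word)
```

Low confidences on the date fields would point at the traineddata rather than at the preprocessing.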
>>>>>
>>>>> Please suggest the following:
>>>>>
>>>>> 1. For my Arabic-text use case, the problem is one character that is
>>>>>    always predicted wrong. Do I need to add the document font
>>>>>    (Traditional Arabic) and train? If so, please provide the procedure
>>>>>    or a link for adding one font when fine-tuning ara.traineddata.
>>>>> 2. Whether fine-tuning or training from scratch, how many gt.txt files
>>>>>    do I need, and how many characters should be in each file? And
>>>>>    approximately how many iterations, if you know?
>>>>> 3. For numbers, the prediction is totally wrong on Arabic numbers. Do
>>>>>    I need to start from scratch or fine-tune? Either way, how do I
>>>>>    prepare datasets for this?
>>>>> 4. How do I decide max_iterations? Is there a ratio of dataset size
>>>>>    to iterations?
>>>>>
>>>>> *Below are my trials:*
>>>>>
>>>>> *For Arabic numbers:*
>>>>>
>>>>> -> I tried to custom-train only Arabic numbers.
>>>>> -> I wrote a script to write 100,000 numbers into multiple gt.txt
>>>>>    files, with hundreds of characters in each gt.txt file.
>>>>> -> Then a script to convert text to images (text2image), which should
>>>>>    look more like a scanned image.
>>>>> -> Parameters used, in this order:
>>>>>
>>>>> text2image --text test.gt.txt --outputbase /home/user/output
>>>>>   --fonts_dir /usr/share/fonts/truetype/msttcorefonts/ --font 'Arial'
>>>>>   --degrade_image false --rotate_image --exposure 2 --resolution 300
>>>>>
>>>>> 1. How large a dataset do I need to prepare for Arabic numbers? For
>>>>>    now it is required only for 2 specific fonts, which I already have.
>>>>> 2. Will the dataset contain duplicates if I follow this procedure,
>>>>>    and if so, is there a way to avoid them?
>>>>> 3. Is it better to create more gt.txt files with fewer characters
>>>>>    each (e.g. 50,000 gt files with 10 numbers in each) or fewer
>>>>>    gt.txt files with more characters (e.g. 1,000 gt files with 500
>>>>>    numbers in each)?
>>>>>
>>>>> If possible, please guide me through the dataset-preparation
>>>>> procedure.
>>>>>
>>>>> For testing, I tried 50,000 English numbers, with each number in its
>>>>> own gt.txt file (e.g. the text "2500" in 2500.gt.txt), with 20,000
>>>>> iterations, but it fails.
>>>>>
>>>>> *For Arabic text:*
>>>>>
>>>>> -> Prepared around 23k gt.txt files, each containing one sentence.
>>>>> -> Generated .box files and small .tif files for all gt.txt files
>>>>>    using one font (Traditional Arabic).
>>>>> -> Used the tesstrain repo and trained for 20,000 iterations.
>>>>> -> After training, generated foo.traineddata with a 0.03 error rate.
>>>>> -> Ran prediction on the real data. It works perfectly for the
>>>>>    particular character on which the pretrained ara.traineddata
>>>>>    fails, but in overall accuracy the pretrained ara.traineddata
>>>>>    performs better, except for that one character.
>>>>>
>>>>> *Summary:*
>>>>>
>>>>> - How can I fix one character in the pretrained (ara.traineddata)
>>>>>   model? If that is not possible, how do I custom-train from scratch,
>>>>>   or is there a way to annotate real images and prepare a dataset?
>>>>>   Please suggest the best practice.
>>>>> - How do I prepare an Arabic number dataset and train on it? If
>>>>>   custom training on numbers is not possible, can Arabic numbers be
>>>>>   added to the pretrained model (ara.traineddata)?
>>>>>
>>>>> GitHub repo used for custom training Arabic text and numbers:
>>>>> https://github.com/tesseract-ocr/tesstrain
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>> send an email to tesseract-oc...@googlegroups.com.
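[Editor's note: the "script to write 100,000 numbers into multiple gt.txt files" mentioned in the trials above is not shown in the thread. A minimal sketch of such a generator follows, under the assumptions that the target glyphs are Arabic-Indic digits (U+0660..U+0669) and that each .gt.txt file holds a single line of ground truth, as tesstrain expects; the file naming and digit range are placeholders.]

```python
import random
from pathlib import Path

# Western digits -> Arabic-Indic digits (U+0660..U+0669).
ARABIC_DIGITS = str.maketrans('0123456789', '٠١٢٣٤٥٦٧٨٩')

def write_number_gt(out_dir, files=100, per_file=1, seed=0):
    """Write `files` ground-truth files, each a single line containing
    `per_file` random numbers in Arabic-Indic digits. A fixed seed keeps
    the dataset reproducible; per_file lets you compare the "many small
    files" vs. "fewer larger files" layouts asked about in the thread."""
    rng = random.Random(seed)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i in range(files):
        numbers = [str(rng.randint(0, 99_999_999)) for _ in range(per_file)]
        line = ' '.join(numbers).translate(ARABIC_DIGITS)
        (out / f'num_{i:06d}.gt.txt').write_text(line + '\n', encoding='utf-8')
    return files
```

Each resulting .gt.txt file can then be rendered with text2image for each target font.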
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/09cff705-838f-4ccb-b6e9-06326fea1cdbo%40googlegroups.com
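
[Editor's note: the text2image invocation quoted in the thread renders one gt.txt file at a time. Batching it over a directory of ground-truth files and a list of fonts can be done with a small driver script; the sketch below uses exactly the flags from the thread, while the font list and directories are placeholders.]

```python
import subprocess
from pathlib import Path

FONTS_DIR = '/usr/share/fonts/truetype/msttcorefonts/'  # placeholder path
FONTS = ['Traditional Arabic']                          # placeholder fonts

def render_gt_files(gt_dir, out_dir, dry_run=False):
    """Run text2image once per (gt.txt, font) pair, mirroring the flags
    used in the thread. Returns the commands that were (or, with
    dry_run=True, would have been) executed."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    commands = []
    for gt in sorted(Path(gt_dir).glob('*.gt.txt')):
        base = gt.name[:-len('.gt.txt')]
        for font in FONTS:
            cmd = ['text2image',
                   '--text', str(gt),
                   '--outputbase', str(out / f'{base}.{font.replace(" ", "_")}'),
                   '--fonts_dir', FONTS_DIR,
                   '--font', font,
                   '--degrade_image', 'false',
                   '--rotate_image',
                   '--exposure', '2',
                   '--resolution', '300']
            commands.append(cmd)
            if not dry_run:
                subprocess.run(cmd, check=True)
    return commands
```

Encoding the font name into the outputbase keeps the .tif/.box pairs for different fonts from overwriting each other.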