Dear Zdenko and everyone,

Thank you for your help last time, and apologies for the late reply.

Using the language file you suggested, I was able to get the same results as you. However, that language model gave me less accurate OCR results than the language model in *tessdata_best*.

It may be troublesome, but is it possible to have tesseract use different models for the same language? For example: use the legacy model for OSD, and use the tessdata_best model for extracting text.

Please also forgive me: for data privacy reasons, I will have to delete the uploaded image from the post later.

Thank you for your time.

Best regards,
Hai

On Sunday, March 12, 2023 at 2:55:52 AM UTC+9 zdenop wrote:

> one more thing: I used a language file from
> https://github.com/tesseract-ocr/tessdata e.g. with legacy engine data.
>
> Zdenko
>
>
> On Sat, 11 Mar 2023 at 13:18, nguyen ngoc hai <nguyenng...@gmail.com> wrote:
>
>> Thank you very much for your help.
>> I will give it a try.
>>
>> Best regards
>> Hai
>>
>>
>> On Sat, Mar 11, 2023, 8:14 PM Zdenko Podobny <zde...@gmail.com> wrote:
>>
>>> the latest code (5.3.0) (on windows)
>>>
>>> Zdenko
>>>
>>>
>>> On Sat, 11 Mar 2023 at 2:16, nguyen ngoc hai <nguyenng...@gmail.com>
>>> wrote:
>>>
>>>> Dear Zdenko,
>>>>
>>>> Thank you very much for your suggestion.
>>>>
>>>> May I ask which version of tesseract you are using?
>>>> I ran the same command with tesseract v5.0.0, but I got a different
>>>> result.
>>>>
>>>> ```
>>>> >tesseract -v
>>>> tesseract v5.0.0-alpha.20210811
>>>> ...
>>>> Warning, detects only orientation with -l jpn
>>>> Page number: 0
>>>> Orientation in degrees: 270
>>>> Rotate: 90
>>>> Orientation confidence: 46.00
>>>> Script: Latin
>>>> Script confidence: 2.00
>>>> ```
>>>> Should I upgrade to the newest version of tesseract or try some extra
>>>> preprocessing methods before detecting text orientation?
>>>> Thank you for your time.
>>>> Best regards
>>>> Hai
>>>>
>>>>
>>>>
>>>> On Sat, Mar 11, 2023 at 5:34 AM Zdenko Podobny <zde...@gmail.com>
>>>> wrote:
>>>>
>>>>> script detection was always problematic and tesseract tries to
>>>>> identify only a few...
>>>>>
>>>>> Regarding rotation you can get better results by using the language
>>>>> file:
>>>>> >tesseract unnamed.jpg - --psm 0 -l jpn
>>>>> Warning, detects only orientation with -l jpn
>>>>> Estimating resolution as 262
>>>>> Warning. Invalid resolution 0 dpi. Using 70 instead.
>>>>> Page number: 0
>>>>> Orientation in degrees: 90
>>>>> Rotate: 270
>>>>> Orientation confidence: 6.44
>>>>> Script: Han
>>>>> Script confidence: 1.43
>>>>>
>>>>> Zdenko
>>>>>
>>>>>
>>>>> On Fri, 10 Mar 2023 at 18:21, nguyen ngoc hai <nguyenng...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I have the following image:
>>>>>>
>>>>>> [image: 17_Receipt Transform No resize.jpg]
>>>>>>
>>>>>> I used the following code to get the text orientation. It works for
>>>>>> most of my samples, but not for the above image.
>>>>>>
>>>>>> ```python
>>>>>> import cv2
>>>>>> import tesserocr
>>>>>>
>>>>>>
>>>>>> def get_orientation_confidence(cv2_img_data):
>>>>>>     # cv2pil (not shown) converts an OpenCV image to a PIL image
>>>>>>     image = cv2pil(cv2_img_data)
>>>>>>     osd_result = {}
>>>>>>
>>>>>>     with tesserocr.PyTessBaseAPI(lang='osd') as api:
>>>>>>         api.SetImage(image)
>>>>>>         api.SetSourceResolution(300)
>>>>>>
>>>>>>         osd_result = api.DetectOrientationScript()
>>>>>>
>>>>>>     return osd_result
>>>>>>
>>>>>> # (excerpt from the calling method)
>>>>>> # preprocess image before detecting orientation
>>>>>> gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
>>>>>> gray_white_border = self.make_border_white(gray)
>>>>>> self.show_image("gray_white_border", gray_white_border)
>>>>>>
>>>>>> # Threshold the image to convert it to black and white
>>>>>> threshold = cv2.threshold(gray_white_border, 0, 255,
>>>>>>                           cv2.THRESH_OTSU)[1]
>>>>>> self.show_image("threshold otsu", threshold)
>>>>>>
>>>>>> osd_ret = get_orientation_confidence(pre_roi_im)
>>>>>> print(osd_ret['orient_deg'])
>>>>>> ```
>>>>>> ```cmd
>>>>>> {'orient_deg': 180, 'orient_conf': 0.06795501708984375,
>>>>>> 'script_name': 'Arabic', 'script_conf': 0.0}
>>>>>> ```
>>>>>> Here, the orientation I got was not correct, and the script
>>>>>> detection was wrong as well.
>>>>>>
>>>>>> I hoped to get {'orient_deg': 90, 'script_name': 'Japanese', ...}
>>>>>> I suppose these values come straight from tesseract's output.
>>>>>>
>>>>>> Is it possible to get the correct orientation degree here?
>>>>>> Assuming that I already know the language, are there any methods
>>>>>> (such as applying extra image preprocessing, etc.) that can provide
>>>>>> better accuracy here?
>>>>>>
>>>>>> Thank you very much for your time.
>>>>>> I hope to hear any suggestions.
>>>>>>
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/e447f23e-a0e1-4a91-b6e1-0eca8511f7acn%40googlegroups.com
>>>>>>
>>>>
>>>> --
>>>> *Nguyen Ngoc Hai*
>>>>
>>>> *Phone: +81 1488 4168 (JP).*
>>>> *skype ID: nguyenngochaibkhn.*
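
To make the request at the top of this message more concrete, here is a minimal sketch of the two-pass workflow being asked about: one tesseract call for OSD that reads the legacy models, and a second call for text extraction that reads tessdata_best. It is only an illustration under stated assumptions, not a confirmed answer from the list. The two directory constants and the helper names (`parse_osd_output`, `build_command`, `run_two_pass`) are hypothetical; the only tesseract features used are the `--tessdata-dir`, `--psm` and `-l` command-line options and the `--psm 0` report format quoted earlier in this thread.

```python
import re
import subprocess

# Hypothetical locations; adjust to wherever the two model sets are stored.
LEGACY_TESSDATA = "/path/to/tessdata"      # files from tesseract-ocr/tessdata
BEST_TESSDATA = "/path/to/tessdata_best"   # files from tessdata_best

# One regex per field of the `--psm 0` report quoted in this thread.
_OSD_FIELDS = {
    "orient_deg": (r"Orientation in degrees:\s*(\d+)", int),
    "rotate": (r"Rotate:\s*(\d+)", int),
    "orient_conf": (r"Orientation confidence:\s*([\d.]+)", float),
    "script_name": (r"Script:\s*(\S+)", str),
    "script_conf": (r"Script confidence:\s*([\d.]+)", float),
}


def parse_osd_output(text):
    """Turn a `--psm 0` report into a dict; fields that are absent are skipped."""
    result = {}
    for key, (pattern, cast) in _OSD_FIELDS.items():
        match = re.search(pattern, text)
        if match:
            result[key] = cast(match.group(1))
    return result


def build_command(image_path, tessdata_dir, lang, psm=None):
    """Build a tesseract command line that loads models from `tessdata_dir`."""
    cmd = ["tesseract", image_path, "-", "--tessdata-dir", tessdata_dir, "-l", lang]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    return cmd


def run_two_pass(image_path, lang="jpn"):
    """Pass 1: OSD with the legacy models. Pass 2: OCR with tessdata_best."""
    osd = subprocess.run(
        build_command(image_path, LEGACY_TESSDATA, lang, psm=0),
        capture_output=True, encoding="utf-8", check=True,
    )
    # Fields are matched by regex, so searching both streams is harmless.
    osd_info = parse_osd_output(osd.stdout + osd.stderr)

    ocr = subprocess.run(
        build_command(image_path, BEST_TESSDATA, lang),
        capture_output=True, encoding="utf-8", check=True,
    )
    return osd_info, ocr.stdout
```

Calling `run_two_pass("unnamed.jpg")` would return the parsed OSD fields together with the recognized text. The sketch does not rotate the image between the two passes, so the reported `rotate` value would still have to be applied (for example with OpenCV) before the second call.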