Hi Alexey,

Thank you very much for trying out my sample. It is very informative to see how CRAFT manages to extract the text regions correctly. As far as I know, Tesseract has a very nice Python wrapper, tesserocr <https://github.com/sirfz/tesserocr/tree/9c8740dae227f60e5a3c2763783d52f19119172b>, which provides many easy-to-use methods for analyzing text in images with a range of PSM and RIL modes. Unfortunately, I was not able to find a method in that API that efficiently extracts all the text regions for samples with multiple background and text colors.
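For concreteness, the tesserocr pattern being referred to is roughly the following. This is only a minimal sketch built from the wrapper's documented GetComponentImages and SetRectangle calls; the file name and the Japanese language pack are placeholders, not something taken from this thread.

from PIL import Image
from tesserocr import PyTessBaseAPI, PSM, RIL

# "invoice.png" and lang="jpn" are placeholders; use any image and traineddata.
with PyTessBaseAPI(psm=PSM.AUTO_OSD, lang="jpn") as api:
    api.SetImage(Image.open("invoice.png"))

    # Ask the layout analysis for text-line components
    # (RIL.WORD is the other obvious granularity to try).
    boxes = api.GetComponentImages(RIL.TEXTLINE, True)

    for _, box, _, _ in boxes:
        # Restrict recognition to one detected region at a time.
        api.SetRectangle(box["x"], box["y"], box["w"], box["h"])
        print(box, api.GetUTF8Text().strip(), api.MeanTextConf())

This only finds regions where Tesseract's own layout analysis succeeds, which is exactly what breaks down on the multi-colored samples discussed below.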
The results you provided are actually very promising. I have not read your code carefully yet, but may I ask: after getting all the text regions, did you pass them to Tesseract one by one, or how did you obtain the CRAFT + crop result: ... (with accuracy 48.78)? As I noticed, some of the ruled lines on the sample can add noise to the results; I think that applying the line-removal method could improve them. I do not quite understand the technique of creating a map of boxes using .uzn files and passing it to Tesseract; can you explain it a bit further? And yes, you are right: not only was 金額 missing, all of the dark-background text regions are absent from the second result (such as 摘要 数 重 単位 単 価 金額, etc.). Apologies for the conversation getting longer, but your questions are still unanswered, and I am deeply interested in understanding them too.

Regards
Hai

On Friday, September 8, 2023 at 5:19:26 AM UTC+9 apism...@gmail.com wrote:

> Thanks for sharing, Hai.
> Looks like CRAFT can detect regions despite the background:
> https://github.com/apismensky/ocr_id/blob/main/images/boxes_craft/black_background_text_detection.png
> It also creates cuts for each text region, which can be OCR-ed separately and then joined together as a result.
> When I ran your example with https://github.com/apismensky/ocr_id/blob/main/ocr_id.py I got the following output:
>
> CRAFT + crop result: 下記 の と お り 、 御 請求 申し 上 げ ま す 。 株 式 会 社 ス キャ ナ 保 存 スト レー ジ プ ロ ジェ クト 件 名 T573-0011 2023/4/30 大 阪 市 北 区 大 深町 3-1 支払 期限 山口 銀行 本 店 普通 1111111 グラ ン フ ロン ト 大 阪 タ ワーB 振込 先 TEL : 06-6735-8055 担当 : ICS 太 郎 66,000 円 (税込 ) a 摘要 数 重 単位 単 価 金額 サン プル 1 1 式 32,000 32,000 サン プル 2 1 式 18000 18,000 2,000 2,000' 8g,000' 2,000
> crop word_accuracy: 48.78048780487805
>
> I've tried to create a map of boxes using .uzn files and pass it to tesseract, but the results are worse:
>
> CRAFT result: 下記 の と お り 、 御 請求 申し 上 げ ま す 。
> 株 式 会 社 ス キャ ナ 保 存
> スト レー ジ プ ロ ジェ クト
> 〒573-0011
> 2023/4/30
> 大 阪 市 北 区 大 深町 3-1
> 山口 銀行 本 店 普通 1111111
> グラ ン フ ロン ト 大 阪 タ ワーB
> TEL : 06-6735-8055
> 担当 : ICS 太 郎
> 66,000 円 (税込 )
> サン プル 1
> 1| 式
> 32,000
> 32,000
> サン プル 2
> 1| 式
> 18000
> 18,000
> 2,000
> 2,000.
> 8,000
> 8,000
>
> craft word_accuracy: 36.58536585365854
>
> Apparently 金額 is not there; sorry, my Japanese is a little bit rusty :-)
> I have the impression that when I pass the map of .uzn text regions to tesseract, it applies a single transformation to pre-process the whole image, but when I pass each individual image it preprocesses each one separately, applying the best strategy for each region? Of course it is slower this way.
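To make the .uzn approach mentioned above concrete: a zone file is a plain-text list of boxes that Tesseract reads from the image's directory, one zone per line. The sketch below is only an illustration; the zone coordinates and file names are invented, and the detail that the file is picked up with --psm 4 is a recollection of Tesseract's behaviour rather than something verified in this thread.

import subprocess

# Hypothetical zones (x, y, width, height), e.g. taken from CRAFT detections.
zones = [(40, 30, 210, 48), (40, 110, 420, 40)]

# Tesseract looks for <image_basename>.uzn next to the image;
# each line is "x y width height label".
with open("invoice.uzn", "w") as f:
    for x, y, w, h in zones:
        f.write(f"{x} {y} {w} {h} Text\n")

# To my knowledge the .uzn file is only honoured in page segmentation mode 4.
subprocess.run(["tesseract", "invoice.png", "out", "--psm", "4", "-l", "jpn"],
               check=True)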
> On Wednesday, September 6, 2023 at 7:07:52 PM UTC-6 nguyenng...@gmail.com wrote:
>
>> Hi Apismensky,
>>
>> Here are the code and the sample I used for preprocessing. I extracted the ticket region of the train ticket from a picture taken with a smartphone, so the angle, distance, brightness, and many other factors can change the picture quality. I would say scanned images, or images taken by a fixed-position camera, have more consistent quality.
>>
>> Here is the original image:
>>
>> [image: sample_to_remove_lines.png]
>>
>> import cv2
>> import numpy as np
>>
>> # Try to remove lines
>> org_image = cv2.imread("/content/sample_to_remove_lines.png")
>> cv2_show('org_image', org_image)
>> gray = cv2.cvtColor(org_image, cv2.COLOR_BGR2GRAY)
>>
>> thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
>> cv2_show('thresh Otsu', thresh)
>>
>> # Remove noise dots with a small morphological opening
>> opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, np.ones((2, 2), np.uint8))
>> cv2_show('opening', opening)
>>
>> thresh = opening.copy()
>> mask = np.zeros_like(org_image, dtype=np.uint8)
>>
>> # Extract horizontal lines
>> horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (60, 1))
>> remove_horizontal = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2)
>> cnts = cv2.findContours(remove_horizontal, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
>> cnts = cnts[0] if len(cnts) == 2 else cnts[1]
>> for c in cnts:
>>     cv2.drawContours(mask, [c], -1, (255, 255, 255), 8)
>> # cv2_show('mask extract horizontal lines', mask)
>>
>> # Extract vertical lines
>> vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 70))
>> remove_vertical = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=2)
>> cnts = cv2.findContours(remove_vertical, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
>> cnts = cnts[0] if len(cnts) == 2 else cnts[1]
>> for c in cnts:
>>     cv2.drawContours(mask, [c], -1, (255, 255, 255), 8)
>>
>> cv2_show('mask extract lines', mask)
>>
>> # Loop through the pixels of the original image and modify them based on the mask
>> result = org_image.copy()
>> for y in range(mask.shape[0]):
>>     for x in range(mask.shape[1]):
>>         if np.all(mask[y, x] == 255):  # if the pixel is white in the mask
>>             result[y, x] = [255, 255, 255]  # set the pixel to white
>>
>> cv2_show('result', result)
>>
>> gray = cv2.cvtColor(result, cv2.COLOR_BGR2GRAY)
>> _, simple_thresh = cv2.threshold(gray, 195, 255, cv2.THRESH_BINARY)
>> cv2_show('simple_thresh', simple_thresh)
>>
>> In the above code you can ignore the cv2_show function, since it is just my custom helper for showing images.
>> You can see that the idea is to remove some noise, remove the lines, and then apply a simple threshold.
>>
>> [image: extracted_lines.png]
>>
>> [image: removed_lines.png]
>>
>> [image: ready_for_locating_text_box.png]
>>
>> I would say that, from this point, Tesseract's AUTO_OSD page segmentation mode can also give the text boxes for the above picture; it also needs to be checked with the RIL mode (maybe RIL.WORD or RIL.TEXTLINE) to get the right level of text boxes.
>> In my opinion, the same preprocessing methods can only be applied to a certain group of samples; it is in fact very hard to cover all the cases. For example:
>>
>> [image: black_background.png]
>>
>> I found it difficult to locate the text boxes where the text is white and the background is dark, while black text on a white background is easy to locate and then OCR. I am not sure what a good method is for locating white text on dark background colors. I hope to hear your suggestions, as well as others', on this matter.
>>
>> Regards
>> Hai
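On the white-text-on-a-dark-background question above, one generic trick (not something taken from this thread's code) is to decide the polarity per region and invert the dark regions before thresholding or OCR. A rough sketch; the file name and the box coordinates are invented, and in practice the boxes could come from CRAFT, GetComponentImages, or findContours:

import cv2
import numpy as np

img = cv2.imread("black_background.png")    # placeholder file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

def normalize_polarity(crop):
    """Return the crop as dark text on a light background.

    If the median intensity is low, the region is assumed to be
    light text on a dark background and is inverted.
    """
    return 255 - crop if np.median(crop) < 127 else crop

boxes = [(30, 20, 200, 40)]                 # invented (x, y, w, h) values
for x, y, w, h in boxes:
    crop = normalize_polarity(gray[y:y + h, x:x + w])
    # crop is now dark-on-light and can be thresholded or OCR-ed like any other region
    cv2.imwrite(f"region_{x}_{y}.png", crop)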
>> On Wednesday, September 6, 2023 at 12:32:56 AM UTC+9 apism...@gmail.com wrote:
>>
>>> Hai, could you please tell me what you are doing for pre-processing?
>>> Do you have any source code you can share?
>>> Are those results consistently better for images scanned with different quality (resolution, angles, contrast, etc.)?
>>>
>>> On Monday, September 4, 2023 at 2:02:27 AM UTC-6 nguyenng...@gmail.com wrote:
>>>
>>>> Hi,
>>>> I would like to hear others' opinions on your questions too.
>>>> In my case, when I try using Tesseract on Japanese train tickets, I have to do a lot of preprocessing steps (removing background colors, noise and line removal, increasing contrast, etc.) to get satisfactory results.
>>>> I am sure that what you are doing (locating text boxes, extracting them, and feeding them one by one to Tesseract) can give better accuracy. However, when the number of text boxes increases, it will undoubtedly affect performance.
>>>> Could you share the PSM mode you use for getting those text boxes' locations?
>>>> I usually use AUTO_OSD to get the boxes and expand them a bit at the edges before passing them to Tesseract.
>>>>
>>>> Regards
>>>> Hai
>>>>
>>>> On Saturday, September 2, 2023 at 7:03:49 AM UTC+9 apism...@gmail.com wrote:
>>>>
>>>>> I'm looking into OCR for ID cards and driver's licenses, and I found out that tesseract performs relatively poorly on ID cards compared to other OCR solutions. For this original image:
>>>>> https://github.com/apismensky/ocr_id/blob/main/images/boxes_easy/AR.png
>>>>> the results are:
>>>>>
>>>>> tesseract: "4d DL 999 as = Ne allo) 2NICK © , q 12 RESTR oe } lick: 5 DD 8888888888 1234 SZ"
>>>>> easyocr: '''9 , ARKANSAS DRIVER'S LICENSE CLAss D 4d DLN 999999999 3 DOB 03/05/1960 ] 2 SCKPLE 123 NORTH STREET CITY AR 12345 ISS 4b EXP 03/05/2018 03/05/2026 15 SEX 16 HGT 18 EYES 5'-10" BRO 9a END NONE 12 RESTR NONE Ylck Sorble DD 8888888888 1234 THE'''
>>>>> google cloud vision: """SARKANSAS\nSAMPLE\nSTATE O\n9 CLASS D\n4d DLN 9999999993 DOB 03/05/1960\nNick Sample\nDRIVER'S LICENSE\n1 SAMPLE\n2 NICK\n8 123 NORTH STREET\nCITY, AR 12345\n4a ISS\n03/05/2018\n15 SEX 16 HGT\nM\n5'-10\"\nGREAT SE\n9a END NONE\n12 RESTR NONE\n5 DD 8888888888 1234\n4b EXP\n03/05/2026 MS60\n18 EYES\nBRO\nRKANSAS\n0"""
>>>>>
>>>>> and the word accuracy is:
>>>>>
>>>>>         tesseract | easyocr | google
>>>>> words   10.34%    | 68.97%  | 82.76%
>>>>>
>>>>> This is "out of the box" performance, without any preprocessing. I'm not surprised that google vision is that good compared to the others, but easyocr, which is another open-source solution, performs much better than tesseract in this case. I have a whole project dedicated to this, and all the other results are much better for easyocr as well: https://github.com/apismensky/ocr_id/blob/main/result.json; all input files are in https://github.com/apismensky/ocr_id/tree/main/images/sources
>>>>> After digging into it for a little bit, I suspect that bounding box detection is much better in google (https://github.com/apismensky/ocr_id/blob/main/images/boxes_google/AR.png) and easyocr (https://github.com/apismensky/ocr_id/blob/main/images/boxes_easy/AR.png) than in tesseract (https://github.com/apismensky/ocr_id/blob/main/images/boxes_tesseract/AR.png).
>>>>>
>>>>> I'm pretty sure about this, because when I manually cut out the text boxes and feed them to tesseract it works much better.
>>>>>
>>>>> Now the questions:
>>>>>
>>>>> - What part of the tesseract codebase is responsible for text detection, and which algorithm does it use?
>>>>> - What impacts bounding box detection in tesseract so that it fails on these types of images (complex layouts, background noise, etc.)?
>>>>> - Is it possible to use the same text detection procedure as easyocr, or to improve the existing one?
>>>>> - Would it be possible to switch the text detection algorithm based on the image type, or make it pluggable so the user can configure one of several options A, B, C...?
>>>>>
>>>>> Thanks.
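As an illustration of the "manually cut the text boxes and feed them to tesseract" idea that runs through this thread, here is a minimal sketch. It is not the code from ocr_id.py: the box coordinates are invented, pytesseract is just one convenient wrapper around the tesseract CLI, and in practice the boxes would come from a detector such as EasyOCR or CRAFT.

import cv2
import pytesseract

img = cv2.imread("AR.png")  # placeholder for a local copy of the sample image

# Boxes (x, y, w, h) as an external detector might report them; invented values.
boxes = [(35, 40, 320, 38), (35, 90, 280, 34)]

pieces = []
for x, y, w, h in boxes:
    pad = 4  # expand each box a little at the edges before cropping
    crop = img[max(y - pad, 0):y + h + pad, max(x - pad, 0):x + w + pad]
    crop = cv2.cvtColor(crop, cv2.COLOR_BGR2RGB)
    # --psm 7 treats the crop as a single line of text
    pieces.append(pytesseract.image_to_string(crop, config="--psm 7").strip())

# Join the per-box results into one string, as the CRAFT + crop run above does.
print(" ".join(pieces))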