Hi Alexey,

Thank you very much for your detailed explanation, and sorry for my late reply. I got pulled into other matters over the last few days.
I was not aware of the .uzn file usage in Tesseract before; thank you for pointing it out. In my previous project I did apply preprocessing to each block image (since some blocks have background noise or low image quality). However, doing that is really not a good approach for large images with a great number of text boxes. I used Python multiprocessing to speed it up a little: with that, depending on the number of CPU cores, we can process multiple block images in parallel.

On my sample above, you got almost 100% correct results from the text boxes. I will try to apply some preprocessing methods to see whether the results can be improved further, and I will let you know right after that. Meanwhile, I still hope to hear updates on your questions.

Regards
Hai

On Saturday, September 9, 2023 at 12:27:53 AM UTC+9 apism...@gmail.com wrote:

> Hai, sorry I missed a lot of details in my last message, so I will try to clarify.
> Disclaimer: I'm not a computer vision guru, nor an ML or data science guy, just a regular software development background.
>
> - API to extract efficiently all the text regions for some multi-background and text color samples
> I don't think that tesseract out of the box has decent text region detection. That is what I'm trying to figure out in my post. The Tesseract folks have not responded to it yet; IDK if any of them are in this mailing group. It looks like there are better options out there (CRAFT is just one of them: https://arxiv.org/abs/1904.01941); IDK why they cannot be integrated into tesseract.
>
> - did you pass them one by one to Tesseract
> Yes. When CRAFT is executed in https://github.com/apismensky/ocr_id/blob/main/ocr_id.py#L233 it creates a bunch of crop files; for your ticket example they are in https://github.com/apismensky/ocr_id/tree/main/images/boxes_craft/ticket_crops. It also creates a text region map in https://github.com/apismensky/ocr_id/blob/main/images/boxes_craft/ticket_text_detection.txt. Most of the regions are rectangles (8 numbers in one row: x1,y1,...,x4,y4) but some may be polygons (as you can see in the other files).
> Then I sort all the files by crop_NUMBER in https://github.com/apismensky/ocr_id/blob/main/ocr_id.py#L265C6-L265C6 so that they are ordered by their appearance in the original image. Then I loop through all of them in https://github.com/apismensky/ocr_id/blob/main/ocr_id.py#L267 and feed each image to tesseract in https://github.com/apismensky/ocr_id/blob/main/ocr_id.py#L270. Notice that I'm using psm=7 there, because we already know that each image is a box with a single text line, and then I join the results together in crop_result = ' '.join(res).
> Also notice that I'm not doing any pre-processing; I wonder what the result would be with some preprocessing for each image. Hopefully better?
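> To make that loop concrete, here is a minimal sketch (not the actual ocr_id.py code; the crop file naming and the script/Japanese model are assumptions). Since each crop is independent, it also folds in the multi-process idea mentioned at the top of this thread:
>
> import glob
> import re
> from multiprocessing import Pool
>
> import pytesseract
> from PIL import Image
>
> def ocr_crop(path):
>     # psm 7: treat the crop as a single line of text
>     return pytesseract.image_to_string(
>         Image.open(path), lang='script/Japanese', config='--psm 7').strip()
>
> if __name__ == '__main__':
>     # assumed naming: crop_0.png, crop_1.png, ... as produced by CRAFT;
>     # sort numerically so crops keep their order in the original image
>     crops = sorted(glob.glob('ticket_crops/crop_*.png'),
>                    key=lambda p: int(re.search(r'crop_(\d+)', p).group(1)))
>     with Pool() as pool:  # one worker per CPU core by default
>         res = pool.map(ocr_crop, crops)
>     crop_result = ' '.join(res)
>     print(crop_result)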
> I have tried another approach: passing a map of text regions detected by CRAFT to tesseract, so that it does not try to do its own text detection. The motivation was to reduce the number of calls to tesseract (one per crop), i.e. to reduce the time. That's what .uzn files are for.
> So for your example it would be something like:
> tesseract ticket.png - --psm 4 -l script/Japanese
> Notice that https://github.com/apismensky/ocr_id/blob/main/images/sources/ticket.uzn is in the same folder as the original image, and it has the same name as the image file (minus the extension). There is a little function that converts CRAFT text boxes to tesseract .uzn files: https://github.com/apismensky/ocr_id/blob/main/ocr_id.py#L175
> The problem is that you cannot really use pytesseract.image_to_data here. I assume this is because of a filename mismatch: image_to_data (most probably) creates a temp file in the filesystem whose name does not match the .uzn file name. So I did it by calling subprocess.check_output(command, shell=True, text=True) in https://github.com/apismensky/ocr_id/blob/main/ocr_id.py#L101C18-L101C73 to manually run tesseract as an external process. As I mentioned in my last message, this approach did not give me any output for the regions with inverted colors (white letters on a black background).
> Hopefully that makes sense; let me know if you have further questions.
>
> BTW, I was looking for some more or less substantial information about the architecture of tesseract, at least at the level of main components, pipeline, algorithms, etc., but could not find it. If you (or anyone) are aware of such a resource, please let me know.
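> For reference, the .uzn conversion mentioned above could look roughly like this (a sketch of the idea, not the actual function at ocr_id.py#L175; the "left top width height label" line format is my understanding of .uzn):
>
> import subprocess
>
> def craft_boxes_to_uzn(boxes, uzn_path):
>     # each CRAFT box is 8 numbers: x1,y1,x2,y2,x3,y3,x4,y4 (corner points);
>     # a .uzn zone is its axis-aligned bounding box
>     with open(uzn_path, 'w') as f:
>         for b in boxes:
>             xs, ys = b[0::2], b[1::2]
>             left, top = min(xs), min(ys)
>             f.write(f'{int(left)} {int(top)} '
>                     f'{int(max(xs) - left)} {int(max(ys) - top)} Text\n')
>
> # the .uzn file must sit next to the image with the same base name
> # (ticket.uzn for ticket.png); tesseract is then run as an external
> # process, since pytesseract's temp file name would not match the .uzn:
> out = subprocess.check_output(
>     'tesseract ticket.png - --psm 4 -l script/Japanese', shell=True, text=True)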
On Friday, September 8, 2023 at 6:18:42 AM UTC-6 nguyenng...@gmail.com wrote:

>> Hi Alexey,
>>
>> Thank you very much for trying out my sample. It is very informative to see how CRAFT can correctly extract the text regions. As far as I know, Tesseract has a very nice Python wrapper, tesserocr <https://github.com/sirfz/tesserocr/tree/9c8740dae227f60e5a3c2763783d52f19119172b>, which provides many easy-to-use methods to analyze the text in images with a range of PSM and RIL modes. Unfortunately, however, I was not able to find a good method in that API to efficiently extract all the text regions for samples with multiple background and text colors.
>>
>> The results you provided are actually very promising. I have not read your code carefully yet, but may I ask: after getting all the text regions, did you pass them one by one to Tesseract? How did you get the "CRAFT + crop result: ..." (with accuracy 48.78)?
>>
>> As I noticed, some lines on the sample can add noise to the results. I think that applying a line removal method could make the results better. I do not quite understand the technique of creating a map of boxes using .uzn files and passing it to Tesseract; can you explain a bit further? And yes, you are right: not only was 金額 missing, but all of the dark-background text regions are absent from the second result (such as 摘要 数 重 単位 単 価 金額, etc.).
>>
>> Apologies that the conversation is becoming longer, but your original questions are yet to be answered. I am deeply interested in understanding them too.
>>
>> Regards
>> Hai.
>>
>> On Friday, September 8, 2023 at 5:19:26 AM UTC+9 apism...@gmail.com wrote:
>>
>>> Thanks for sharing, Hai.
>>> It looks like CRAFT can detect regions regardless of the background: https://github.com/apismensky/ocr_id/blob/main/images/boxes_craft/black_background_text_detection.png
>>> It also creates a cut for each text region, which can be OCR-ed separately and then joined together as a result.
>>> When I ran your example with https://github.com/apismensky/ocr_id/blob/main/ocr_id.py I got the following output:
>>>
>>> CRAFT + crop result: 下記 の と お り 、 御 請求 申し 上 げ ま す 。 株 式 会 社 ス キャ ナ 保 存 スト レー ジ プ ロ ジェ クト 件 名 T573-0011 2023/4/30 大 阪 市 北 区 大 深町 3-1 支払 期限 山口 銀行 本 店 普通 1111111 グラ ン フ ロン ト 大 阪 タ ワーB 振込 先 TEL : 06-6735-8055 担当 : ICS 太 郎 66,000 円 (税込 ) a 摘要 数 重 単位 単 価 金額 サン プル 1 1 式 32,000 32,000 サン プル 2 1 式 18000 18,000 2,000 2,000' 8g,000' 2,000
>>> crop word_accuracy: 48.78048780487805
>>>
>>> I've tried to create a map of boxes using .uzn files and pass it to tesseract, but the results are worse:
>>>
>>> CRAFT result: 下記 の と お り 、 御 請求 申し 上 げ ま す 。
>>> 株 式 会 社 ス キャ ナ 保 存
>>> スト レー ジ プ ロ ジェ クト
>>> 〒573-0011
>>> 2023/4/30
>>> 大 阪 市 北 区 大 深町 3-1
>>> 山口 銀行 本 店 普通 1111111
>>> グラ ン フ ロン ト 大 阪 タ ワーB
>>> TEL : 06-6735-8055
>>> 担当 : ICS 太 郎
>>> 66,000 円 (税込 )
>>> サン プル 1
>>> 1| 式
>>> 32,000
>>> 32,000
>>> サン プル 2
>>> 1| 式
>>> 18000
>>> 18,000
>>> 2,000
>>> 2,000.
>>> 8,000
>>> 8,000
>>> craft word_accuracy: 36.58536585365854
>>>
>>> Apparently 金額 is not there. Sorry, my Japanese is a little bit rusty :-)
>>> I have the impression that when I pass the map of .uzn text regions to tesseract, it applies one transformation to pre-process the whole image, but when I pass each individual image, it pre-processes them separately, applying the best strategy for each region? Of course it is slower this way.
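>>> (For clarity: "word_accuracy" above is along the lines of "what fraction of the expected words appear in the OCR output". A sketch of such a metric, not necessarily the exact one used in ocr_id.py:)
>>>
>>> def word_accuracy(expected: str, actual: str) -> float:
>>>     # percentage of expected words that show up in the OCR output
>>>     expected_words = expected.split()
>>>     actual_words = set(actual.split())
>>>     hits = sum(1 for w in expected_words if w in actual_words)
>>>     return 100.0 * hits / len(expected_words) if expected_words else 0.0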
>>> On Wednesday, September 6, 2023 at 7:07:52 PM UTC-6 nguyenng...@gmail.com wrote:
>>>
>>>> Hi Apismensky,
>>>>
>>>> Here are the code and sample I used for preprocessing. I extracted the ticket region of the train ticket from a picture taken by a smartphone, since the angle, distance, brightness, and many other factors can change the picture quality. I would say scanned images or images taken by a fixed-position camera have more consistent quality.
>>>>
>>>> Here is the original image:
>>>>
>>>> [image: sample_to_remove_lines.png]
>>>>
>>>> import cv2
>>>> import numpy as np
>>>>
>>>> # Try to remove lines
>>>> org_image = cv2.imread("/content/sample_to_remove_lines.png")
>>>> cv2_show('org_image', org_image)
>>>> gray = cv2.cvtColor(org_image, cv2.COLOR_BGR2GRAY)
>>>>
>>>> # Otsu binarization (inverted: text and lines become white)
>>>> thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
>>>> cv2_show('thresh Otsu', thresh)
>>>>
>>>> # Remove noise dots with a small morphological opening
>>>> opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, np.ones((2, 2), np.uint8))
>>>> cv2_show('opening', opening)
>>>>
>>>> thresh = opening.copy()
>>>> mask = np.zeros_like(org_image, dtype=np.uint8)
>>>>
>>>> # Extract horizontal lines
>>>> horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (60, 1))
>>>> remove_horizontal = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2)
>>>> cnts = cv2.findContours(remove_horizontal, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
>>>> cnts = cnts[0] if len(cnts) == 2 else cnts[1]
>>>> for c in cnts:
>>>>     cv2.drawContours(mask, [c], -1, (255, 255, 255), 8)
>>>> # cv2_show('mask extract horizontal lines', mask)
>>>>
>>>> # Extract vertical lines
>>>> vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 70))
>>>> remove_vertical = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=2)
>>>> cnts = cv2.findContours(remove_vertical, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
>>>> cnts = cnts[0] if len(cnts) == 2 else cnts[1]
>>>> for c in cnts:
>>>>     cv2.drawContours(mask, [c], -1, (255, 255, 255), 8)
>>>>
>>>> cv2_show('mask extract lines', mask)
>>>>
>>>> # Paint the masked (line) pixels white in the original image
>>>> # (this loop could be vectorized: result[np.all(mask == 255, axis=2)] = 255)
>>>> result = org_image.copy()
>>>> for y in range(mask.shape[0]):
>>>>     for x in range(mask.shape[1]):
>>>>         if np.all(mask[y, x] == 255):  # pixel is white in the mask
>>>>             result[y, x] = [255, 255, 255]  # set it to white
>>>>
>>>> cv2_show("result", result)
>>>>
>>>> gray = cv2.cvtColor(result, cv2.COLOR_BGR2GRAY)
>>>> _, simple_thresh = cv2.threshold(gray, 195, 255, cv2.THRESH_BINARY)
>>>> cv2_show('simple_thresh', simple_thresh)
>>>>
>>>> In the above code you can ignore the cv2_show function, since it is just my custom method for showing images. You can see that the idea is to remove some noise, remove the lines, and then apply a simple threshold.
>>>>
>>>> [image: extracted_lines.png]
>>>>
>>>> [image: removed_lines.png]
>>>>
>>>> [image: ready_for_locating_text_box.png]
>>>>
>>>> I would say that, from this point, the AUTO_OSD PSM mode of Tesseract can also give the text boxes for the above picture; one also needs to check the RIL mode (maybe RIL.WORD or RIL.TEXTLINE) to get the right level of text boxes.
>>>> In my opinion, the same preprocessing methods can only be applied to a certain group of samples; it is in fact very hard to cover all the cases. For example:
>>>>
>>>> [image: black_background.png]
>>>>
>>>> I find it difficult to locate the text boxes where the text is white and the background is dark. Black text on a white background is easy to locate and then OCR, but I am not sure what would be a good method to locate that white text on dark background colors. I hope to hear your suggestions, as well as others', on this matter.
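>>>> One direction I am considering (just a rough sketch, untested on this sample): binarize each candidate region with Otsu and invert any crop whose background comes out dark, so everything is black-on-white before it reaches Tesseract:
>>>>
>>>> import cv2
>>>> import numpy as np
>>>>
>>>> def normalize_polarity(crop_bgr):
>>>>     """Return a binarized crop with dark text on a white background."""
>>>>     gray = cv2.cvtColor(crop_bgr, cv2.COLOR_BGR2GRAY)
>>>>     _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
>>>>     # if most pixels are black, the background is dark, so invert
>>>>     if np.mean(binary) < 127:
>>>>         binary = cv2.bitwise_not(binary)
>>>>     return binary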
>>>> Regards
>>>> Hai
>>>>
>>>> On Wednesday, September 6, 2023 at 12:32:56 AM UTC+9 apism...@gmail.com wrote:
>>>>
>>>>> Hai, could you please tell me what you are doing for pre-processing? Do you have any source code you can share?
>>>>> Are those results consistently better for images scanned at different quality levels (resolution, angle, contrast, etc.)?
>>>>>
>>>>> On Monday, September 4, 2023 at 2:02:27 AM UTC-6 nguyenng...@gmail.com wrote:
>>>>>
>>>>>> Hi,
>>>>>> I would like to hear others' opinions on your questions too.
>>>>>> In my case, when I tried using Tesseract on Japanese train tickets, I had to do a lot of preprocessing steps (removing background colors, noise and line removal, increasing contrast, etc.) to get satisfactory results.
>>>>>> I am sure what you are doing (locating text boxes, extracting them, and feeding them one by one to tesseract) can get more accurate results. However, as the number of text boxes increases, it will undoubtedly hurt performance.
>>>>>> Could you share the PSM mode you use for getting those text boxes' locations? I usually use AUTO_OSD to get the boxes and expand them a bit at the edges before passing them to Tesseract.
>>>>>>
>>>>>> Regards
>>>>>> Hai
>>>>>>
>>>>>> On Saturday, September 2, 2023 at 7:03:49 AM UTC+9 apism...@gmail.com wrote:
>>>>>>
>>>>>>> I'm looking into OCR for ID cards and driver's licenses, and I found that tesseract performs relatively poorly on ID cards compared to other OCR solutions. For this original image: https://github.com/apismensky/ocr_id/blob/main/images/boxes_easy/AR.png the results are:
>>>>>>>
>>>>>>> tesseract: "4d DL 999 as = Ne allo) 2NICK © , q 12 RESTR oe } lick: 5 DD 8888888888 1234 SZ"
>>>>>>> easyocr: '''9 , ARKANSAS DRIVER'S LICENSE CLAss D 4d DLN 999999999 3 DOB 03/05/1960 ] 2 SCKPLE 123 NORTH STREET CITY AR 12345 ISS 4b EXP 03/05/2018 03/05/2026 15 SEX 16 HGT 18 EYES 5'-10" BRO 9a END NONE 12 RESTR NONE Ylck Sorble DD 8888888888 1234 THE'''
>>>>>>> google cloud vision: """SARKANSAS\nSAMPLE\nSTATE O\n9 CLASS D\n4d DLN 9999999993 DOB 03/05/1960\nNick Sample\nDRIVER'S LICENSE\n1 SAMPLE\n2 NICK\n8 123 NORTH STREET\nCITY, AR 12345\n4a ISS\n03/05/2018\n15 SEX 16 HGT\nM\n5'-10\"\nGREAT SE\n9a END NONE\n12 RESTR NONE\n5 DD 8888888888 1234\n4b EXP\n03/05/2026 MS60\n18 EYES\nBRO\nRKANSAS\n0"""
>>>>>>>
>>>>>>> and word accuracy is:
>>>>>>>
>>>>>>>         tesseract | easyocr | google
>>>>>>> words      10.34% |  68.97% | 82.76%
>>>>>>>
>>>>>>> This is "out of the box" performance, without any preprocessing. I'm not surprised that google vision is that much better than the others, but easyocr, which is another open-source solution, performs much better than tesseract in this case. I have a whole project dedicated to this, and all the other results are also much better for easyocr: https://github.com/apismensky/ocr_id/blob/main/result.json. All input files are in https://github.com/apismensky/ocr_id/tree/main/images/sources
>>>>>>> After digging into it for a little bit, I suspect that bounding box detection is much better in google (https://github.com/apismensky/ocr_id/blob/main/images/boxes_google/AR.png) and easyocr (https://github.com/apismensky/ocr_id/blob/main/images/boxes_easy/AR.png) than in tesseract (https://github.com/apismensky/ocr_id/blob/main/images/boxes_tesseract/AR.png).
>>>>>>> I'm pretty sure about this, because when I manually cut out the text boxes and feed them to tesseract it works much better.
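>>>>>>> (For anyone who wants to reproduce the tesseract box image: the detected boxes can be dumped and drawn with something like this. A minimal sketch, assuming pytesseract and OpenCV; not the exact code from the repo, and the output file name is made up:)
>>>>>>>
>>>>>>> import cv2
>>>>>>> import pytesseract
>>>>>>> from pytesseract import Output
>>>>>>>
>>>>>>> img = cv2.imread('images/sources/AR.png')
>>>>>>> data = pytesseract.image_to_data(img, output_type=Output.DICT)
>>>>>>> for i, text in enumerate(data['text']):
>>>>>>>     if text.strip():  # skip empty detections
>>>>>>>         x, y, w, h = (data[k][i] for k in ('left', 'top', 'width', 'height'))
>>>>>>>         cv2.rectangle(img, (x, y), (x + w, y + h), (0, 0, 255), 2)
>>>>>>> cv2.imwrite('boxes_tesseract_AR.png', img)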
>>>>>>> Now, the questions:
>>>>>>>
>>>>>>> - What part of the tesseract codebase is responsible for text detection, and which algorithm does it use?
>>>>>>> - What impacts bounding box detection in tesseract so that it fails on these types of images (complex layouts, background noise, etc.)?
>>>>>>> - Is it possible to use the same text detection procedure as easyocr, or to improve the existing one?
>>>>>>> - Maybe it is possible to switch the text detection algorithm based on the image type, or to make it pluggable so that the user can configure one of several options A,B,C...
>>>>>>>
>>>>>>> Thanks.