Hai, sorry I missed a lot of details in my last message, so I will try to 
clarify.
Disclaimer: I'm not a computer vision guru, nor an ML or data science guy - 
just a regular software development background.   
- API to extract efficiently all the text regions for some multi-background 
and text colors samples
I don't think that tesseract out of the box has decent text region 
detection. That is what I'm trying to figure out in my post. The tesseract 
folks have not responded to it yet; IDK if any of them are in this mail 
group. It looks like there are better options out there (CRAFT is just one 
of them: https://arxiv.org/abs/1904.01941); IDK why they cannot be 
integrated into tesseract.
- did you pass them one by one to Tesseract - yes. When CRAFT is executed 
in https://github.com/apismensky/ocr_id/blob/main/ocr_id.py#L233 it creates 
a bunch of crop files; for your ticket example they are in 
https://github.com/apismensky/ocr_id/tree/main/images/boxes_craft/ticket_crops. 
It also creates a text region map in 
https://github.com/apismensky/ocr_id/blob/main/images/boxes_craft/ticket_text_detection.txt. 
Most of the entries are rectangles (8 numbers in one row: x1,y1,...,x4,y4) 
but some may be polygons (as you can see in the other files). 
Then I sort all the crop files by crop_NUMBER in 
https://github.com/apismensky/ocr_id/blob/main/ocr_id.py#L265C6-L265C6 so 
that they are ordered by their appearance in the original image. 
Then I loop through all of them in 
https://github.com/apismensky/ocr_id/blob/main/ocr_id.py#L267 and feed each 
image to tesseract in 
https://github.com/apismensky/ocr_id/blob/main/ocr_id.py#L270. Notice that 
I'm using psm=7 there, because we already know that each image is a box 
with a single text line, and then I join the results together with 
crop_result = ' '.join(res). 
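Roughly, that crop-and-OCR loop looks like the sketch below. This is only a 
simplified illustration based on the description above, not the exact code 
from ocr_id.py (the crop folder path and the crop_NUMBER file naming are 
assumptions):

import glob
import os
import re

import pytesseract
from PIL import Image

# Assumed location of the CRAFT crops (see the ticket_crops link above).
CROP_DIR = "images/boxes_craft/ticket_crops"

def crop_number(path):
    # Sort key: the NUMBER in "crop_NUMBER.png", so the crops keep their
    # order of appearance in the original image.
    m = re.search(r"crop_(\d+)", os.path.basename(path))
    return int(m.group(1)) if m else 0

res = []
for crop_path in sorted(glob.glob(os.path.join(CROP_DIR, "*.png")),
                        key=crop_number):
    # psm 7: treat each crop as a single text line, so tesseract skips its
    # own layout analysis for the crop.
    text = pytesseract.image_to_string(
        Image.open(crop_path), lang="script/Japanese", config="--psm 7"
    ).strip()
    res.append(text)

crop_result = ' '.join(res)
print(crop_result)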
Also notice that I'm not doing any pre-processing on the crops; I wonder 
what the result would be with some preprocessing for each image - hopefully 
better? 
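For example, something simple like Otsu binarization plus auto-inversion 
per crop might also help with the white-on-dark regions. This is just a 
sketch of the idea, it is not in ocr_id.py:

import cv2
import numpy as np

def preprocess_crop(path):
    # Grayscale + Otsu binarization; if the result is mostly black, assume
    # white-on-dark text and invert so tesseract sees dark-on-light text.
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    if np.mean(binary) < 127:
        binary = cv2.bitwise_not(binary)
    return binary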
I have tried another approach - passing a map of the text regions detected 
by CRAFT to tesseract, so that it will not try to do its own text 
detection. 
The motivation was to reduce the number of calls to tesseract (one per 
crop) and so reduce the total time. 
That's what .uzn files are for. 
So for your example it will be something like: 
tesseract ticket.png - --psm 4 -l script/Japanese
Notice that 
https://github.com/apismensky/ocr_id/blob/main/images/sources/ticket.uzn 
is in the same folder as the original image, and it has the same name as 
the image file (minus the extension).
There is a little function that converts CRAFT text boxes to tesseract 
.uzn files: https://github.com/apismensky/ocr_id/blob/main/ocr_id.py#L175
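Conceptually the conversion is something like the sketch below. This is 
not the actual function at L175; I'm assuming a comma-separated CRAFT 
detection file and the usual "left top width height label" layout of .uzn 
lines, so double-check against the real code:

def craft_boxes_to_uzn(detection_txt, uzn_path):
    # Each CRAFT line is x1,y1,x2,y2,x3,y3,x4,y4 (a possibly rotated quad).
    # Collapse it to an axis-aligned box and write one zone per line.
    zones = []
    with open(detection_txt) as f:
        for raw in f:
            nums = [int(float(v)) for v in raw.strip().split(',') if v]
            if len(nums) < 8:
                continue
            xs, ys = nums[0::2], nums[1::2]
            left, top = min(xs), min(ys)
            width, height = max(xs) - left, max(ys) - top
            zones.append(f"{left} {top} {width} {height} Text")
    with open(uzn_path, 'w') as f:
        f.write('\n'.join(zones) + '\n')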
The problem is that you cannot really use pytesseract.image_to_data for 
this; I assume it is because of a filename mismatch: image_to_data (most 
probably) creates a temp file in the filesystem whose name does not match 
the .uzn file name. 
So I did it by calling subprocess.check_output(command, shell=True, 
text=True) in 
https://github.com/apismensky/ocr_id/blob/main/ocr_id.py#L101C18-L101C73 
to manually run tesseract as an external process. As I mentioned in my last 
message, this approach did not give me any output for the regions with 
inverted colors (white letters on black background). 
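In code that call is essentially the following (again a simplified sketch 
of what ocr_id.py does at that line; the helper name and the example image 
path are mine):

import subprocess

def run_tesseract_with_uzn(image_path):
    # tesseract picks up the <image_name>.uzn file sitting next to the
    # image when run with --psm 4; '-' sends the recognized text to stdout.
    command = f"tesseract {image_path} - --psm 4 -l script/Japanese"
    return subprocess.check_output(command, shell=True, text=True)

print(run_tesseract_with_uzn("images/sources/ticket.png"))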
Hopefully that makes sense, LMK if you have further questions.  


BTW I was looking for some more or less substantial information about the 
architecture of tesseract - at least at the level of its main components, 
pipeline, algorithms, etc. - but could not find any. If you (or anyone) 
know of such documentation, please LMK. 
On Friday, September 8, 2023 at 6:18:42 AM UTC-6 nguyenng...@gmail.com 
wrote:

> Hi Alexey, 
>
> Thank you very much for trying out my sample. It is very informative to 
> understand how CRAFT could correctly extract the text regions. As far as I 
> know, Tesseract has a very nice Python wrapper, tesserocr 
> <https://github.com/sirfz/tesserocr/tree/9c8740dae227f60e5a3c2763783d52f19119172b>, 
> which provides many easy-to-use methods to analyze the image text with a 
> range of PSM and RIL modes. However, unfortunately, I was not able to find 
> a good method from the API to extract efficiently all the text regions for 
> some multi-background and text colors samples. 
>
> The results you provided are actually very promising. I have not read your 
> code carefully yet, but may I ask: after getting all the text regions, did 
> you pass them one by one to Tesseract, or how did you get the CRAFT + crop 
> result: ... (with accuracy 48.78)? 
>
> As I noticed, some lines on the sample can add noise to the results. I 
> think that if the line removal method is applied, the results can be 
> better. I do not quite understand the technique of creating a map of boxes 
> using .uzn files and passing it to Tesseract; can you explain a bit 
> further? And yes, you are right: not only was 金額 missing, but all of the 
> dark-background text regions are not shown in the second results (such as 
> 摘要 数 重 単位 単 価 金額, etc.)
>
> Apologies for the conversation becoming longer, but your questions are yet 
> to be answered. I am deeply interested in understanding them too. 
>
> Regards
> Hai. 
>
> On Friday, September 8, 2023 at 5:19:26 AM UTC+9 apism...@gmail.com wrote:
>
>> Thanks for sharing Hai 
>> Looks like CRAFT can detect regions despite the background: 
>> https://github.com/apismensky/ocr_id/blob/main/images/boxes_craft/black_background_text_detection.png
>> . 
>> It also creates cuts for each text region which can be OCR-ed separately 
>> and then joined together as a result.
>> When I ran your example with 
>> https://github.com/apismensky/ocr_id/blob/main/ocr_id.py I've got the 
>> following output: 
>>
>> CRAFT + crop result: 下記 の と お り 、 御 請求 申し 上 げ ま す 。 株 式 会 社 ス キャ ナ 保 存 スト 
>> レー ジ プ ロ ジェ クト 件 名 T573-0011 2023/4/30 大 阪 市 北 区 大 深町 3-1 支払 期限 山口 銀行 本 店 
>> 普通 1111111 グラ ン フ ロン ト 大 阪 タ ワーB 振込 先 TEL : 06-6735-8055 担当 : ICS 太 郎 
>> 66,000 円 (税込 ) a 摘要 数 重 単位 単 価 金額 サン プル 1 1 式 32,000 32,000 サン プル 2 1 式 
>> 18000 18,000 2,000 2,000' 8g,000' 2,000
>> crop word_accuracy: 48.78048780487805
>>
>> I've tried to create a map of boxes using .uzn files and pass it to 
>> tesseract, but results are worse: 
>> CRAFT result: 下記 の と お り 、 御 請求 申し 上 げ ま す 。
>>
>> 株 式 会 社 ス キャ ナ 保 存
>>
>> スト レー ジ プ ロ ジェ クト
>>
>> 〒573-0011
>>
>> 2023/4/30
>>
>> 大 阪 市 北 区 大 深町 3-1
>>
>> 山口 銀行 本 店 普通 1111111
>>
>> グラ ン フ ロン ト 大 阪 タ ワーB
>>
>> TEL : 06-6735-8055
>>
>> 担当 : ICS 太 郎
>>
>> 66,000 円 (税込 )
>>
>> サン プル 1
>>
>> 1| 式
>>
>> 32,000
>>
>> 32,000
>>
>> サン プル 2
>>
>> 1| 式
>>
>> 18000
>>
>> 18,000
>>
>> 2,000
>>
>> 2,000.
>>
>> 8,000
>>
>> 8,000
>>
>> craft word_accuracy: 36.58536585365854. 
>>
>> Apparently 金額 is not there; 
>> Sorry, my Japanese is a little bit rusty :-) 
>> I have the impression that when I pass the map of .uzn text regions to 
>> tesseract it applies one transformation to pre-process the whole image, 
>> but when I pass each individual crop it preprocesses it separately, 
>> applying the best strategy for each region? Of course it is slower this 
>> way. 
>>  
>> On Wednesday, September 6, 2023 at 7:07:52 PM UTC-6 nguyenng...@gmail.com 
>> wrote:
>>
>>> Hi Apismensky,
>>>
>>> Here are the code and sample I used for preprocessing, I extracted the 
>>> ticket region of the train ticket from a picture taken by a smartphone. 
>>> Since the angle, distance, brightness, and many other factors can change 
>>> the picture quality. 
>>> I would say scanned images or fixed-position camera-taken images have 
>>> more consistent quality. 
>>>
>>> Here is the original image:
>>>
>>> [image: sample_to_remove_lines.png]
>>>
>>> import cv2
>>> import numpy as np
>>>
>>> # Try to remove lines
>>> org_image = cv2.imread("/content/sample_to_remove_lines.png")
>>> cv2_show('org_image', org_image)
>>> gray = cv2.cvtColor(org_image, cv2.COLOR_BGR2GRAY)
>>>
>>> thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + 
>>> cv2.THRESH_OTSU)[1]
>>> cv2_show('thresh Otsu', thresh)
>>>
>>>
>>> # removing noise dots.
>>> opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, np.ones((2,2),
>>> np.uint8))
>>> cv2_show('opening', opening)
>>>
>>> thresh = opening.copy()
>>> mask = np.zeros_like(org_image, dtype=np.uint8)
>>>
>>> # Extract horizontal lines
>>> horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (60 ,1))
>>> remove_horizontal = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, 
>>> horizontal_kernel, iterations=2)
>>> cnts = cv2.findContours(remove_horizontal, cv2.RETR_EXTERNAL, 
>>> cv2.CHAIN_APPROX_SIMPLE)
>>> cnts = cnts[0] if len(cnts) == 2 else cnts[1]
>>> for c in cnts:
>>>     cv2.drawContours(mask, [c], -1, (255, 255, 255), 8)
>>> # cv2_show('mask extract horizontal lines', mask)
>>>
>>> # Extract vertical lines
>>> vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1,70))
>>> remove_vertical = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, 
>>> vertical_kernel, iterations=2)
>>> cnts = cv2.findContours(remove_vertical, cv2.RETR_EXTERNAL, 
>>> cv2.CHAIN_APPROX_SIMPLE)
>>> cnts = cnts[0] if len(cnts) == 2 else cnts[1]
>>> for c in cnts:
>>>     cv2.drawContours(mask, [c], -1, (255, 255, 255), 8)
>>>
>>> cv2_show('mask extract lines', mask)
>>>
>>> result = org_image.copy()
>>> # Loop through the pixels of the original image and modify based on the 
>>> mask
>>> for y in range(mask.shape[0]):
>>>     for x in range(mask.shape[1]):
>>>         if np.all(mask[y, x] == 255):  # If pixel is white in mask
>>>             result[y, x] = [255, 255, 255]  # Set pixel to white
>>>
>>> cv2_show("result", result)
>>>
>>> gray = cv2.cvtColor(result, cv2.COLOR_BGR2GRAY)
>>> _, simple_thresh = cv2.threshold(gray, 195, 255, cv2.THRESH_BINARY)
>>> cv2_show('simple_thresh', simple_thresh)
>>>
>>>
>>> In the above code, you can ignore the cv2_show function since it is just 
>>> my custom method for showing images. 
>>> You can see that the idea is to remove some noise, remove the lines, and 
>>> then apply a simple threshold. 
>>> [image: extracted_lines.png]
>>>
>>> [image: removed_lines.png]
>>>
>>>
>>> [image: ready_for_locating_text_box.png]
>>>
>>> I would say that, from this point, the AUTO_OSD PSM mode of Tesseract 
>>> can also give the text boxes for the above picture; it also needs to be 
>>> checked with the RIL mode (maybe RIL.WORD or RIL.TEXTLINE) to get the 
>>> right level of text boxes. 
>>> In my opinion, the same preprocessing methods can only be applied to a 
>>> certain group of samples. It is in fact very hard to cover all the cases.  
>>> For example: 
>>>
>>> [image: black_background.png]
>>>
>>> I found it difficult to locate the text boxes where the text is white 
>>> and the background is dark. Black text on a white background is easy to 
>>> locate and then OCR. I am not sure what a good method would be to locate 
>>> that white text on dark background colors.
>>> I hope to hear your suggestions on this matter, as well as others'. 
>>>
>>> Regards
>>> Hai
>>> On Wednesday, September 6, 2023 at 12:32:56 AM UTC+9 apism...@gmail.com 
>>> wrote:
>>>
>>>> Hai, could you please tell me what you are doing for pre-processing? 
>>>> Do you have any source code you can share? 
>>>> Are those results consistently better for images scanned with different 
>>>> quality (resolution, angles, contrast etc)? 
>>>>
>>>>
>>>> On Monday, September 4, 2023 at 2:02:27 AM UTC-6 nguyenng...@gmail.com 
>>>> wrote:
>>>>
>>>>> Hi, 
>>>>> I would like to hear others' opinions on your questions too. 
>>>>> In my case, when I try using Tesseract for Japan train tickets, I have 
>>>>> to do a lot of steps for preprocessing (remove background colors, noise + 
>>>>> line removal, increase contrast,  etc.) to get satisfactory results. 
>>>>> I am sure what you are doing (locating text boxes, extracting them, 
>>>>> and feeding them one by one to tesseract) can get better accuracy 
>>>>> results. 
>>>>> However, when the number of text boxes increases, it will undoubtedly 
>>>>> affect your performance. 
>>>>> Could you share the PSM mode you use for getting those text boxes' 
>>>>> locations? I usually use AUTO_OSD to get the boxes and expand them a 
>>>>> bit at the edges before passing them to Tesseract. 
>>>>>
>>>>> Regards
>>>>> Hai
>>>>>  
>>>>> On Saturday, September 2, 2023 at 7:03:49 AM UTC+9 apism...@gmail.com 
>>>>> wrote:
>>>>>
>>>>>> I'm looking into OCR for ID cards and driver's licenses, and I found 
>>>>>> out that tesseract performs relatively poorly on ID cards compared to 
>>>>>> other OCR solutions. For this original image: 
>>>>>> https://github.com/apismensky/ocr_id/blob/main/images/boxes_easy/AR.png 
>>>>>> the results are: 
>>>>>>
>>>>>> tesseract: "4d DL 999 as = Ne allo) 2NICK © , q 12 RESTR oe } lick: 5 
>>>>>> DD 8888888888 1234 SZ"
>>>>>> easyocr:  '''9 , ARKANSAS DRIVER'S LICENSE CLAss D 4d DLN 999999999 3 
>>>>>> DOB 03/05/1960 ] 2 SCKPLE 123 NORTH STREET CITY AR 12345 ISS 4b EXP 
>>>>>> 03/05/2018 03/05/2026 15 SEX 16 HGT 18 EYES 5'-10" BRO 9a END NONE 12 
>>>>>> RESTR 
>>>>>> NONE Ylck Sorble DD 8888888888 1234 THE'''
>>>>>> google cloud vision: """SARKANSAS\nSAMPLE\nSTATE O\n9 CLASS D\n4d DLN 
>>>>>> 9999999993 DOB 03/05/1960\nNick Sample\nDRIVER'S LICENSE\n1 SAMPLE\n2 
>>>>>> NICK\n8 123 NORTH STREET\nCITY, AR 12345\n4a ISS\n03/05/2018\n15 SEX 16 
>>>>>> HGT\nM\n5'-10\"\nGREAT SE\n9a END NONE\n12 RESTR NONE\n5 DD 8888888888 
>>>>>> 1234\n4b EXP\n03/05/2026 MS60\n18 EYES\nBRO\nRKANSAS\n0"""
>>>>>>
>>>>>> and word accuracy is:
>>>>>>
>>>>>>              tesseract  |  easyocr  |  google
>>>>>> words         10.34%    |  68.97%   |  82.76%
>>>>>>
>>>>>> This is "out of the box" performance, without any preprocessing. I'm 
>>>>>> not surprised that google vision is that good compared to the others, 
>>>>>> but easyocr, which is another open source solution, performs much 
>>>>>> better than tesseract in this case. I have a whole project dedicated 
>>>>>> to this, and all the other results are much better for easyocr: 
>>>>>> https://github.com/apismensky/ocr_id/blob/main/result.json, and all 
>>>>>> the input files are in 
>>>>>> https://github.com/apismensky/ocr_id/tree/main/images/sources
>>>>>> After digging into it for a little bit, I suspect that bounding box 
>>>>>> detection is much better in google (
>>>>>> https://github.com/apismensky/ocr_id/blob/main/images/boxes_google/AR.png)
>>>>>>  
>>>>>> and easyocr (
>>>>>> https://github.com/apismensky/ocr_id/blob/main/images/boxes_easy/AR.png),
>>>>>>  
>>>>>> than in tesseract (
>>>>>> https://github.com/apismensky/ocr_id/blob/main/images/boxes_tesseract/AR.png).
>>>>>>  
>>>>>>
>>>>>> I'm pretty sure about this, because when I manually cut out the text 
>>>>>> boxes and feed them to tesseract it works much better. 
>>>>>>
>>>>>>
>>>>>> Now questions: 
>>>>>>
>>>>>> - What part of the tesseract codebase is responsible for text 
>>>>>> detection, and which algorithm is it using? 
>>>>>> - What is impacting bounding box detection in tesseract so that it 
>>>>>> fails on these types of images (complex layouts, background noise, etc.)? 
>>>>>> - Is it possible to use the same text detection procedure as easyocr, 
>>>>>> or to improve the existing one?  
>>>>>> - Maybe it's possible to switch the text detection algorithm based on 
>>>>>> the image type, or make it pluggable so the user can configure one of 
>>>>>> several options A,B,C...
>>>>>>
>>>>>>
>>>>>> Thanks. 
>>>>>>
>>>>>
