[tesseract-ocr] Re: Tesseract performance On ID cards and passports

Alexey Pismenskiy Tue, 05 Sep 2023 08:17:30 -0700

These results are for PSM=1. I think I have tried other values, but I 
haven't noticed any improvements. 
https://github.com/apismensky/ocr_id/blob/main/ocr_id.py#L81
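
For anyone who wants to compare page segmentation modes on the same image, 
here is a minimal sketch. It assumes the pytesseract wrapper and Pillow are 
installed. The names psm_config and sweep_psm and the list of modes tried are 
illustrative only and are not taken from ocr_id.py.

```python
from typing import Dict, Iterable

# Tesseract page segmentation modes worth trying on a card-like image:
# 1 = automatic with OSD, 3 = fully automatic, 4 = single column,
# 6 = single uniform block, 11 = sparse text, 12 = sparse text with OSD.
CANDIDATE_PSMS = (1, 3, 4, 6, 11, 12)


def psm_config(psm: int) -> str:
    """Build the config string passed to Tesseract for one PSM value."""
    if not 0 <= psm <= 13:
        raise ValueError(f"PSM must be between 0 and 13, got {psm}")
    return f"--psm {psm}"


def sweep_psm(image_path: str, psms: Iterable[int] = CANDIDATE_PSMS) -> Dict[int, str]:
    """Run Tesseract once per PSM and return the recognized text for each."""
    # Imported lazily so psm_config works even without Tesseract installed.
    import pytesseract
    from PIL import Image

    image = Image.open(image_path)
    return {
        psm: pytesseract.image_to_string(image, config=psm_config(psm))
        for psm in psms
    }
```

Calling sweep_psm("AR.png") (a hypothetical local copy of the image) returns 
one text result per mode, which can then be scored side by side.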


Regards, 
Alexey

On Monday, September 4, 2023 at 2:02:27 AM UTC-6 nguyenng...@gmail.com 
wrote:

> Hi, 
> I would like to hear others' opinions on your questions too. 
> In my case, when I try using Tesseract for Japanese train tickets, I have to 
> do a lot of preprocessing steps (removing background colors, noise and line 
> removal, increasing contrast, etc.) to get satisfactory results. 
> I am sure what you are doing (locating text boxes, extracting them, and 
> feeding them one by one to tesseract) can give better accuracy. 
> However, as the number of text boxes increases, it will undoubtedly 
> affect your processing speed. 
> Could you share the PSM mode you use to get those text boxes' locations? I 
> usually use AUTO_OSD to get the boxes and expand them a bit at the 
> edges before passing them to Tesseract. 
>
> Regards
> Hai
>  
> On Saturday, September 2, 2023 at 7:03:49 AM UTC+9 apism...@gmail.com 
> wrote:
>
>> I'm looking into OCR for ID cards and driver's licenses, and I found out 
>> that tesseract performs relatively poorly on ID cards compared to other OCR 
>> solutions. For this original image: 
>> https://github.com/apismensky/ocr_id/blob/main/images/boxes_easy/AR.png 
>> the results are: 
>>
>> tesseract: "4d DL 999 as = Ne allo) 2NICK © , q 12 RESTR oe } lick: 5 DD 
>> 8888888888 1234 SZ"
>> easyocr:  '''9 , ARKANSAS DRIVER'S LICENSE CLAss D 4d DLN 999999999 3 DOB 
>> 03/05/1960 ] 2 SCKPLE 123 NORTH STREET CITY AR 12345 ISS 4b EXP 03/05/2018 
>> 03/05/2026 15 SEX 16 HGT 18 EYES 5'-10" BRO 9a END NONE 12 RESTR NONE Ylck 
>> Sorble DD 8888888888 1234 THE'''
>> google cloud vision: """SARKANSAS\nSAMPLE\nSTATE O\n9 CLASS D\n4d DLN 
>> 9999999993 DOB 03/05/1960\nNick Sample\nDRIVER'S LICENSE\n1 SAMPLE\n2 
>> NICK\n8 123 NORTH STREET\nCITY, AR 12345\n4a ISS\n03/05/2018\n15 SEX 16 
>> HGT\nM\n5'-10\"\nGREAT SE\n9a END NONE\n12 RESTR NONE\n5 DD 8888888888 
>> 1234\n4b EXP\n03/05/2026 MS60\n18 EYES\nBRO\nRKANSAS\n0"""
>>
>> and word accuracy is:
>>
>>              tesseract  |  easyocr  |  google
>> words         10.34%    |  68.97%   |  82.76%
>>
>> This is "out of the box" performance, without any preprocessing. I'm not 
>> surprised that google vision is that good compared to the others, but easyocr, 
>> which is another open source solution, performs much better than tesseract 
>> in this case. I have a whole project dedicated to this, and all the other 
>> results are also much better for easyocr: 
>> https://github.com/apismensky/ocr_id/blob/main/result.json, all input 
>> files are files in 
>> https://github.com/apismensky/ocr_id/tree/main/images/sources
>> After digging into it for a little bit, I suspect that bounding box 
>> detection is much better in google (
>> https://github.com/apismensky/ocr_id/blob/main/images/boxes_google/AR.png) 
>> and easyocr (
>> https://github.com/apismensky/ocr_id/blob/main/images/boxes_easy/AR.png), 
>> than in tesseract (
>> https://github.com/apismensky/ocr_id/blob/main/images/boxes_tesseract/AR.png).
>>  
>>
>> I'm pretty sure about this, because when I manually cut out the text boxes 
>> and feed them to tesseract it works much better. 
>>
>>
>> Now questions: 
>>
>> - Which part of the tesseract codebase is responsible for text 
>> detection, and which algorithm does it use? 
>> - What affects bounding box detection in tesseract so that it fails on 
>> these types of images (complex layouts, background noise, etc.)?
>> - Is it possible to use the same text detection procedure as easyocr, or 
>> to improve the existing one?  
>> - Would it be possible to switch the text detection algorithm based on the 
>> image type, or make it pluggable so the user can choose from several 
>> options (A, B, C...)?
>>
>>
>> Thanks. 
>>
>
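
The approach discussed in the quoted messages (locate text boxes, expand them 
a bit at the edges, then feed each crop to Tesseract) could be sketched 
roughly as below. This is only an illustration under assumptions: the boxes 
are taken to be (left, top, right, bottom) pixel tuples from whatever detector 
is used, the names expand_box and ocr_boxes and the 4-pixel margin are made 
up for this example, and PSM 7 (single text line) is one reasonable choice 
for cropped fields, not something confirmed in the thread.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def expand_box(box: Box, margin: int, width: int, height: int) -> Box:
    """Grow a box by `margin` pixels on every side, clamped to the image."""
    left, top, right, bottom = box
    return (
        max(0, left - margin),
        max(0, top - margin),
        min(width, right + margin),
        min(height, bottom + margin),
    )


def ocr_boxes(image_path: str, boxes: List[Box], margin: int = 4) -> List[str]:
    """Crop each padded box and recognize it as a single line of text."""
    # Imported lazily so expand_box can be used without Tesseract installed.
    import pytesseract
    from PIL import Image

    image = Image.open(image_path)
    texts = []
    for box in boxes:
        crop = image.crop(expand_box(box, margin, image.width, image.height))
        # PSM 7 tells Tesseract to treat the crop as one text line.
        texts.append(pytesseract.image_to_string(crop, config="--psm 7").strip())
    return texts
```

Note that this runs Tesseract once per box, so the run time grows with the 
number of boxes, which is the cost mentioned in the quoted reply.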

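The thread does not say how the word accuracy figures were computed. One 
plausible definition, assumed here purely for illustration, is the share of 
ground-truth words that also appear in the OCR output, counted as a multiset 
and compared case-sensitively. The function name word_accuracy is 
hypothetical and may differ from what the ocr_id project does.

```python
from collections import Counter


def word_accuracy(expected: str, recognized: str) -> float:
    """Fraction of expected words that also occur in the recognized text."""
    expected_words = Counter(expected.split())
    recognized_words = Counter(recognized.split())
    total = sum(expected_words.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared word,
    # so a word repeated in the ground truth must be repeated in the output.
    matched = sum((expected_words & recognized_words).values())
    return matched / total


# Two of the four expected words ("4d" and "DOB") are found.
print(word_accuracy("4d DLN 999999999 DOB", "4d DL 999 DOB"))  # 0.5
```

A stricter metric would also take word order into account, and a more lenient 
one would ignore case and punctuation. The numbers in the table above depend 
on which of these choices was made.
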
