I'm looking into OCR for ID cards and driver's licenses, and I found that 
tesseract performs relatively poorly on ID cards compared to other OCR 
solutions. For this original image: 
https://github.com/apismensky/ocr_id/blob/main/images/sources/AR.png the 
results are: 

tesseract: "4d DL 999 as = Ne allo) 2NICK © , q 12 RESTR oe } lick: 5 DD 
8888888888 1234 SZ"
easyocr:  '''9 , ARKANSAS DRIVER'S LICENSE CLAss D 4d DLN 999999999 3 DOB 
03/05/1960 ] 2 SCKPLE 123 NORTH STREET CITY AR 12345 ISS 4b EXP 03/05/2018 
03/05/2026 15 SEX 16 HGT 18 EYES 5'-10" BRO 9a END NONE 12 RESTR NONE Ylck 
Sorble DD 8888888888 1234 THE'''
google cloud vision: """SARKANSAS\nSAMPLE\nSTATE O\n9 CLASS D\n4d DLN 
9999999993 DOB 03/05/1960\nNick Sample\nDRIVER'S LICENSE\n1 SAMPLE\n2 
NICK\n8 123 NORTH STREET\nCITY, AR 12345\n4a ISS\n03/05/2018\n15 SEX 16 
HGT\nM\n5'-10\"\nGREAT SE\n9a END NONE\n12 RESTR NONE\n5 DD 8888888888 
1234\n4b EXP\n03/05/2026 MS60\n18 EYES\nBRO\nRKANSAS\n0"""

and word accuracy is:

             tesseract  |  easyocr  |  google
words         10.34%    |  68.97%   |  82.76%
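
For reference, this is roughly how I'm running the two open-source engines 
out of the box and scoring words. A minimal sketch only: the real harness 
lives in the repo, and the reference word list below is a made-up stand-in.

# Rough sketch of the out-of-the-box comparison (no preprocessing).
# Requires pytesseract and easyocr; the reference word list below is a
# hypothetical stand-in -- the actual scoring lives in the repo's harness.
import pytesseract
import easyocr

IMG = "images/sources/AR.png"

# tesseract, all defaults (psm 3, LSTM engine)
tess_text = pytesseract.image_to_string(IMG)

# easyocr, all defaults (English model)
reader = easyocr.Reader(["en"])
easy_text = " ".join(reader.readtext(IMG, detail=0))

def word_accuracy(ocr_text, reference_words):
    """Fraction of reference words that show up in the OCR output."""
    return len(set(ocr_text.split()) & reference_words) / len(reference_words)

reference = set("ARKANSAS DRIVER'S LICENSE NICK SAMPLE".split())  # stand-in
print("tesseract:", word_accuracy(tess_text, reference))
print("easyocr:  ", word_accuracy(easy_text, reference))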

This is "out if the box" performance, without any preprocessing. I'm not 
surprised that google vision is that good compared to others, but easyocr, 
which is another open source solution performs much better than tesseract 
is this case. I have the whole project dedicated to this, and all other 
results are much better for easyocr: 
https://github.com/apismensky/ocr_id/blob/main/result.json, all input files 
are files in https://github.com/apismensky/ocr_id/tree/main/images/sources
After digging into it for a bit, I suspect that bounding box detection is 
much better in google 
(https://github.com/apismensky/ocr_id/blob/main/images/boxes_google/AR.png) 
and easyocr 
(https://github.com/apismensky/ocr_id/blob/main/images/boxes_easy/AR.png) 
than in tesseract 
(https://github.com/apismensky/ocr_id/blob/main/images/boxes_tesseract/AR.png). 
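
For what it's worth, here is roughly how a tesseract box overlay like the 
one above can be produced (a sketch using pytesseract.image_to_data; the 
repo's actual plotting code may differ).

# Sketch: dump tesseract's word-level boxes and draw them, to get an
# overlay like images/boxes_tesseract/AR.png (actual repo code may differ).
import pytesseract
from pytesseract import Output
from PIL import Image, ImageDraw

img = Image.open("images/sources/AR.png")
data = pytesseract.image_to_data(img, output_type=Output.DICT)

draw = ImageDraw.Draw(img)
for i, text in enumerate(data["text"]):
    if text.strip():  # skip empty detections
        x, y, w, h = (data[k][i] for k in ("left", "top", "width", "height"))
        draw.rectangle((x, y, x + w, y + h), outline="red", width=2)

img.save("boxes_tesseract_AR.png")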

I'm fairly sure about this, because when I manually cut out the text boxes 
and feed them to tesseract one at a time, it works much better. 
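
The manual experiment was basically this (a sketch; the crop coordinates 
are made up for the example, I read them off the image by hand):

# Sketch of the manual experiment: crop one text region by hand and let
# tesseract treat it as a single line (--psm 7). Coordinates are made up
# here -- a hypothetical box around the "4d DLN 999999999" field.
from PIL import Image
import pytesseract

img = Image.open("images/sources/AR.png")
crop = img.crop((40, 120, 400, 160))  # (left, upper, right, lower)
print(pytesseract.image_to_string(crop, config="--psm 7"))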


Now, my questions: 

- Which part of the tesseract codebase is responsible for text detection, 
and which algorithm does it use? 
- What impacts bounding box detection in tesseract so that it fails on 
these types of images (complex layouts, background noise, etc.)? 
- Is it possible to use the same text detection procedure as easyocr, or 
to improve the existing one? (Something like the hybrid sketch after this 
list is what I have in mind.) 
- Maybe it is possible to switch the text detection algorithm based on the 
image type, or make it pluggable so the user can pick from several options 
A, B, C...
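
To make the third question concrete, here is the kind of hybrid I mean: use 
easyocr's CRAFT-based detector only for the boxes, and keep tesseract for 
recognition. A rough sketch, assuming easyocr's detect() returns per-image 
lists of [x_min, x_max, y_min, y_max] boxes, which is what I see in current 
releases:

# Sketch of a hybrid pipeline: easyocr (CRAFT) detection + tesseract
# recognition. Assumes detect() returns per-image lists of
# [x_min, x_max, y_min, y_max] boxes, as in current easyocr releases.
import easyocr
import pytesseract
from PIL import Image

IMG = "images/sources/AR.png"

reader = easyocr.Reader(["en"])
horizontal_boxes, _free_boxes = reader.detect(IMG)

img = Image.open(IMG)
words = []
for x_min, x_max, y_min, y_max in horizontal_boxes[0]:
    crop = img.crop((int(x_min), int(y_min), int(x_max), int(y_max)))
    # --psm 7: treat each detected region as one text line
    words.append(pytesseract.image_to_string(crop, config="--psm 7").strip())

print(" ".join(w for w in words if w))

That's essentially what I did by hand above; having something like this 
pluggable inside tesseract itself would be even better.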


Thanks. 
