I'm looking into OCR for ID cards and driver's licenses, and I found out that tesseract performs relatively poorly on ID cards compared to other OCR solutions. For this original image: https://github.com/apismensky/ocr_id/blob/main/images/boxes_easy/AR.png the results are:
tesseract: "4d DL 999 as = Ne allo) 2NICK © , q 12 RESTR oe } lick: 5 DD 8888888888 1234 SZ" easyocr: '''9 , ARKANSAS DRIVER'S LICENSE CLAss D 4d DLN 999999999 3 DOB 03/05/1960 ] 2 SCKPLE 123 NORTH STREET CITY AR 12345 ISS 4b EXP 03/05/2018 03/05/2026 15 SEX 16 HGT 18 EYES 5'-10" BRO 9a END NONE 12 RESTR NONE Ylck Sorble DD 8888888888 1234 THE''' google cloud vision: """SARKANSAS\nSAMPLE\nSTATE O\n9 CLASS D\n4d DLN 9999999993 DOB 03/05/1960\nNick Sample\nDRIVER'S LICENSE\n1 SAMPLE\n2 NICK\n8 123 NORTH STREET\nCITY, AR 12345\n4a ISS\n03/05/2018\n15 SEX 16 HGT\nM\n5'-10\"\nGREAT SE\n9a END NONE\n12 RESTR NONE\n5 DD 8888888888 1234\n4b EXP\n03/05/2026 MS60\n18 EYES\nBRO\nRKANSAS\n0""" and word accuracy is: tesseract | easyocr | google words 10.34% | 68.97% | 82.76% This is "out if the box" performance, without any preprocessing. I'm not surprised that google vision is that good compared to others, but easyocr, which is another open source solution performs much better than tesseract is this case. I have the whole project dedicated to this, and all other results are much better for easyocr: https://github.com/apismensky/ocr_id/blob/main/result.json, all input files are files in https://github.com/apismensky/ocr_id/tree/main/images/sources After digging into it for a little bit, I suspect that bounding box detection is much better in google (https://github.com/apismensky/ocr_id/blob/main/images/boxes_google/AR.png) and easyocr (https://github.com/apismensky/ocr_id/blob/main/images/boxes_easy/AR.png), than in tesseract (https://github.com/apismensky/ocr_id/blob/main/images/boxes_tesseract/AR.png). I'm pretty sure, about this, cause when I manually cut the text boxes and feed them to tesseract it works much better. Now questions: - What is the part of the codebase in tesseract that is responsible for text detection and which algorithm is it using? - What is impacting bounding box detection in tesseract so it fails on these types of images (complex layouts / background noise... etc) - Is it possible to use the same text detection procedure as easyocr or improve the existing one? - Maybe possible to switch text detection algo based on the image type or make it pluggable where user can configure from several options A,B,C... Thanks. -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/4f213209-cdee-4d73-a838-1aac4bb0b9afn%40googlegroups.com.