Hi Alexey,

Thank you very much for trying out my sample. It is very informative to see how CRAFT manages to extract the text regions correctly. As far as I know, Tesseract has a very nice Python wrapper, tesserocr <https://github.com/sirfz/tesserocr/tree/9c8740dae227f60e5a3c2763783d52f19119172b>, which provides many easy-to-use methods for analyzing text in images with a range of PSM and RIL modes. Unfortunately, I was not able to find a method in that API that efficiently extracts all the text regions for samples with multiple background and text colors.
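For concreteness, the tesserocr pattern being referred to is roughly the following. This is only a minimal sketch built from the wrapper's documented GetComponentImages and SetRectangle calls; the file name and the Japanese language pack are placeholders, not something taken from this thread.

from PIL import Image
from tesserocr import PyTessBaseAPI, PSM, RIL

# "invoice.png" and lang="jpn" are placeholders; use any image and traineddata.
with PyTessBaseAPI(psm=PSM.AUTO_OSD, lang="jpn") as api:
    api.SetImage(Image.open("invoice.png"))

    # Ask the layout analysis for text-line components
    # (RIL.WORD is the other obvious granularity to try).
    boxes = api.GetComponentImages(RIL.TEXTLINE, True)

    for _, box, _, _ in boxes:
        # Restrict recognition to one detected region at a time.
        api.SetRectangle(box["x"], box["y"], box["w"], box["h"])
        print(box, api.GetUTF8Text().strip(), api.MeanTextConf())

This only finds regions where Tesseract's own layout analysis succeeds, which is exactly what breaks down on the multi-colored samples discussed below.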
The results you provided are actually very promising. I have not read your code carefully yet, but may I ask: after getting all the text regions, did you pass them to Tesseract one by one, or how did you obtain the CRAFT + crop result: ... (with accuracy 48.78)? As I noticed, some of the ruled lines on the sample can add noise to the results; I think that applying the line-removal method could improve them. I do not quite understand the technique of creating a map of boxes using .uzn files and passing it to Tesseract; can you explain it a bit further? And yes, you are right: not only was 金額 missing, all of the dark-background text regions are absent from the second result (such as 摘要 数 重 単位 単 価 金額, etc.). Apologies for the conversation getting longer, but your questions are still unanswered, and I am deeply interested in understanding them too.

Regards
Hai

On Friday, September 8, 2023 at 5:19:26 AM UTC+9 apism...@gmail.com wrote:

> Thanks for sharing, Hai.
> Looks like CRAFT can detect regions despite the background:
> https://github.com/apismensky/ocr_id/blob/main/images/boxes_craft/black_background_text_detection.png
> It also creates cuts for each text region, which can be OCR-ed separately and then joined together as a result.
> When I ran your example with https://github.com/apismensky/ocr_id/blob/main/ocr_id.py I got the following output:
>
> CRAFT + crop result: 下記 の と お り 、 御 請求 申し 上 げ ま す 。 株 式 会 社 ス キャ ナ 保 存 スト レー ジ プ ロ ジェ クト 件 名 T573-0011 2023/4/30 大 阪 市 北 区 大 深町 3-1 支払 期限 山口 銀行 本 店 普通 1111111 グラ ン フ ロン ト 大 阪 タ ワーB 振込 先 TEL : 06-6735-8055 担当 : ICS 太 郎 66,000 円 (税込 ) a 摘要 数 重 単位 単 価 金額 サン プル 1 1 式 32,000 32,000 サン プル 2 1 式 18000 18,000 2,000 2,000' 8g,000' 2,000
> crop word_accuracy: 48.78048780487805
>
> I've tried to create a map of boxes using .uzn files and pass it to tesseract, but the results are worse:
>
> CRAFT result: 下記 の と お り 、 御 請求 申し 上 げ ま す 。
> 株 式 会 社 ス キャ ナ 保 存
> スト レー ジ プ ロ ジェ クト
> 〒573-0011
> 2023/4/30
> 大 阪 市 北 区 大 深町 3-1
> 山口 銀行 本 店 普通 1111111
> グラ ン フ ロン ト 大 阪 タ ワーB
> TEL : 06-6735-8055
> 担当 : ICS 太 郎
> 66,000 円 (税込 )
> サン プル 1
> 1| 式
> 32,000
> 32,000
> サン プル 2
> 1| 式
> 18000
> 18,000
> 2,000
> 2,000.
> 8,000
> 8,000
>
> craft word_accuracy: 36.58536585365854
>
> Apparently 金額 is not there; sorry, my Japanese is a little bit rusty :-)
> I have the impression that when I pass the map of .uzn text regions to tesseract, it applies a single transformation to pre-process the whole image, but when I pass each individual image it preprocesses each one separately, applying the best strategy for each region? Of course it is slower this way.
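To make the .uzn approach mentioned above concrete: a zone file is a plain-text list of boxes that Tesseract reads from the image's directory, one zone per line. The sketch below is only an illustration; the zone coordinates and file names are invented, and the detail that the file is picked up with --psm 4 is a recollection of Tesseract's behaviour rather than something verified in this thread.

import subprocess

# Hypothetical zones (x, y, width, height), e.g. taken from CRAFT detections.
zones = [(40, 30, 210, 48), (40, 110, 420, 40)]

# Tesseract looks for <image_basename>.uzn next to the image;
# each line is "x y width height label".
with open("invoice.uzn", "w") as f:
    for x, y, w, h in zones:
        f.write(f"{x} {y} {w} {h} Text\n")

# To my knowledge the .uzn file is only honoured in page segmentation mode 4.
subprocess.run(["tesseract", "invoice.png", "out", "--psm", "4", "-l", "jpn"],
               check=True)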
> On Wednesday, September 6, 2023 at 7:07:52 PM UTC-6 nguyenng...@gmail.com wrote:
>
>> Hi Apismensky,
>>
>> Here are the code and the sample I used for preprocessing. I extracted the ticket region of the train ticket from a picture taken with a smartphone, so the angle, distance, brightness, and many other factors can change the picture quality. I would say scanned images, or images taken by a fixed-position camera, have more consistent quality.
>>
>> Here is the original image:
>>
>> [image: sample_to_remove_lines.png]
>>
>> import cv2
>> import numpy as np
>>
>> # Try to remove lines
>> org_image = cv2.imread("/content/sample_to_remove_lines.png")
>> cv2_show('org_image', org_image)
>> gray = cv2.cvtColor(org_image, cv2.COLOR_BGR2GRAY)
>>
>> thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
>> cv2_show('thresh Otsu', thresh)
>>
>> # Remove noise dots with a small morphological opening
>> opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, np.ones((2, 2), np.uint8))
>> cv2_show('opening', opening)
>>
>> thresh = opening.copy()
>> mask = np.zeros_like(org_image, dtype=np.uint8)
>>
>> # Extract horizontal lines
>> horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (60, 1))
>> remove_horizontal = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2)
>> cnts = cv2.findContours(remove_horizontal, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
>> cnts = cnts[0] if len(cnts) == 2 else cnts[1]
>> for c in cnts:
>>     cv2.drawContours(mask, [c], -1, (255, 255, 255), 8)
>> # cv2_show('mask extract horizontal lines', mask)
>>
>> # Extract vertical lines
>> vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 70))
>> remove_vertical = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=2)
>> cnts = cv2.findContours(remove_vertical, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
>> cnts = cnts[0] if len(cnts) == 2 else cnts[1]
>> for c in cnts:
>>     cv2.drawContours(mask, [c], -1, (255, 255, 255), 8)
>>
>> cv2_show('mask extract lines', mask)
>>
>> # Loop through the pixels of the original image and modify them based on the mask
>> result = org_image.copy()
>> for y in range(mask.shape[0]):
>>     for x in range(mask.shape[1]):
>>         if np.all(mask[y, x] == 255):  # if the pixel is white in the mask
>>             result[y, x] = [255, 255, 255]  # set the pixel to white
>>
>> cv2_show('result', result)
>>
>> gray = cv2.cvtColor(result, cv2.COLOR_BGR2GRAY)
>> _, simple_thresh = cv2.threshold(gray, 195, 255, cv2.THRESH_BINARY)
>> cv2_show('simple_thresh', simple_thresh)
>>
>> In the above code you can ignore the cv2_show function, since it is just my custom helper for showing images.
>> You can see that the idea is to remove some noise, remove the lines, and then apply a simple threshold.
>>
>> [image: extracted_lines.png]
>>
>> [image: removed_lines.png]
>>
>> [image: ready_for_locating_text_box.png]
>>
>> I would say that, from this point, Tesseract's AUTO_OSD page segmentation mode can also give the text boxes for the above picture; it also needs to be checked with the RIL mode (maybe RIL.WORD or RIL.TEXTLINE) to get the right level of text boxes.
>> In my opinion, the same preprocessing methods can only be applied to a certain group of samples; it is in fact very hard to cover all the cases. For example:
>>
>> [image: black_background.png]
>>
>> I found it difficult to locate the text boxes where the text is white and the background is dark, while black text on a white background is easy to locate and then OCR. I am not sure what a good method is for locating white text on dark background colors. I hope to hear your suggestions, as well as others', on this matter.
>>
>> Regards
>> Hai
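On the white-text-on-a-dark-background question above, one generic trick (not something taken from this thread's code) is to decide the polarity per region and invert the dark regions before thresholding or OCR. A rough sketch; the file name and the box coordinates are invented, and in practice the boxes could come from CRAFT, GetComponentImages, or findContours:

import cv2
import numpy as np

img = cv2.imread("black_background.png")    # placeholder file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

def normalize_polarity(crop):
    """Return the crop as dark text on a light background.

    If the median intensity is low, the region is assumed to be
    light text on a dark background and is inverted.
    """
    return 255 - crop if np.median(crop) < 127 else crop

boxes = [(30, 20, 200, 40)]                 # invented (x, y, w, h) values
for x, y, w, h in boxes:
    crop = normalize_polarity(gray[y:y + h, x:x + w])
    # crop is now dark-on-light and can be thresholded or OCR-ed like any other region
    cv2.imwrite(f"region_{x}_{y}.png", crop)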
>> On Wednesday, September 6, 2023 at 12:32:56 AM UTC+9 apism...@gmail.com wrote:
>>
>>> Hai, could you please tell me what you are doing for pre-processing?
>>> Do you have any source code you can share?
>>> Are those results consistently better for images scanned with different quality (resolution, angles, contrast, etc.)?
>>>
>>> On Monday, September 4, 2023 at 2:02:27 AM UTC-6 nguyenng...@gmail.com wrote:
>>>
>>>> Hi,
>>>> I would like to hear others' opinions on your questions too.
>>>> In my case, when I try using Tesseract on Japanese train tickets, I have to do a lot of preprocessing steps (removing background colors, noise and line removal, increasing contrast, etc.) to get satisfactory results.
>>>> I am sure that what you are doing (locating text boxes, extracting them, and feeding them one by one to Tesseract) can give better accuracy. However, when the number of text boxes increases, it will undoubtedly affect performance.
>>>> Could you share the PSM mode you use for getting those text boxes' locations?
>>>> I usually use AUTO_OSD to get the boxes and expand them a bit at the edges before passing them to Tesseract.
>>>>
>>>> Regards
>>>> Hai
>>>>
>>>> On Saturday, September 2, 2023 at 7:03:49 AM UTC+9 apism...@gmail.com wrote:
>>>>
>>>>> I'm looking into OCR for ID cards and driver's licenses, and I found out that tesseract performs relatively poorly on ID cards compared to other OCR solutions. For this original image:
>>>>> https://github.com/apismensky/ocr_id/blob/main/images/boxes_easy/AR.png
>>>>> the results are:
>>>>>
>>>>> tesseract: "4d DL 999 as = Ne allo) 2NICK © , q 12 RESTR oe } lick: 5 DD 8888888888 1234 SZ"
>>>>> easyocr: '''9 , ARKANSAS DRIVER'S LICENSE CLAss D 4d DLN 999999999 3 DOB 03/05/1960 ] 2 SCKPLE 123 NORTH STREET CITY AR 12345 ISS 4b EXP 03/05/2018 03/05/2026 15 SEX 16 HGT 18 EYES 5'-10" BRO 9a END NONE 12 RESTR NONE Ylck Sorble DD 8888888888 1234 THE'''
>>>>> google cloud vision: """SARKANSAS\nSAMPLE\nSTATE O\n9 CLASS D\n4d DLN 9999999993 DOB 03/05/1960\nNick Sample\nDRIVER'S LICENSE\n1 SAMPLE\n2 NICK\n8 123 NORTH STREET\nCITY, AR 12345\n4a ISS\n03/05/2018\n15 SEX 16 HGT\nM\n5'-10\"\nGREAT SE\n9a END NONE\n12 RESTR NONE\n5 DD 8888888888 1234\n4b EXP\n03/05/2026 MS60\n18 EYES\nBRO\nRKANSAS\n0"""
>>>>>
>>>>> and the word accuracy is:
>>>>>
>>>>>         tesseract | easyocr | google
>>>>> words   10.34%    | 68.97%  | 82.76%
>>>>>
>>>>> This is "out of the box" performance, without any preprocessing. I'm not surprised that google vision is that good compared to the others, but easyocr, which is another open-source solution, performs much better than tesseract in this case. I have a whole project dedicated to this, and all the other results are much better for easyocr as well: https://github.com/apismensky/ocr_id/blob/main/result.json; all input files are in https://github.com/apismensky/ocr_id/tree/main/images/sources
>>>>> After digging into it for a little bit, I suspect that bounding box detection is much better in google (https://github.com/apismensky/ocr_id/blob/main/images/boxes_google/AR.png) and easyocr (https://github.com/apismensky/ocr_id/blob/main/images/boxes_easy/AR.png) than in tesseract (https://github.com/apismensky/ocr_id/blob/main/images/boxes_tesseract/AR.png).
>>>>>
>>>>> I'm pretty sure about this, because when I manually cut out the text boxes and feed them to tesseract it works much better.
>>>>>
>>>>> Now the questions:
>>>>>
>>>>> - What part of the tesseract codebase is responsible for text detection, and which algorithm does it use?
>>>>> - What impacts bounding box detection in tesseract so that it fails on these types of images (complex layouts, background noise, etc.)?
>>>>> - Is it possible to use the same text detection procedure as easyocr, or to improve the existing one?
>>>>> - Would it be possible to switch the text detection algorithm based on the image type, or make it pluggable so the user can configure one of several options A, B, C...?
>>>>>
>>>>> Thanks.
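As an illustration of the "manually cut the text boxes and feed them to tesseract" idea that runs through this thread, here is a minimal sketch. It is not the code from ocr_id.py: the box coordinates are invented, pytesseract is just one convenient wrapper around the tesseract CLI, and in practice the boxes would come from a detector such as EasyOCR or CRAFT.

import cv2
import pytesseract

img = cv2.imread("AR.png")  # placeholder for a local copy of the sample image

# Boxes (x, y, w, h) as an external detector might report them; invented values.
boxes = [(35, 40, 320, 38), (35, 90, 280, 34)]

pieces = []
for x, y, w, h in boxes:
    pad = 4  # expand each box a little at the edges before cropping
    crop = img[max(y - pad, 0):y + h + pad, max(x - pad, 0):x + w + pad]
    crop = cv2.cvtColor(crop, cv2.COLOR_BGR2RGB)
    # --psm 7 treats the crop as a single line of text
    pieces.append(pytesseract.image_to_string(crop, config="--psm 7").strip())

# Join the per-box results into one string, as the CRAFT + crop run above does.
print(" ".join(pieces))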