Hi Alexey,

Thank you very much for your detailed explanation, and sorry for my late reply. I got pulled into other matters over the last few days.
I was not aware of the .uzn file usage in Tesseract before; thank you for pointing it out. In my previous project I did apply preprocessing to each block image (since some blocks have background noise or low image quality). However, doing that is really not a good approach for large images with a great number of text boxes. I used Python multiprocessing to speed it up a little: with that, depending on the number of CPU cores, we can process multiple block images in parallel.

On my sample above, you got almost 100% correct results from the text boxes. I will try to apply some preprocessing methods to see whether the results can be improved further, and I will let you know right after that. Meanwhile, I still hope to hear updates on your questions.

Regards
Hai

On Saturday, September 9, 2023 at 12:27:53 AM UTC+9 apism...@gmail.com wrote:

> Hai, sorry I missed a lot of details in my last message, so I will try to clarify.
> Disclaimer: I'm not a computer vision guru, nor an ML or data science guy, just a regular software development background.
>
> - API to extract efficiently all the text regions for some multi-background and text color samples
> I don't think that tesseract out of the box has decent text region detection. That is what I'm trying to figure out in my post. The Tesseract folks have not responded to it yet; IDK if any of them are in this mailing group. It looks like there are better options out there (CRAFT is just one of them: https://arxiv.org/abs/1904.01941); IDK why they cannot be integrated into tesseract.
>
> - did you pass them one by one to Tesseract
> Yes. When CRAFT is executed in https://github.com/apismensky/ocr_id/blob/main/ocr_id.py#L233 it creates a bunch of crop files; for your ticket example they are in https://github.com/apismensky/ocr_id/tree/main/images/boxes_craft/ticket_crops. It also creates a text region map in https://github.com/apismensky/ocr_id/blob/main/images/boxes_craft/ticket_text_detection.txt. Most of the regions are rectangles (8 numbers in one row: x1,y1,...,x4,y4) but some may be polygons (as you can see in the other files).
> Then I sort all the files by crop_NUMBER in https://github.com/apismensky/ocr_id/blob/main/ocr_id.py#L265C6-L265C6 so that they are ordered by their appearance in the original image. Then I loop through all of them in https://github.com/apismensky/ocr_id/blob/main/ocr_id.py#L267 and feed each image to tesseract in https://github.com/apismensky/ocr_id/blob/main/ocr_id.py#L270. Notice that I'm using psm=7 there, because we already know that each image is a box with a single text line, and then I join the results together in crop_result = ' '.join(res).
> Also notice that I'm not doing any pre-processing; I wonder what the result would be with some preprocessing for each image. Hopefully better?
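> To make that loop concrete, here is a minimal sketch (not the actual ocr_id.py code; the crop file naming and the script/Japanese model are assumptions). Since each crop is independent, it also folds in the multi-process idea mentioned at the top of this thread:
>
> import glob
> import re
> from multiprocessing import Pool
>
> import pytesseract
> from PIL import Image
>
> def ocr_crop(path):
>     # psm 7: treat the crop as a single line of text
>     return pytesseract.image_to_string(
>         Image.open(path), lang='script/Japanese', config='--psm 7').strip()
>
> if __name__ == '__main__':
>     # assumed naming: crop_0.png, crop_1.png, ... as produced by CRAFT;
>     # sort numerically so crops keep their order in the original image
>     crops = sorted(glob.glob('ticket_crops/crop_*.png'),
>                    key=lambda p: int(re.search(r'crop_(\d+)', p).group(1)))
>     with Pool() as pool:  # one worker per CPU core by default
>         res = pool.map(ocr_crop, crops)
>     crop_result = ' '.join(res)
>     print(crop_result)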
> I have tried another approach: passing a map of text regions detected by CRAFT to tesseract, so that it does not try to do its own text detection. The motivation was to reduce the number of calls to tesseract (one per crop), i.e. to reduce the time. That's what .uzn files are for.
> So for your example it would be something like:
> tesseract ticket.png - --psm 4 -l script/Japanese
> Notice that https://github.com/apismensky/ocr_id/blob/main/images/sources/ticket.uzn is in the same folder as the original image, and it has the same name as the image file (minus the extension). There is a little function that converts CRAFT text boxes to tesseract .uzn files: https://github.com/apismensky/ocr_id/blob/main/ocr_id.py#L175
> The problem is that you cannot really use pytesseract.image_to_data here. I assume this is because of a filename mismatch: image_to_data (most probably) creates a temp file in the filesystem whose name does not match the .uzn file name. So I did it by calling subprocess.check_output(command, shell=True, text=True) in https://github.com/apismensky/ocr_id/blob/main/ocr_id.py#L101C18-L101C73 to manually run tesseract as an external process. As I mentioned in my last message, this approach did not give me any output for the regions with inverted colors (white letters on a black background).
> Hopefully that makes sense; let me know if you have further questions.
>
> BTW, I was looking for some more or less substantial information about the architecture of tesseract, at least at the level of main components, pipeline, algorithms, etc., but could not find it. If you (or anyone) are aware of such a resource, please let me know.
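> For reference, the .uzn conversion mentioned above could look roughly like this (a sketch of the idea, not the actual function at ocr_id.py#L175; the "left top width height label" line format is my understanding of .uzn):
>
> import subprocess
>
> def craft_boxes_to_uzn(boxes, uzn_path):
>     # each CRAFT box is 8 numbers: x1,y1,x2,y2,x3,y3,x4,y4 (corner points);
>     # a .uzn zone is its axis-aligned bounding box
>     with open(uzn_path, 'w') as f:
>         for b in boxes:
>             xs, ys = b[0::2], b[1::2]
>             left, top = min(xs), min(ys)
>             f.write(f'{int(left)} {int(top)} '
>                     f'{int(max(xs) - left)} {int(max(ys) - top)} Text\n')
>
> # the .uzn file must sit next to the image with the same base name
> # (ticket.uzn for ticket.png); tesseract is then run as an external
> # process, since pytesseract's temp file name would not match the .uzn:
> out = subprocess.check_output(
>     'tesseract ticket.png - --psm 4 -l script/Japanese', shell=True, text=True)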
On Friday, September 8, 2023 at 6:18:42 AM UTC-6 nguyenng...@gmail.com wrote:

>> Hi Alexey,
>>
>> Thank you very much for trying out my sample. It is very informative to see how CRAFT can correctly extract the text regions. As far as I know, Tesseract has a very nice Python wrapper, tesserocr <https://github.com/sirfz/tesserocr/tree/9c8740dae227f60e5a3c2763783d52f19119172b>, which provides many easy-to-use methods to analyze the text in images with a range of PSM and RIL modes. Unfortunately, however, I was not able to find a good method in that API to efficiently extract all the text regions for samples with multiple background and text colors.
>>
>> The results you provided are actually very promising. I have not read your code carefully yet, but may I ask: after getting all the text regions, did you pass them one by one to Tesseract? How did you get the "CRAFT + crop result: ..." (with accuracy 48.78)?
>>
>> As I noticed, some lines on the sample can add noise to the results. I think that applying a line removal method could make the results better. I do not quite understand the technique of creating a map of boxes using .uzn files and passing it to Tesseract; can you explain a bit further? And yes, you are right: not only was 金額 missing, but all of the dark-background text regions are absent from the second result (such as 摘要 数 重 単位 単 価 金額, etc.).
>>
>> Apologies that the conversation is becoming longer, but your original questions are yet to be answered. I am deeply interested in understanding them too.
>>
>> Regards
>> Hai.
>>
>> On Friday, September 8, 2023 at 5:19:26 AM UTC+9 apism...@gmail.com wrote:
>>
>>> Thanks for sharing, Hai.
>>> It looks like CRAFT can detect regions regardless of the background: https://github.com/apismensky/ocr_id/blob/main/images/boxes_craft/black_background_text_detection.png
>>> It also creates a cut for each text region, which can be OCR-ed separately and then joined together as a result.
>>> When I ran your example with https://github.com/apismensky/ocr_id/blob/main/ocr_id.py I got the following output:
>>>
>>> CRAFT + crop result: 下記 の と お り 、 御 請求 申し 上 げ ま す 。 株 式 会 社 ス キャ ナ 保 存 スト レー ジ プ ロ ジェ クト 件 名 T573-0011 2023/4/30 大 阪 市 北 区 大 深町 3-1 支払 期限 山口 銀行 本 店 普通 1111111 グラ ン フ ロン ト 大 阪 タ ワーB 振込 先 TEL : 06-6735-8055 担当 : ICS 太 郎 66,000 円 (税込 ) a 摘要 数 重 単位 単 価 金額 サン プル 1 1 式 32,000 32,000 サン プル 2 1 式 18000 18,000 2,000 2,000' 8g,000' 2,000
>>> crop word_accuracy: 48.78048780487805
>>>
>>> I've tried to create a map of boxes using .uzn files and pass it to tesseract, but the results are worse:
>>>
>>> CRAFT result: 下記 の と お り 、 御 請求 申し 上 げ ま す 。
>>> 株 式 会 社 ス キャ ナ 保 存
>>> スト レー ジ プ ロ ジェ クト
>>> 〒573-0011
>>> 2023/4/30
>>> 大 阪 市 北 区 大 深町 3-1
>>> 山口 銀行 本 店 普通 1111111
>>> グラ ン フ ロン ト 大 阪 タ ワーB
>>> TEL : 06-6735-8055
>>> 担当 : ICS 太 郎
>>> 66,000 円 (税込 )
>>> サン プル 1
>>> 1| 式
>>> 32,000
>>> 32,000
>>> サン プル 2
>>> 1| 式
>>> 18000
>>> 18,000
>>> 2,000
>>> 2,000.
>>> 8,000
>>> 8,000
>>> craft word_accuracy: 36.58536585365854
>>>
>>> Apparently 金額 is not there. Sorry, my Japanese is a little bit rusty :-)
>>> I have the impression that when I pass the map of .uzn text regions to tesseract, it applies one transformation to pre-process the whole image, but when I pass each individual image, it pre-processes them separately, applying the best strategy for each region? Of course it is slower this way.
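>>> (For clarity: "word_accuracy" above is along the lines of "what fraction of the expected words appear in the OCR output". A sketch of such a metric, not necessarily the exact one used in ocr_id.py:)
>>>
>>> def word_accuracy(expected: str, actual: str) -> float:
>>>     # percentage of expected words that show up in the OCR output
>>>     expected_words = expected.split()
>>>     actual_words = set(actual.split())
>>>     hits = sum(1 for w in expected_words if w in actual_words)
>>>     return 100.0 * hits / len(expected_words) if expected_words else 0.0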
>>> On Wednesday, September 6, 2023 at 7:07:52 PM UTC-6 nguyenng...@gmail.com wrote:
>>>
>>>> Hi Apismensky,
>>>>
>>>> Here are the code and sample I used for preprocessing. I extracted the ticket region of the train ticket from a picture taken by a smartphone, since the angle, distance, brightness, and many other factors can change the picture quality. I would say scanned images or images taken by a fixed-position camera have more consistent quality.
>>>>
>>>> Here is the original image:
>>>>
>>>> [image: sample_to_remove_lines.png]
>>>>
>>>> import cv2
>>>> import numpy as np
>>>>
>>>> # Try to remove lines
>>>> org_image = cv2.imread("/content/sample_to_remove_lines.png")
>>>> cv2_show('org_image', org_image)
>>>> gray = cv2.cvtColor(org_image, cv2.COLOR_BGR2GRAY)
>>>>
>>>> # Otsu binarization (inverted: text and lines become white)
>>>> thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
>>>> cv2_show('thresh Otsu', thresh)
>>>>
>>>> # Remove noise dots with a small morphological opening
>>>> opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, np.ones((2, 2), np.uint8))
>>>> cv2_show('opening', opening)
>>>>
>>>> thresh = opening.copy()
>>>> mask = np.zeros_like(org_image, dtype=np.uint8)
>>>>
>>>> # Extract horizontal lines
>>>> horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (60, 1))
>>>> remove_horizontal = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2)
>>>> cnts = cv2.findContours(remove_horizontal, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
>>>> cnts = cnts[0] if len(cnts) == 2 else cnts[1]
>>>> for c in cnts:
>>>>     cv2.drawContours(mask, [c], -1, (255, 255, 255), 8)
>>>> # cv2_show('mask extract horizontal lines', mask)
>>>>
>>>> # Extract vertical lines
>>>> vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 70))
>>>> remove_vertical = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, vertical_kernel, iterations=2)
>>>> cnts = cv2.findContours(remove_vertical, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
>>>> cnts = cnts[0] if len(cnts) == 2 else cnts[1]
>>>> for c in cnts:
>>>>     cv2.drawContours(mask, [c], -1, (255, 255, 255), 8)
>>>>
>>>> cv2_show('mask extract lines', mask)
>>>>
>>>> # Paint the masked (line) pixels white in the original image
>>>> # (this loop could be vectorized: result[np.all(mask == 255, axis=2)] = 255)
>>>> result = org_image.copy()
>>>> for y in range(mask.shape[0]):
>>>>     for x in range(mask.shape[1]):
>>>>         if np.all(mask[y, x] == 255):  # pixel is white in the mask
>>>>             result[y, x] = [255, 255, 255]  # set it to white
>>>>
>>>> cv2_show("result", result)
>>>>
>>>> gray = cv2.cvtColor(result, cv2.COLOR_BGR2GRAY)
>>>> _, simple_thresh = cv2.threshold(gray, 195, 255, cv2.THRESH_BINARY)
>>>> cv2_show('simple_thresh', simple_thresh)
>>>>
>>>> In the above code you can ignore the cv2_show function, since it is just my custom method for showing images. You can see that the idea is to remove some noise, remove the lines, and then apply a simple threshold.
>>>>
>>>> [image: extracted_lines.png]
>>>>
>>>> [image: removed_lines.png]
>>>>
>>>> [image: ready_for_locating_text_box.png]
>>>>
>>>> I would say that, from this point, the AUTO_OSD PSM mode of Tesseract can also give the text boxes for the above picture; one also needs to check the RIL mode (maybe RIL.WORD or RIL.TEXTLINE) to get the right level of text boxes.
>>>> In my opinion, the same preprocessing methods can only be applied to a certain group of samples; it is in fact very hard to cover all the cases. For example:
>>>>
>>>> [image: black_background.png]
>>>>
>>>> I find it difficult to locate the text boxes where the text is white and the background is dark. Black text on a white background is easy to locate and then OCR, but I am not sure what would be a good method to locate that white text on dark background colors. I hope to hear your suggestions, as well as others', on this matter.
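>>>> One direction I am considering (just a rough sketch, untested on this sample): binarize each candidate region with Otsu and invert any crop whose background comes out dark, so everything is black-on-white before it reaches Tesseract:
>>>>
>>>> import cv2
>>>> import numpy as np
>>>>
>>>> def normalize_polarity(crop_bgr):
>>>>     """Return a binarized crop with dark text on a white background."""
>>>>     gray = cv2.cvtColor(crop_bgr, cv2.COLOR_BGR2GRAY)
>>>>     _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
>>>>     # if most pixels are black, the background is dark, so invert
>>>>     if np.mean(binary) < 127:
>>>>         binary = cv2.bitwise_not(binary)
>>>>     return binary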
>>>> Regards
>>>> Hai
>>>>
>>>> On Wednesday, September 6, 2023 at 12:32:56 AM UTC+9 apism...@gmail.com wrote:
>>>>
>>>>> Hai, could you please tell me what you are doing for pre-processing? Do you have any source code you can share?
>>>>> Are those results consistently better for images scanned at different quality levels (resolution, angle, contrast, etc.)?
>>>>>
>>>>> On Monday, September 4, 2023 at 2:02:27 AM UTC-6 nguyenng...@gmail.com wrote:
>>>>>
>>>>>> Hi,
>>>>>> I would like to hear others' opinions on your questions too.
>>>>>> In my case, when I tried using Tesseract on Japanese train tickets, I had to do a lot of preprocessing steps (removing background colors, noise and line removal, increasing contrast, etc.) to get satisfactory results.
>>>>>> I am sure what you are doing (locating text boxes, extracting them, and feeding them one by one to tesseract) can get more accurate results. However, as the number of text boxes increases, it will undoubtedly hurt performance.
>>>>>> Could you share the PSM mode you use for getting those text boxes' locations? I usually use AUTO_OSD to get the boxes and expand them a bit at the edges before passing them to Tesseract.
>>>>>>
>>>>>> Regards
>>>>>> Hai
>>>>>>
>>>>>> On Saturday, September 2, 2023 at 7:03:49 AM UTC+9 apism...@gmail.com wrote:
>>>>>>
>>>>>>> I'm looking into OCR for ID cards and driver's licenses, and I found that tesseract performs relatively poorly on ID cards compared to other OCR solutions. For this original image: https://github.com/apismensky/ocr_id/blob/main/images/boxes_easy/AR.png the results are:
>>>>>>>
>>>>>>> tesseract: "4d DL 999 as = Ne allo) 2NICK © , q 12 RESTR oe } lick: 5 DD 8888888888 1234 SZ"
>>>>>>> easyocr: '''9 , ARKANSAS DRIVER'S LICENSE CLAss D 4d DLN 999999999 3 DOB 03/05/1960 ] 2 SCKPLE 123 NORTH STREET CITY AR 12345 ISS 4b EXP 03/05/2018 03/05/2026 15 SEX 16 HGT 18 EYES 5'-10" BRO 9a END NONE 12 RESTR NONE Ylck Sorble DD 8888888888 1234 THE'''
>>>>>>> google cloud vision: """SARKANSAS\nSAMPLE\nSTATE O\n9 CLASS D\n4d DLN 9999999993 DOB 03/05/1960\nNick Sample\nDRIVER'S LICENSE\n1 SAMPLE\n2 NICK\n8 123 NORTH STREET\nCITY, AR 12345\n4a ISS\n03/05/2018\n15 SEX 16 HGT\nM\n5'-10\"\nGREAT SE\n9a END NONE\n12 RESTR NONE\n5 DD 8888888888 1234\n4b EXP\n03/05/2026 MS60\n18 EYES\nBRO\nRKANSAS\n0"""
>>>>>>>
>>>>>>> and word accuracy is:
>>>>>>>
>>>>>>>         tesseract | easyocr | google
>>>>>>> words      10.34% |  68.97% | 82.76%
>>>>>>>
>>>>>>> This is "out of the box" performance, without any preprocessing. I'm not surprised that google vision is that much better than the others, but easyocr, which is another open-source solution, performs much better than tesseract in this case. I have a whole project dedicated to this, and all the other results are also much better for easyocr: https://github.com/apismensky/ocr_id/blob/main/result.json. All input files are in https://github.com/apismensky/ocr_id/tree/main/images/sources
>>>>>>> After digging into it for a little bit, I suspect that bounding box detection is much better in google (https://github.com/apismensky/ocr_id/blob/main/images/boxes_google/AR.png) and easyocr (https://github.com/apismensky/ocr_id/blob/main/images/boxes_easy/AR.png) than in tesseract (https://github.com/apismensky/ocr_id/blob/main/images/boxes_tesseract/AR.png).
>>>>>>> I'm pretty sure about this, because when I manually cut out the text boxes and feed them to tesseract it works much better.
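>>>>>>> (For anyone who wants to reproduce the tesseract box image: the detected boxes can be dumped and drawn with something like this. A minimal sketch, assuming pytesseract and OpenCV; not the exact code from the repo, and the output file name is made up:)
>>>>>>>
>>>>>>> import cv2
>>>>>>> import pytesseract
>>>>>>> from pytesseract import Output
>>>>>>>
>>>>>>> img = cv2.imread('images/sources/AR.png')
>>>>>>> data = pytesseract.image_to_data(img, output_type=Output.DICT)
>>>>>>> for i, text in enumerate(data['text']):
>>>>>>>     if text.strip():  # skip empty detections
>>>>>>>         x, y, w, h = (data[k][i] for k in ('left', 'top', 'width', 'height'))
>>>>>>>         cv2.rectangle(img, (x, y), (x + w, y + h), (0, 0, 255), 2)
>>>>>>> cv2.imwrite('boxes_tesseract_AR.png', img)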
>>>>>>> Now, the questions:
>>>>>>>
>>>>>>> - What part of the tesseract codebase is responsible for text detection, and which algorithm does it use?
>>>>>>> - What impacts bounding box detection in tesseract so that it fails on these types of images (complex layouts, background noise, etc.)?
>>>>>>> - Is it possible to use the same text detection procedure as easyocr, or to improve the existing one?
>>>>>>> - Maybe it is possible to switch the text detection algorithm based on the image type, or to make it pluggable so that the user can configure one of several options A,B,C...
>>>>>>>
>>>>>>> Thanks.