Is there something like tableExtractionDemo.cpp but for Python? I am unable to understand or replicate the C++ demo for the problem I am working on.
Thank you in advance!

On Mon, Mar 22, 2021 at 2:26 AM Shree Devi Kumar <shreesh...@gmail.com> wrote:

> Please see the newly added table detector to the master branch
>
> https://github.com/tesseract-ocr/tesseract/pull/3330
>
> On Mon, Mar 22, 2021, 10:53 Daniel Lu <danielchen...@gmail.com> wrote:
>
>> Hi,
>>
>> I am trying to read hundreds of pages of information like the picture
>> below into a CSV file. For us humans, it is very clear where the
>> information should go in each of the four columns. But I am trying to use
>> tesseract to do this!
>>
>> This is my code right now:
>>
>> ```{python}
>> import cv2
>> import pytesseract
>> import xlsxwriter
>> import re
>>
>> img = cv2.imread("*image file path")
>> pytesseract.pytesseract.tesseract_cmd = r"*tesseract location"
>>
>> # Initialize the workbook
>> workbook = xlsxwriter.Workbook('result.xlsx')
>> worksheet = workbook.add_worksheet()
>>
>> # Convert to the gray-scale
>> gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
>>
>> # Threshold
>> thr = cv2.threshold(gry, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
>>
>> # OCR
>> txt = pytesseract.image_to_string(thr, config="--psm 11")
>>
>> # Add ocr to the corresponding part
>> txt = txt.split("\n")
>>
>> row = 0
>> col = 0
>>
>> for txt1 in txt:
>>     # Skip over OCR strings that are just spaces or ''
>>     if txt1.isspace() or txt1 == '':
>>         continue
>>
>>     # Hard code detection
>>     # ...let's just place it into the last column for now
>>
>>     # Theoretically, the state ("Alaska" in this case) will be in column 0
>>     # in the same row
>>     if re.match(r"\d*\sOpen\sRestaurants", txt1):
>>         col = 3  # force this entry into the last column
>>
>>     worksheet.write(row // 4, col % 4, txt1)
>>     col += 1
>>     row += 1
>>
>> workbook.close()
>> ```
>>
>> However, there are still a lot of misalignments, especially when some
>> addresses or names take more than one line. Additionally, why is the text
>> on the first line read in a different order compared to the rest of the
>> rows?
>>
>> I was thinking that perhaps I could enforce that every fourth txt is in
>> alphabetical order and use that to detect misalignment? But if even the
>> first row is incorrect, I'm not sure how much I want to hard code
>> corrections. Additionally, the multi-line entries sometimes arise from
>> the address column and other times from the name column (e.g.
>> 258 Interstate Commercial Park Loop on the left-hand side of the page).
>>
>> Below are some screenshots of mixups on the left and right.
>>
>> Any help would be greatly appreciated! Thank you!
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-ocr+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/fbdeeed7-87b6-4e8c-9cf9-d91e0d84f04an%40googlegroups.com
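For the misalignment problem described in the quoted message, one possible direction is to work from word positions instead of the flat text that `image_to_string` returns. This is not a Python port of tableExtractionDemo.cpp, only a rough sketch of a different approach. It assumes the word boxes come from `pytesseract.image_to_data(thr, output_type=pytesseract.Output.DICT)`, which returns parallel lists under keys such as `left`, `top` and `text`. The helper names, the column start positions, the row tolerance and the choice of "key" column are all hypothetical and would have to be tuned to the actual pages. The sample dict is made-up data so the sketch runs without Tesseract installed.

```python
from bisect import bisect_right


def words_from_ocr_data(data):
    """Extract (left, top, text) for every non-empty entry of an image_to_data-style dict."""
    return [
        (left, top, text.strip())
        for left, top, text in zip(data["left"], data["top"], data["text"])
        if text.strip()
    ]


def column_for(left, column_starts):
    """Index of the last column whose start x-coordinate is <= left."""
    return max(0, bisect_right(column_starts, left) - 1)


def build_rows(words, column_starts, row_tolerance=10):
    """Group words into text lines by y position, then into cells by x position."""
    lines = []  # list of (top of first word in the line, [words])
    for word in sorted(words, key=lambda w: w[1]):
        if lines and abs(word[1] - lines[-1][0]) <= row_tolerance:
            lines[-1][1].append(word)
        else:
            lines.append((word[1], [word]))

    rows = []
    for _, line_words in lines:
        cells = [[] for _ in column_starts]
        for left, _, text in sorted(line_words):  # left-to-right within a line
            cells[column_for(left, column_starts)].append(text)
        rows.append([" ".join(cell) for cell in cells])
    return rows


def merge_continuations(rows, key_col):
    """Fold a line into the previous row when its key column is empty.

    Assumption: the key column is filled on the first line of every record,
    so a line without it is the wrapped tail of a name or address.
    """
    merged = []
    for row in rows:
        if merged and not row[key_col]:
            prev = merged[-1]
            for i, cell in enumerate(row):
                if cell:
                    prev[i] = (prev[i] + " " + cell).strip()
        else:
            merged.append(list(row))
    return merged


# Made-up stand-in for the dict returned by pytesseract.image_to_data(...).
sample = {
    "left": [0, 10, 60, 200, 240, 400, 200, 290, 10, 55, 200, 225, 400],
    "top": [0, 100, 102, 101, 100, 100, 120, 121, 150, 150, 151, 150, 150],
    "text": ["", "Joe's", "Diner", "258", "Interstate", "555-0100",
             "Commercial", "Park", "Main", "Cafe", "12", "Elm", "555-0199"],
}

column_starts = [0, 180, 380]  # hypothetical x positions where each column begins
table = merge_continuations(
    build_rows(words_from_ocr_data(sample), column_starts), key_col=2
)
for record in table:
    print(record)
```

Because each word is placed by its own coordinates, a wrapped address or name no longer shifts every later cell by one position, which is what happens when entries are counted off four at a time. Whether fixed column starts are good enough depends on how consistent the page layout is across the scans.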