Is there something like tableExtractionDemo.cpp but for Python? I am unable
to understand or replicate the C++ demo for the problem I am working on.

Thank you in advance!

On Mon, Mar 22, 2021 at 2:26 AM Shree Devi Kumar <shreesh...@gmail.com>
wrote:

> Please see the newly added table detector to the master branch
>
> https://github.com/tesseract-ocr/tesseract/pull/3330
>
> On Mon, Mar 22, 2021, 10:53 Daniel Lu <danielchen...@gmail.com> wrote:
>
>> Hi,
>>
>> I am trying to read hundreds of pages of information like the picture
>> below into a CSV file. For us humans, it is very clear where the
>> information should go in each of the four columns. But I am trying to use
>> tesseract to do this!
>>
>> This is my code right now:
>>
>> ```{python}
>> import cv2
>> import pytesseract
>> import xlsxwriter
>> import re
>>
>> img = cv2.imread("*image file path")
>> pytesseract.pytesseract.tesseract_cmd = r"*tesseract location"
>>
>> # Initialize the workbook
>> workbook = xlsxwriter.Workbook('result.xlsx')
>> worksheet = workbook.add_worksheet()
>>
>> # Convert to the gray-scale
>> gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
>>
>> # Threshold
>> thr = cv2.threshold(gry, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
>>
>> # OCR
>> txt = pytesseract.image_to_string(thr, config="--psm 11")
>>
>> # Add ocr to the corresponding part
>> txt = txt.split("\n")
>>
>>
>> row = 0
>> col = 0
>>
>> for txt1 in txt:
>>     # Skip over OCR strings that are just spaces or ''
>>     if txt1.isspace() or txt1 == '':
>>         continue
>>
>>     # Hard code detection
>>     # ...let's just place it into the last column for now
>>
>>     # Theoretically, the state ("Alaska" in this case) will be in column 0
>>     # in the same row
>>     if re.match(r"\d*\sOpen\sRestaurants", txt1):
>>         col = 3
>>
>>     worksheet.write(row//4, col%4, txt1)
>>     col += 1
>>     row += 1
>>
>> workbook.close()
>>
>> ```
>> However, there are still a lot of misalignments, especially when some
>> addresses or names take more than one line. Additionally, why is the text
>> on the first line read in a different order compared to the rest of the
>> rows?
>>
>> I was thinking that perhaps I could enforce that every fourth txt is in
>> alphabetical order and use that to detect misalignment? But if even the
>> first row is incorrect, I'm not sure how much I want to hard code
>> corrections. Additionally, sometimes the multi-line entries arise from
>> the address column, while other times they arise from the name column (e.g.
>> 258 Interstate Commercial Park Loop on the left-hand side of the page).
>>
>> Below are some screenshots of mixups on the left and right.
>>
>> Any help would be greatly appreciated! Thank you!
>>
>>
>>
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-ocr+unsubscr...@googlegroups.com.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/fbdeeed7-87b6-4e8c-9cf9-d91e0d84f04an%40googlegroups.com
>> <https://groups.google.com/d/msgid/tesseract-ocr/fbdeeed7-87b6-4e8c-9cf9-d91e0d84f04an%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>

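One possible Python-side approach to the misalignment described in the quoted message, offered as a minimal sketch and not as an equivalent of the C++ demo: instead of splitting the output of `image_to_string` on newlines, use the word positions that `pytesseract.image_to_data(thr, output_type=pytesseract.Output.DICT)` returns and assign each word to a column by its x-coordinate. The function name `words_to_rows`, the `column_starts` boundaries, and the `row_tolerance` value are hypothetical and would have to be measured on the real pages. The sample dictionary below is made up and only mimics the shape of the `image_to_data` output, so the sketch runs without Tesseract installed.

```python
from collections import defaultdict


def words_to_rows(data, column_starts, row_tolerance=10, min_conf=0):
    """Group OCR words into table rows and columns by pixel position.

    data: dict with "text", "left", "top" and "conf" lists, shaped like the
          result of pytesseract.image_to_data(..., output_type=Output.DICT).
    column_starts: assumed left x-coordinate (pixels) of each column.
    row_tolerance: assumed maximum vertical offset (pixels) within one row.
    """
    words = []
    for text, left, top, conf in zip(
        data["text"], data["left"], data["top"], data["conf"]
    ):
        # Skip empty entries (page/block/line records) and low-confidence words.
        if text.strip() and float(conf) >= min_conf:
            words.append((top, left, text.strip()))
    words.sort()  # top to bottom, then left to right

    rows = []  # each item: (top of first word in the row, {column: [(left, text)]})
    for top, left, text in words:
        # Start a new row when the word sits clearly below the current row.
        if not rows or top - rows[-1][0] > row_tolerance:
            rows.append((top, defaultdict(list)))
        # Pick the last column whose start lies at or left of the word.
        col = max(
            (i for i, start in enumerate(column_starts) if start <= left),
            default=0,
        )
        rows[-1][1][col].append((left, text))

    # Join the words of each cell in left-to-right order.
    return [
        [" ".join(t for _, t in sorted(cells[c])) for c in range(len(column_starts))]
        for _, cells in rows
    ]


# Made-up sample in the shape of image_to_data output (not real OCR results).
sample = {
    "text": ["", "Alaska", "Joe's", "Diner", "12", "Main", "St", "5",
             "Bob's", "7", "Oak", "Ave", "2"],
    "left": [0, 10, 110, 160, 310, 340, 390, 510, 112, 312, 335, 380, 512],
    "top": [0, 50, 52, 51, 50, 53, 50, 51, 90, 91, 90, 92, 90],
    "conf": [-1, 96, 91, 93, 95, 90, 94, 97, 92, 95, 93, 91, 96],
}

table = words_to_rows(sample, column_starts=[0, 100, 300, 500])
for row in table:
    print(row)
```

Each resulting row is a list with one string per column, so it can be passed to `csv.writer(...).writerow(row)` or written cell by cell with `worksheet.write`. A wrapped address or name would show up as a row with some empty cells, which is easier to detect and merge than a shifted position in a flat list of strings.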