Re: [tesseract-ocr] Detecting language automatically

2021-03-21 Thread Charles Cho
Hi, Merlijn.

Thanks for your kind response.

Regarding autonomous mode, I'm trying to find such a module for Android,
but I have found nothing so far. I will keep looking.

>I am not sure what you're finding on google play store, but I have found
>there to be no limitation to the amount of languages that can be used
>during OCR. Keep in mind that using more languages will slow down the
>OCR process.
It's textfairy, an open-source app:
https://play.google.com/store/apps/details?id=com.renard.ocr

Your response is really helpful.

Best,
Charles.
On Sunday, March 21, 2021 at 8:29:13 AM UTC+8 Merlijn Wajer wrote:

> Hi,
>
> On 19/03/2021 10:11, Charles Cho wrote:
> > Hello,
> > I'm working on an OCR Android app based on Tesseract.
> > I want to add a feature that detects the language automatically and
> > recognizes at least 2 languages at once.
> > I have looked into this for a while, so I know that I have to specify
> > the language for Tesseract.
> > How, then, can I implement automatic language detection?
>
> Not exactly a mobile use case, but you can read how the Internet Archive
> does this (I coined it "autonomous mode", where the software just
> figures out the scripts and languages):
>
> https://archive.org/services/docs/api/ocr.html#autonomous-mode
>
> And the code is available, here (I plan to split out the archive.org
> specific code from the python code that invokes Tesseract and performs
> heuristics like script detection):
>
> https://git.archive.org/www/tesseract/-/blob/master/main.py#L757
>
> the tl;dr is to first perform script detection, and use the detected
> script to OCR the page - then use language detection libraries to guess
> the languages on the page.
>
> > Also, Tesseract on the Google Play Store can recognize 3 languages at once.
> > Is that the maximum?
>
> I am not sure what you're finding on google play store, but I have found
> there to be no limitation to the amount of languages that can be used
> during OCR. Keep in mind that using more languages will slow down the
> OCR process.
>
> > Any help and advice would be really appreciated.
>
> Hope this helps.
>
> Cheers,
> Merlijn
>
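
The tl;dr above (script detection first, OCR with the detected script, then
language detection on the resulting text) can be sketched in Python roughly
as follows. This is a minimal sketch, not the Internet Archive code: the
names `parse_osd_script` and `detect_and_ocr` are hypothetical, the three
steps are passed in as callables so the flow can be followed without
Tesseract installed, and the `script/<Name>` model naming assumes the script
traineddata files live in a `script/` subdirectory of tessdata.

```python
import re
from typing import Any, Callable, List, Optional, Tuple


def parse_osd_script(osd_text: str) -> Optional[str]:
    """Extract the script name (e.g. 'Latin') from Tesseract's OSD report.

    The report contains a line such as 'Script: Latin'; the separate
    'Script confidence: ...' line is deliberately not matched.
    """
    match = re.search(r"^Script:\s*(\S+)", osd_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def detect_and_ocr(
    image: Any,
    run_osd: Callable[[Any], str],
    run_ocr: Callable[[Any, str], str],
    guess_languages: Callable[[str], List[str]],
    fallback_lang: str = "eng",
) -> Tuple[str, List[str]]:
    """Run script detection, OCR with the script model, then language guessing."""
    # Step 1: script detection.
    script = parse_osd_script(run_osd(image))

    # Step 2: OCR with the detected script (assumed model name: script/<Name>),
    # falling back to a fixed language when no script was reported.
    lang = f"script/{script}" if script else fallback_lang
    text = run_ocr(image, lang)

    # Step 3: guess the languages from the recognized text.
    return text, guess_languages(text)
```

With pytesseract, `run_osd` could be `pytesseract.image_to_osd` and `run_ocr`
could be `lambda img, lang: pytesseract.image_to_string(img, lang=lang)`;
`guess_languages` could wrap a language detection library. Script names
reported by OSD may not always match a model name one to one, and language
detection libraries often return codes such as `en` where Tesseract expects
`eng`, so a mapping step is likely needed before passing several languages
joined with `+` (for example `lang="eng+deu"`).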

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/53bafa0d-88ce-4ff1-bc96-4e6b05cf5420n%40googlegroups.com.


Re: [tesseract-ocr] Properly Insert OCR Into Separate Columns

2021-03-21 Thread Shree Devi Kumar
Please see the table detector newly added to the master branch:

https://github.com/tesseract-ocr/tesseract/pull/3330

On Mon, Mar 22, 2021, 10:53 Daniel Lu wrote:

> Hi,
>
> I am trying to read hundreds of pages of information like the picture
> below into a CSV file. For us humans, it is very clear where the
> information should go in each of the four columns. But I am trying to use
> tesseract to do this!
>
> This is my code right now:
>
> ```{python}
> import cv2
> import pytesseract
> import xlsxwriter
> import re
>
> img = cv2.imread("*image file path")
> pytesseract.pytesseract.tesseract_cmd = r"*tesseract location"
>
> # Initialize the workbook
> workbook = xlsxwriter.Workbook('result.xlsx')
> worksheet = workbook.add_worksheet()
>
> # Convert to the gray-scale
> gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
>
> # Threshold
> thr = cv2.threshold(gry, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
>
> # OCR
> txt = pytesseract.image_to_string(thr, config="--psm 11")
>
> # Add ocr to the corresponding part
> txt = txt.split("\n")
>
>
> row = 0
> col = 0
>
> for txt1 in txt:
>     # Skip over OCR strings that are just spaces or ''
>     if txt1.isspace() or txt1 == '':
>         continue
>
>     # Hard code detection
>     # ...let's just place it into the last column for now
>
>     # Theoretically, the state ("Alaska" in this case) will be in column 0
>     # in the same row
>     if re.match(r"\d*\sOpen\sRestaurants", txt1):
>         col = 3  # force this entry into the last column
>
>     worksheet.write(row//4, col%4, txt1)
>     col += 1
>     row += 1
>
> workbook.close()
>
> ```
> However, there are still a lot of misalignments, especially when some
> addresses or names take more than one line. Additionally, why is the text
> on the first line read in a different order compared to the rest of the
> rows?
>
> I was thinking that perhaps I could enforce that every fourth txt is in
> alphabetical order and use that to detect misalignment? But if even the
> first row is incorrect, I'm not sure how much I want to hard code
> corrections. Additionally, sometimes the multi-line entries arise from
> the address column while other times they arise from the name column (e.g.
> 258 Interstate Commercial Park Loop on the left-hand side of the page).
>
> Below are some screenshots of mixups on the left and right.
>
> Any help would be greatly appreciated! Thank you!
>
>
>
>
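
One way to avoid guessing the column from the order of the OCR strings is to
use word positions instead. The sketch below is only an illustration under
stated assumptions: `words_to_rows` is a hypothetical helper, each word is a
dict with `text`, `left` and `top` (the same field names that
`pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT)` returns
as parallel lists), and `column_starts` holds made-up pixel positions for the
left edge of each column that would have to be measured on the real pages.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence


def words_to_rows(
    words: Sequence[Dict],
    column_starts: Sequence[int],
    line_tolerance: int = 10,
) -> List[List[str]]:
    """Group OCR words into table rows and columns by their pixel position.

    column_starts must be sorted ascending. A word goes into the last column
    whose left edge is at or before the word's own left edge. Words whose
    top coordinate is within line_tolerance pixels of the current row's
    first word are treated as being on the same row.
    """
    rows: List[Dict] = []
    for word in sorted(words, key=lambda w: (w["top"], w["left"])):
        if not word["text"].strip():
            continue  # skip empty or whitespace-only OCR entries

        # Pick the column from the horizontal position.
        col = max(bisect_right(column_starts, word["left"]) - 1, 0)

        # Start a new row when the word is clearly lower on the page.
        if not rows or abs(word["top"] - rows[-1]["top"]) > line_tolerance:
            rows.append({"top": word["top"],
                         "cells": [[] for _ in column_starts]})
        rows[-1]["cells"][col].append(word)

    # Join the words of each cell in left-to-right order.
    return [
        [" ".join(w["text"] for w in sorted(cell, key=lambda w: w["left"]))
         for cell in row["cells"]]
        for row in rows
    ]
```

Each returned row can then be written with `worksheet.write(r, c, value)`.
A wrapped address or name would show up as a row in which some columns are
empty, which may be easier to detect and merge into the previous row than a
shifted sequence of strings.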
