Re: [tesseract-ocr] Detecting language automatically

Merlijn B.W. Wajer Sat, 20 Mar 2021 17:29:13 -0700

Hi,

On 19/03/2021 10:11, Charles Cho wrote:
> Hello,
> I'm working on an OCR Android app based on Tesseract.
> I want to add a feature that detects the language automatically and
> recognizes at least 2 languages at once.
> I have investigated this for a while, so I know that I have to specify
> the language for Tesseract.
> How can I implement automatic language detection?


Not exactly a mobile use case, but you can read how the Internet Archive
does this (I coined it "autonomous mode", where the software just
figures out the scripts and languages):

https://archive.org/services/docs/api/ocr.html#autonomous-mode

The code is available here (I plan to split the archive.org-specific
code out from the Python code that invokes Tesseract and performs
heuristics like script detection):

https://git.archive.org/www/tesseract/-/blob/master/main.py#L757

The tl;dr is to first perform script detection, then use the detected
script to OCR the page, and finally run a language detection library on
the OCR output to guess the languages on the page.
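
To make those three steps concrete, here is a minimal Python sketch. It is
not the Internet Archive code linked above, only an illustration under some
assumptions: the `tesseract` binary is on the PATH, `osd.traineddata` and
the `script/*` models are installed, and `guess_languages` is a
hypothetical callable that you supply (for example, a thin wrapper around
whatever language detection library you prefer). The function names are
made up for this example.

```python
import re
import subprocess
from typing import Callable, List, Optional, Tuple


def parse_osd(osd_text: str) -> Optional[str]:
    """Extract the script name from Tesseract's `--psm 0` (OSD) output."""
    # The OSD report has a line such as "Script: Latin"; the separate
    # "Script confidence: ..." line does not match this pattern.
    match = re.search(r"^Script:\s*(\S+)", osd_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def detect_script(image_path: str) -> Optional[str]:
    """Step 1: run orientation and script detection only."""
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "--psm", "0"],
        capture_output=True, text=True, check=True,
    )
    return parse_osd(result.stdout)


def build_ocr_command(image_path: str, script: str) -> List[str]:
    """Step 2: OCR with the script-level model, e.g. script/Latin."""
    # Assumption: the OSD script name matches a model file name. Some
    # scripts (Han, for instance) may need to be mapped to a specific
    # model by hand.
    return ["tesseract", image_path, "stdout", "-l", f"script/{script}"]


def ocr_autonomous(
    image_path: str,
    guess_languages: Callable[[str], List[str]],
    fallback_script: str = "Latin",
) -> Tuple[str, List[str]]:
    """Run script detection, OCR, then language guessing on the text."""
    script = detect_script(image_path) or fallback_script
    text = subprocess.run(
        build_ocr_command(image_path, script),
        capture_output=True, text=True, check=True,
    ).stdout
    # Step 3: guess the languages from the recognized text.
    return text, guess_languages(text)
```

If the guessed languages have dedicated models, a second OCR pass with
those models may improve the result, at the cost of extra processing
time.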

> And Tesseract on the Google Play Store can recognize 3 languages at once.
> Is that the maximum?

I am not sure what you're finding on the Google Play Store, but I have
found no limit on the number of languages that can be used during OCR.
Keep in mind that each additional language slows down the OCR process.
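
For illustration, Tesseract accepts several languages in one `-l`
argument, joined with `+`. The helper below is a hypothetical sketch (the
function name is mine) and assumes the matching traineddata files are
installed:

```python
from typing import Iterable, List


def build_multilang_command(image_path: str, languages: Iterable[str]) -> List[str]:
    """Build a tesseract command line that OCRs with several languages."""
    # Drop duplicates while keeping the caller's order.
    langs = list(dict.fromkeys(languages))
    if not langs:
        raise ValueError("at least one language is required")
    # Tesseract takes multiple languages as e.g. "-l eng+deu+fra".
    return ["tesseract", image_path, "stdout", "-l", "+".join(langs)]
```

The resulting list can be passed to `subprocess.run`. Every language you
add means another model to load and evaluate, so it is worth keeping the
list as short as the material allows.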

> Any help and advice would be really appreciated.

Hope this helps.

Cheers,
Merlijn
