Re: [tesseract-ocr] Detecting language automatically

Charles Cho Thu, 25 Mar 2021 11:04:50 -0700

Hi.

Thank you very much for your kind help, shree.
With your help I was able to detect the script, and it worked. Great.


I have some questions.
1. If a page contains text in several languages, is there any way to detect
all of them? At the moment it detects only one.
2. It reports English, German and French all as 'Latin'. How can I tell these
languages apart?

Thanks.
Best,
Charles.
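
On question 2 above: OSD reports only the script, so English, German and
French all come back as 'Latin'. Following Merlijn's tl;dr further down the
thread, the language is then guessed from the recognized text after OCR. The
sketch below is a toy stand-in for a real language detection library, not
what archive.org runs: it simply counts a few common function words per
language. The function name GuessLatinLanguage and the word lists are made up
for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Toy language guesser for Latin-script OCR output (illustrative only).
// Returns a Tesseract language code ("eng", "deu" or "fra"), or an empty
// string if none of the listed words occurs in the text.
std::string GuessLatinLanguage(const std::string& text) {
  // Hypothetical, deliberately tiny word lists; a real detector uses far
  // more evidence than this.
  static const std::map<std::string, std::set<std::string>> kStopwords = {
      {"eng", {"the", "and", "of", "is", "to", "with"}},
      {"deu", {"der", "die", "das", "und", "ist", "nicht"}},
      {"fra", {"le", "la", "les", "et", "est", "dans"}},
  };

  // Lowercase ASCII letters, replace other ASCII characters with spaces,
  // and keep non-ASCII bytes (UTF-8 sequences) untouched.
  std::string cleaned;
  for (unsigned char c : text) {
    if (c < 0x80 && std::isalpha(c)) {
      cleaned += static_cast<char>(std::tolower(c));
    } else if (c < 0x80) {
      cleaned += ' ';
    } else {
      cleaned += static_cast<char>(c);
    }
  }

  // Count how many words of the text appear in each language's list.
  std::map<std::string, int> hits;
  std::istringstream words(cleaned);
  std::string word;
  while (words >> word) {
    for (const auto& entry : kStopwords) {
      if (entry.second.count(word) > 0) {
        ++hits[entry.first];
      }
    }
  }

  // Pick the language with the most hits; on a tie the alphabetically
  // first code wins because std::map iterates in key order.
  std::string best;
  int best_hits = 0;
  for (const auto& entry : hits) {
    if (entry.second > best_hits) {
      best = entry.first;
      best_hits = entry.second;
    }
  }
  return best;
}
```

The input would be the text recognized with a Latin-script model. For pages
that mix languages (question 1), the same counts could be used to keep every
language above some threshold rather than only the best one; Tesseract
accepts several languages joined with '+', for example "eng+deu".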

On Thursday, March 25, 2021 at 9:49:10 PM UTC+8 shree wrote:

> See 
> https://github.com/tesseract-ocr/tessdoc/blob/master/examples/OSD_example.cc
>
> // Get OSD - new code
>     int orient_deg;
>     float orient_conf;
>     const char* script_name;
>     float script_conf;
>     api->DetectOrientationScript(&orient_deg, &orient_conf,
>                                  &script_name, &script_conf);
>     printf("************\n Orientation in degrees: %d\n"
>            " Orientation confidence: %.2f\n"
>            " Script: %s\n Script confidence: %.2f\n",
>            orient_deg, orient_conf,
>            script_name, script_conf);
>
> On Thursday, March 25, 2021 at 2:11:42 PM UTC+5:30 charles...@gmail.com 
> wrote:
>
>> Hi,
>>
>> I have been looking into detecting the language automatically.
>> I referred to these links. Thank you, Merlijn.
>> https://archive.org/services/docs/api/ocr.html#autonomous-mode
>> https://git.archive.org/www/tesseract/-/blob/master/main.py#L757
>>
>> As far as I can tell, it uses Tesseract's OSD to detect the layout and
>> script.
>> After detecting the script, it detects the languages within that script.
>>
>> So I tried to use the OSD engine mode in textfairy, which is an Android OCR
>> app based on Tesseract 4.1.1.
>> But it doesn't work, and I'm not sure how to use the OSD engine mode
>> on Android.
>> I set 'osd' as the language option string, used osd.traineddata, and set
>> 'OEM_OSD_ONLY' as the engine mode.
>> But it doesn't work.
>>
>> I hope someone can help me use the OSD engine mode on Android.
>>
>> Thank you.
>> Best,
>> Charles.
>>
>> On Monday, March 22, 2021 at 10:28:38 AM UTC+8 Charles Cho wrote:
>>
>>> Hi, Merlijn.
>>>
>>> Thanks for your kind response.
>>>
>>> Regarding autonomous mode, I have been looking for such a module for
>>> Android, but have found nothing so far. I will keep looking.
>>>
>>> >I am not sure what you're finding on google play store, but I have found
>>> >there to be no limitation to the amount of languages that can be used
>>> >during OCR. Keep in mind that using more languages will slow down the
>>> >OCR process.
>>> It's textfairy, an open source app.
>>> https://play.google.com/store/apps/details?id=com.renard.ocr
>>>
>>> Your response is really helpful.
>>>
>>> Best,
>>> Charles.
>>> On Sunday, March 21, 2021 at 8:29:13 AM UTC+8 Merlijn Wajer wrote:
>>>
>>>> Hi, 
>>>>
>>>> On 19/03/2021 10:11, Charles Cho wrote: 
>>>> > Hello, 
>>>> > I'm working on an OCR Android app based on Tesseract.
>>>> > I want to add a feature that detects the language automatically and
>>>> > recognizes at least 2 languages at once.
>>>> > I have investigated this for a while, so I know that I have to
>>>> > specify the language for Tesseract.
>>>> > Then how can I implement automatic language detection?
>>>>
>>>> Not exactly a mobile use case, but you can read how the Internet Archive
>>>> does this (I coined it "autonomous mode", where the software just
>>>> figures out the scripts and languages):
>>>>
>>>> https://archive.org/services/docs/api/ocr.html#autonomous-mode 
>>>>
>>>> And the code is available, here (I plan to split out the archive.org 
>>>> specific code from the python code that invokes Tesseract and performs 
>>>> heuristics like script detection): 
>>>>
>>>> https://git.archive.org/www/tesseract/-/blob/master/main.py#L757 
>>>>
>>>> the tl;dr is to first perform script detection, and use the detected 
>>>> script to OCR the page - then use language detection libraries to guess 
>>>> the languages on the page. 
>>>>
>>>> > And tesseract on google play store can recognize 3 languages at once. 
>>>> > Is it maximum? 
>>>>
>>>> I am not sure what you're finding on google play store, but I have found
>>>> there to be no limitation to the amount of languages that can be used
>>>> during OCR. Keep in mind that using more languages will slow down the
>>>> OCR process.
>>>>
>>>> > Any help and advice would be really appreciated. 
>>>>
>>>> Hope this helps. 
>>>>
>>>> Cheers, 
>>>> Merlijn 
>>>>
>>>
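
Merlijn's tl;dr (detect the script first, then OCR the page with that script)
needs a small mapping step between the name that DetectOrientationScript()
reports and the model that gets loaded. A minimal sketch in C++, assuming the
script models from the tessdata repositories are installed in a script/
subdirectory of the tessdata path. The helper name ScriptToModel, the Latin
fallback, and the choice of HanS for "Han" are illustrative assumptions, not
part of the Tesseract API.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical helper (not part of Tesseract): turn the script name reported
// by OSD into a language string for a script-level model.
// Assumption: models live in <tessdata>/script/, e.g. script/Latin.traineddata.
std::string ScriptToModel(const std::string& osd_script) {
  // OSD names for which there is no model of the same name. Picking HanS
  // (Simplified) over HanT (Traditional) is an arbitrary choice here.
  static const std::unordered_map<std::string, std::string> kRenames = {
      {"Han", "HanS"},
      {"Korean", "Hangul"},
  };

  if (osd_script.empty()) {
    return "script/Latin";  // assumed fallback when OSD reports nothing
  }

  const auto it = kRenames.find(osd_script);
  const std::string& model = (it != kRenames.end()) ? it->second : osd_script;
  return "script/" + model;
}
```

The returned string would be passed as the language argument when
initializing the API for the actual OCR pass. Whether a given script model is
present depends on which traineddata files were installed, so a real app
should check that the file exists before using it.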

