Re: [tesseract-ocr] Re: Extraction of two different language text from single image using tesseract

Devarti Mahakalkar Wed, 08 Dec 2021 04:25:46 -0800

Hi Pankaj, 

Could you please share your approach for using more than one language in 
tesseract with good accuracy, if you found one?


Thank you!
On Wednesday, 19 August 2020 at 18:03:29 UTC+1 Pankaj Gupta wrote:

> Hi Shree,
>
> Thank you for your suggestion. The suggested method improves the pass 
> percentage of the test cases, but the extraction of mixed-language text is 
> still not consistent. Sometimes tesseract extracts the characters 
> correctly, but not every time. For example, in one scenario it detects 
> English letters that come at the start of the text, but in the next text 
> the English letters at the end are not extracted properly.
>
> We have identified one more problem: a few of the images contain numbers 
> in superscript, and when we apply OCR those superscript numbers are not 
> extracted.
>
> Please suggest.
> On Wednesday, August 19, 2020 at 1:40:14 PM UTC+5:30 shree wrote:
>
>> For multiple languages, the standard invocation is to join the language 
>> codes with a + sign,
>>
>> e.g. -l ara+eng or -l eng+jpn 
>>
>> Alternatively, you can try the script traineddata files, e.g. Devanagari 
>> includes eng+hin+san+mar+nep.
>>
>> However, multi-language recognition takes more time and is not 
>> perfect.
>>
>> On Wed, Aug 19, 2020, 13:20 Pankaj Gupta <pan...@gaurishiv.org> wrote:
>>
>>> Dear Team,
>>>
>>> Waiting for your suggestions.  Need your help.
>>>
>>> Thank you in advance.
>>>
>>> Regards,
>>> Pankaj
>>>
>>> On Friday, August 14, 2020 at 12:45:05 AM UTC+5:30 Pankaj Gupta wrote:
>>>
>>>> Dear Team,
>>>>
>>>> My team and I are developing a tool that extracts text from given 
>>>> images (each containing text in a single language) using tesseract. The 
>>>> tool is able to extract text in 14 different languages with an 
>>>> accuracy greater than 95%.
>>>>
>>>> We now face a new challenge: some images contain text in more than one 
>>>> language (Japanese-English or Arabic-English). Due to copyright issues, 
>>>> I am not able to attach the original image. A sample image containing 
>>>> text in Japanese and English, depicting the actual scenario, is attached 
>>>> to this thread. We request your support in identifying a technique to 
>>>> extract the text accurately in both languages.
>>>>
>>>> I am using Python 3+, OpenCV, and tesseract for development.
>>>>
>>>> Thanks in advance.
>>>>
>>>> Regards,
>>>> Pankaj Gupta
>>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to tesseract-oc...@googlegroups.com.
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/tesseract-ocr/cc03edb3-b96b-477f-9b31-fe7e4a0ccb4cn%40googlegroups.com
>>>  
>>> <https://groups.google.com/d/msgid/tesseract-ocr/cc03edb3-b96b-477f-9b31-fe7e4a0ccb4cn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>>
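
For anyone reading this thread later, the "+" invocation suggested above 
(-l ara+eng or -l eng+jpn) can be sketched in Python roughly as follows. 
This is only a minimal sketch, not the setup used by anyone in the thread: 
it assumes the tesseract executable is on PATH and that the traineddata 
files for the listed languages are installed. The helper names 
(build_lang_arg, build_command, ocr_image) and the file name sample.png are 
made up for illustration.

```python
import shutil
import subprocess
from typing import List, Sequence


def build_lang_arg(langs: Sequence[str]) -> str:
    """Join language codes with '+', e.g. ['eng', 'jpn'] -> 'eng+jpn'."""
    cleaned = [code.strip() for code in langs if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def build_command(image_path: str, langs: Sequence[str]) -> List[str]:
    """Build the tesseract CLI call; 'stdout' as output base prints the text."""
    return ["tesseract", image_path, "stdout", "-l", build_lang_arg(langs)]


def ocr_image(image_path: str, langs: Sequence[str]) -> str:
    """Run tesseract on one image and return the recognized text.

    Assumes tesseract and the requested traineddata files are installed.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_command(image_path, langs),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical file name; only shows the command that would be run.
    print(build_command("sample.png", ["eng", "jpn"]))
```

A script model such as Devanagari could presumably be passed the same way 
in place of the language codes, assuming that traineddata file is installed 
where tesseract looks for it. As noted in the thread, combining languages 
is slower and not perfect, so it is worth comparing the output for each 
combination on your own images.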

