Hi Pankaj, 

Could you please share your approach for recognizing more than one language 
in Tesseract with good accuracy, if you found one?

Thank you!
On Wednesday, 19 August 2020 at 18:03:29 UTC+1 Pankaj Gupta wrote:

> Hi Shree,
>
> Thank you for your suggestion. The suggested method improves the pass 
> percentage of our test cases, but the extraction of mixed-language text 
> is still inconsistent. Sometimes Tesseract extracts the characters 
> correctly, but not every time. For example, in one scenario it detects 
> the English letters at the start of the text, but in the next text the 
> English letters at the end are not extracted properly.
>
> We have identified one more problem: a few of the images contain numbers 
> in superscript, and when OCR is applied, the superscript numbers are not 
> extracted.
>
> Please suggest.
> On Wednesday, August 19, 2020 at 1:40:14 PM UTC+5:30 shree wrote:
>
>> For multiple languages, the standard invocation is to join the language 
>> codes with a + sign. 
>>
>> E.g. -l ara+eng or -l eng+jpn 
>>
>> Alternatively, you can try the script traineddata files, e.g. Devanagari 
>> includes eng+hin+san+mar+nep.
>>
>> However, multi-language recognition takes more time and is not 
>> perfect.
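
A minimal sketch of the -l invocation described above, in Python since the original post mentions Python 3. The helper names build_lang_arg, build_command, and run_ocr are hypothetical, and the sketch assumes the tesseract executable is on PATH with the traineddata files for each requested language installed. It only builds and runs the command line; it does not address the accuracy problems raised in this thread.

```python
import shutil
import subprocess


def build_lang_arg(*codes):
    """Join language codes with '+', the form tesseract's -l option expects."""
    cleaned = [code.strip() for code in codes if code and code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def build_command(image_path, langs, output_base="stdout"):
    """Build the argument list: tesseract IMAGE OUTPUTBASE -l LANG[+LANG...]."""
    return ["tesseract", image_path, output_base, "-l", build_lang_arg(*langs)]


def run_ocr(image_path, langs):
    """Run tesseract and return the recognized text (hypothetical helper)."""
    # Assumption: the tesseract binary is installed and on PATH.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_command(image_path, langs),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, run_ocr("sample.png", ["jpn", "eng"]) would run "tesseract sample.png stdout -l jpn+eng". If pytesseract is used instead, the same joined string can be passed as the lang argument of pytesseract.image_to_string.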
>>
>> On Wed, Aug 19, 2020, 13:20 Pankaj Gupta <pan...@gaurishiv.org> wrote:
>>
>>> Dear Team,
>>>
>>> I am waiting for your suggestions and need your help.
>>>
>>> Thank you in advance.
>>>
>>> Regards,
>>> Pankaj
>>>
>>> On Friday, August 14, 2020 at 12:45:05 AM UTC+5:30 Pankaj Gupta wrote:
>>>
>>>> Dear Team,
>>>>
>>>> My team and I are developing a tool that extracts text from given 
>>>> images (each containing text in a single language) using Tesseract. The 
>>>> tool is able to extract text in 14 different languages with an 
>>>> accuracy greater than 95%.
>>>>
>>>> We now face a new challenge: some images contain text in more than one 
>>>> language (Japanese-English or Arabic-English). Due to copyright issues, 
>>>> I am not able to attach the original image. A sample image containing 
>>>> text in Japanese and English, depicting the actual scenario, is attached 
>>>> to this thread. I request your support in identifying a technique to 
>>>> extract the text accurately in both languages.
>>>>
>>>> I am using Python 3+, OpenCV, and Tesseract for development.
>>>>
>>>> Thanks in advance.
>>>>
>>>> Regards,
>>>> Pankaj Gupta
>>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to tesseract-oc...@googlegroups.com.
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/tesseract-ocr/cc03edb3-b96b-477f-9b31-fe7e4a0ccb4cn%40googlegroups.com
>>>  
>>> <https://groups.google.com/d/msgid/tesseract-ocr/cc03edb3-b96b-477f-9b31-fe7e4a0ccb4cn%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>>
