Hi Pankaj,

Could you please share your approach for using more than one language in Tesseract with good accuracy, if you found one?
Thank you!

On Wednesday, 19 August 2020 at 18:03:29 UTC+1 Pankaj Gupta wrote:

> Hi Shree,
>
> Thank you for your suggestion. The suggested method improves the pass
> percentage of the test cases, but the extraction of mixed-language text
> is not consistent enough. Sometimes Tesseract extracts the characters
> correctly, but not every time. For example, in one scenario it detects
> English letters that come at the start of the text, but in the next text
> the English letters at the end are not extracted properly.
>
> We have identified one more problem: a few of the images contain numbers
> in superscript, and these superscript numbers are not extracted when OCR
> is applied.
>
> Please suggest.
>
> On Wednesday, August 19, 2020 at 1:40:14 PM UTC+5:30 shree wrote:
>
>> For multiple languages the standard invocation is to use the language
>> codes joined with a + sign.
>>
>> E.g. -l ara+eng or -l eng+jpn
>>
>> Alternatively, you can try the script traineddata files, e.g. Devanagari
>> includes eng+hin+san+mar+nep.
>>
>> However, recognition with multiple languages takes more time and is not
>> perfect.
>>
>> On Wed, Aug 19, 2020, 13:20 Pankaj Gupta <pan...@gaurishiv.org> wrote:
>>
>>> Dear Team,
>>>
>>> Waiting for your suggestions. Need your help.
>>>
>>> Thank you in advance.
>>>
>>> Regards,
>>> Pankaj
>>>
>>> On Friday, August 14, 2020 at 12:45:05 AM UTC+5:30 Pankaj Gupta wrote:
>>>
>>>> Dear Team,
>>>>
>>>> My team and I are developing a tool that extracts text from given
>>>> images (each containing text in a single language) using Tesseract.
>>>> The tool is able to extract text in 14 different languages with an
>>>> accuracy greater than 95%.
>>>>
>>>> We now face a new challenge: some images contain text in more than
>>>> one language (Japanese - English or Arabic - English). Due to
>>>> copyright issues, I am not able to attach the original image. A
>>>> sample image containing text in Japanese and English, depicting the
>>>> actual scenarios, is attached to this thread. I request your support
>>>> in identifying a technique to extract the text accurately in both
>>>> languages.
>>>>
>>>> I am using Python 3+, OpenCV, and Tesseract for development.
>>>>
>>>> Thanks in advance.
>>>>
>>>> Regards,
>>>> Pankaj Gupta

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/d362d93a-0ea2-4bdb-b391-982deb4d17een%40googlegroups.com.
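
For anyone following along in Python, the `-l ara+eng` style of invocation quoted in this thread could be sketched roughly as below. This is only a minimal sketch, not the method anyone in the thread actually used: the helper names (`build_lang_arg`, `build_command`, `ocr_image`) and the file name `sample.png` are made up for illustration, and it assumes the `tesseract` executable and the relevant traineddata files (e.g. `jpn`, `eng`) are installed.

```python
import shutil
import subprocess


def build_lang_arg(languages):
    """Join Tesseract language codes with '+', e.g. ['jpn', 'eng'] -> 'jpn+eng'."""
    codes = [code.strip() for code in languages if code and code.strip()]
    if not codes:
        raise ValueError("at least one language code is required")
    return "+".join(codes)


def build_command(image_path, languages, psm=None):
    """Build a tesseract CLI call that writes the recognized text to stdout."""
    cmd = ["tesseract", image_path, "stdout", "-l", build_lang_arg(languages)]
    if psm is not None:
        # Optional page segmentation mode; whether it helps depends on the image.
        cmd += ["--psm", str(psm)]
    return cmd


def ocr_image(image_path, languages, psm=None):
    """Run tesseract on one image (hypothetical helper; needs the binary installed)."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_command(image_path, languages, psm),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Only prints the command; call ocr_image(...) to actually run OCR.
    print(build_command("sample.png", ["jpn", "eng"]))
```

If pytesseract is in use instead of the command line, the same joined string can be passed as `lang="jpn+eng"` to `pytesseract.image_to_string`. As noted in the thread, multi-language recognition is slower and not perfect, so it may be worth comparing results per image rather than assuming one setting works everywhere.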