Hello Shree,
Then what is the way to train Sanskrit along with Roman diacritics and still achieve good accuracy, or are there alternative ways to achieve this?
Regards,

On Thu, Nov 5, 2020 at 8:15 PM Shree Devi Kumar <shreesh...@gmail.com> wrote:

> Legacy engine training won't work for Devanagari. The cube engine which
> was used in tesseract for Hindi has been removed.
>
> If you are only training for English and diacritics it may work for you.
> But note that there are no fine-tuning options for it. You have to train a
> model from scratch.
>
> ,.....
>
> shapetable, tr, etc. are all files for the legacy engine, 3.0x and before.
>
> The legacy engine is supported in tesseract4 with --oem 0
>
> On Thu, Nov 5, 2020, 17:14 Shree Devi Kumar <shreesh...@gmail.com> wrote:
>
>> Are you trying to train for the legacy tesseract engine?
>>
>> On Thu, Nov 5, 2020, 16:46 shreyansh dwivedi <advocates...@gmail.com>
>> wrote:
>>
>>> Hello Shree, I am attaching the image file, box file and the train.bash
>>> script in this email, along with the error generated while running the
>>> script. FYI, I am currently using Windows, so I run the bash script in an
>>> msys2 terminal.
>>> font_properties
>>> <https://drive.google.com/file/d/1s8RH4xjLwPjZ_go37F6CF07kT38vqG2s/view?usp=drive_web>
>>> san_NKP_int.inttemp
>>> <https://drive.google.com/file/d/18Plctl6Ia_dLhE-DMCI6zh0-1fCRy53q/view?usp=drive_web>
>>> san_NKP_int.normproto
>>> <https://drive.google.com/file/d/1Apbf1nrpXjGYD1-x4XfxFNLSWqMSCasb/view?usp=drive_web>
>>> san_NKP_int.ocrb.exp0.box
>>> <https://drive.google.com/file/d/1V4neOkxouYuoT0p4uSnp3RgqYmx0VfQK/view?usp=drive_web>
>>> san_NKP_int.ocrb.exp0.png
>>> <https://drive.google.com/file/d/1o-XZg3dZSwsFhrJtfuFOHlcpIvJ5ehlM/view?usp=drive_web>
>>> san_NKP_int.ocrb.exp0.tr
>>> <https://drive.google.com/file/d/1rgiQ8tWcYvxYS3MYgSZ19Wi-ulrudl7c/view?usp=drive_web>
>>> san_NKP_int.ocrb.exp1.box
>>> <https://drive.google.com/file/d/1CeTujdd_sFxgxPCj5ojkWc-riE0Jko0U/view?usp=drive_web>
>>> san_NKP_int.ocrb.exp1.png
>>> <https://drive.google.com/file/d/1S-NK7lG40r3aPsN9m8Fhg_JLgAcfZOeD/view?usp=drive_web>
>>> san_NKP_int.ocrb.exp1.tr
>>> <https://drive.google.com/file/d/1MzAaFkFOAGfBsdFVsvpQd9VuD9H9Srn7/view?usp=drive_web>
>>> san_NKP_int.ocrb.exp2.box
>>> <https://drive.google.com/file/d/1l2uVS73hFw6TjyCQeNFkQ8lYf-KBhjO9/view?usp=drive_web>
>>> san_NKP_int.ocrb.exp2.png
>>> <https://drive.google.com/file/d/1ywDR8j0K-ngGvj0WC0LAQYYkG6M64qDS/view?usp=drive_web>
>>> san_NKP_int.ocrb.exp2.tr
>>> <https://drive.google.com/file/d/1pcYoFkJvO0dFaY5OfuEaZwkyI5wjHobd/view?usp=drive_web>
>>> san_NKP_int.ocrb.exp3.box
>>> <https://drive.google.com/file/d/1zn4ZC4ueDryOW_oAslAIHH5di4zYlaWF/view?usp=drive_web>
>>> san_NKP_int.ocrb.exp3.png
>>> <https://drive.google.com/file/d/1j8hecGX9jVAchwpW5VMXCeIl0bvatMKG/view?usp=drive_web>
>>> san_NKP_int.ocrb.exp3.tr
>>> <https://drive.google.com/file/d/1LQJjrQtCRf3vbmPNpiJnwM_x1q0nWYoh/view?usp=drive_web>
>>> san_NKP_int.ocrb.exp4.box
>>> <https://drive.google.com/file/d/1WP3Oa5mxH0YsdM-HUZnBbh-OyEesWZy_/view?usp=drive_web>
>>> san_NKP_int.ocrb.exp4.png
>>> <https://drive.google.com/file/d/1TNkgDppOo3m5XAVb73evWLEFuH-mhtrW/view?usp=drive_web>
>>> san_NKP_int.ocrb.exp4.tr
>>> <https://drive.google.com/file/d/1hN2ORHCFo47wMw0BrkI77C0bW8ISFCzT/view?usp=drive_web>
>>> san_NKP_int.pffmtable
>>> <https://drive.google.com/file/d/1aIcJA4B-1yJzj54hcD6n-9eWZYBCCss2/view?usp=drive_web>
>>> san_NKP_int.shapetable
>>> <https://drive.google.com/file/d/1R4-yD_bMde_KJqGihH3-Uo9nVE6r-SqU/view?usp=drive_web>
>>> san_NKP_int.traineddata
>>> <https://drive.google.com/file/d/1nvyKsOVLhJs5uP1GcNHIOtqGkIe5Gt87/view?usp=drive_web>
>>> san_NKP_int.unicharset
>>> <https://drive.google.com/file/d/1BqMN29ZH8lTG9ZwkscmER8XkWQv9EQXm/view?usp=drive_web>
>>> train.bash
>>> <https://drive.google.com/file/d/1gUhDqGgjJCY5n4fc0ONNL943Qk-M3QeT/view?usp=drive_web>
>>> unicharset
>>> <https://drive.google.com/file/d/1ZhYZ663FXS2gqegIY2fDG-9IY8-du9Ud/view?usp=drive_web>
>>> Below is the error screenshot generated while running the bash script:
>>> [image: image.png]
>>>
>>> [image: image.png]
>>>
>>> On Sat, Oct 31, 2020 at 4:20 PM Shree Devi Kumar <shreesh...@gmail.com>
>>> wrote:
>>>
>>>> >ṣ -> it recognises as ş
>>>> I cannot reproduce the issue. I am getting the following:
>>>>
>>>> Line 120: praise of Viṣṇu. Lz. 1388.
>>>> Line 147: lakṣmī XXXIX. 51.
>>>>
>>>> Complete output is attached. It uses
>>>> https://github.com/Shreeshrii/tess5training-sanskrit-iast/blob/main/tessdata/fast/Sanskrit-1017-fast.traineddata
>>>>
>>>> Hello Shree,
>>>> I have an image containing Sanskrit text and Roman text with the
>>>> diacritical characters a, ā, ś, Ś, ṛ, ṇ, ṃ, ū, ī, ṭ, ṅ, ḍ, ṛ, ṣ. I am using
>>>> the sanskrit_int.traineddata created by you. It recognises Sanskrit text
>>>> quite well in properly scanned images, but among the diacritical
>>>> characters only a few are identified, namely ā and ū, while
>>>> ṣ -> it recognises as ş
>>>>
>>>> Right now I am using QTBoxEditor to correct wrongly recognised
>>>> characters like the one above.
>>>>
>>>> When training a new language model, some rules are defined, and one of
>>>> them is the naming convention for the image files:
>>>> [language name].[font name].exp[number].[file extension]
>>>>
>>>> What is the font name here, and how do I identify which font name
>>>> should be used for the image? For better understanding I am attaching
>>>> the image file.
>>>>
>>>> On Mon, Oct 19, 2020 at 4:45 PM Shree Devi Kumar <shreesh...@gmail.com>
>>>> wrote:
>>>>
>>>>> Please share the ground truth for the test images also.
>>>>>
>>>>> Yes, you can certainly try to train on the basis of these models.
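
On the naming question above: as far as I understand, the font name in this pattern is a label chosen by whoever prepares the training data (in the file list earlier in this thread it is `ocrb`), and for legacy training it is expected to match an entry in `font_properties`; it is not read from the image. Below is a minimal Python sketch of how such a name breaks into its parts. `TrainingFileName` and `parse_training_name` are hypothetical names made up for illustration, not part of Tesseract.

```python
import re
from typing import NamedTuple, Optional


class TrainingFileName(NamedTuple):
    """Parts of a name like san_NKP_int.ocrb.exp0.box (illustrative type)."""
    lang: str
    font: str
    exp: int
    ext: str


# [language name].[font name].exp[number].[file extension]
_PATTERN = re.compile(
    r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<exp>\d+)\.(?P<ext>[^.]+)$"
)


def parse_training_name(name: str) -> Optional[TrainingFileName]:
    """Return the four parts, or None if the name does not fit the pattern."""
    m = _PATTERN.match(name)
    if m is None:
        return None
    return TrainingFileName(m["lang"], m["font"], int(m["exp"]), m["ext"])


if __name__ == "__main__":
    # File names taken from the attachment list in this thread.
    for name in ["san_NKP_int.ocrb.exp0.box", "san_NKP_int.ocrb.exp4.tr", "unicharset"]:
        print(name, "->", parse_training_name(name))
```

Names that do not follow the pattern, such as `unicharset`, come back as `None`, which makes it easy to spot files that the naming rule does not apply to.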
>>>>>
>>>>> On Mon, Oct 19, 2020, 15:51 shreyansh dwivedi <advocates...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello Shree,
>>>>>> Shubh Navratri,
>>>>>> I used the trained models built by you, but unfortunately they are not
>>>>>> giving good results. Please refer to the picture and the text inscribed
>>>>>> in it. Could we build a model on the basis of it? Please find it
>>>>>> attached.
>>>>>>
>>>>>> Regards,
>>>>>> Shreyansh Dwivedi
>>>>>>
>>>>>> ---------- Forwarded message ---------
>>>>>> From: Shree Devi Kumar <shreesh...@gmail.com>
>>>>>> Date: Thu, Oct 8, 2020 at 6:18 PM
>>>>>> Subject: Re: [tesseract-ocr] Diacriticals Training
>>>>>> To: tesseract-ocr <tesseract-ocr@googlegroups.com>
>>>>>>
>>>>>>
>>>>>> I have uploaded the results of various trainings for IAST (with
>>>>>> diacritics) and Devanagari for Sanskrit at
>>>>>> https://github.com/Shreeshrii/tess5training-sanskrit-iast/tree/main/tessdata/best
>>>>>> . The traineddata files and the corresponding lstm-unicharset have been
>>>>>> uploaded there.
>>>>>>
>>>>>> The training has been done mostly with line images of synthetic
>>>>>> training data in various fonts. On evaluation datasets of synthetic
>>>>>> training data, not seen during training, I get a CER of 2-3%. I am
>>>>>> curious to know how well these perform with real-life images.
>>>>>>
>>>>>> I would appreciate it if those who are testing could send me a few of
>>>>>> their test images along with the ground truth text.
>>>>>>
>>>>>> On Mon, Sep 28, 2020 at 12:19 PM Shree Devi Kumar <
>>>>>> shreesh...@gmail.com> wrote:
>>>>>>
>>>>>>> I am currently running a training run based on synthetic training
>>>>>>> data for Sanskrit to support both Devanagari script with Vedic accents
>>>>>>> and IAST (Roman with diacritics support). I will share the
>>>>>>> traineddata with you and others who are interested, to test how well
>>>>>>> it works with real-life images.
>>>>>>>
>>>>>>> On Mon, Sep 28, 2020, 10:43 shreyansh dwivedi <
>>>>>>> advocates...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello everyone,
>>>>>>>> I want to train some diacritical characters which are not present in
>>>>>>>> the Latin trained model. Apart from Latin I tried the Vietnamese and
>>>>>>>> Latvian trained models, but some of the diacritical characters are
>>>>>>>> missing in those models too. Some of the missing characters which I
>>>>>>>> need to recognise are listed below.
>>>>>>>> ṭ
>>>>>>>> Ṭ
>>>>>>>> ṅ
>>>>>>>> ṭh
>>>>>>>> ḍ
>>>>>>>> ḍh
>>>>>>>> ṇ
>>>>>>>> ṃ
>>>>>>>> ṣ
>>>>>>>> Ḥ
>>>>>>>> ḥ
>>>>>>>> I want to train the tesseract engine to recognise the above
>>>>>>>> diacritical characters in the text image.
>>>>>>>> Any help would be appreciated; an explanation from scratch would be a
>>>>>>>> great way to understand.
>>>>>>>> Thank you!
>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>> send an email to tesseract-ocr+unsubscr...@googlegroups.com.
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/CAMREWd6R%2Bec5r%3D77%2BRWGM7PUKZPqqJT%2BkNX6r9zwijvW5sxykQ%40mail.gmail.com
>>>>>>
>>>>>> --
>>>>>> ____________________________________________________________
>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAMREWd45DEt_y5EcXLQR0_gecJdEPJY1fNyGkmMDugYnGCDG%2BQ%40mail.gmail.com.
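
For anyone comparing OCR output against ground truth, as requested earlier in the thread, here is a small Python sketch of character error rate (CER) under the usual definition: Levenshtein edit distance divided by the length of the reference text. That definition is an assumption on my part; the evaluation tooling behind the 2-3% figure quoted above may count differently. `levenshtein` and `cer` are illustrative helper names. Both strings are normalised to NFC first so that a precomposed ṣ and an s followed by a combining dot below are counted as the same character.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of hypothesis against reference (assumed definition)."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference text must not be empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    truth = "lakṣmī"   # ground truth with s + dot below (U+1E63)
    ocr = "lakşmī"     # OCR output with s + cedilla (U+015F)
    print(f"CER = {cer(truth, ocr):.3f}")
    # Show which characters differ and what they actually are.
    for t, o in zip(truth, ocr):
        if t != o:
            print(f"U+{ord(t):04X} {unicodedata.name(t)} -> "
                  f"U+{ord(o):04X} {unicodedata.name(o)}")
```

Printing the Unicode names makes confusions like ṣ versus ş visible even when the two glyphs look almost identical on screen.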