Please see the tesseract-ocr/tesstrain repo. You need line images and their ground-truth text; the makefile will generate the box and lstmf files and run the training.
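A small pre-flight check can catch missing transcriptions before the makefile is run. The sketch below assumes the layout described in the tesstrain README, where each line image (`.tif`, `.png`, `.bin.png` or `.nrm.png`) sits next to a transcription with the same base name and a `.gt.txt` extension. The function name `find_unpaired` and the directory name in the usage note are made up for illustration.

```python
from pathlib import Path

# Image extensions assumed from the tesstrain README. Longer suffixes come
# first so that "x.bin.png" yields the base name "x", not "x.bin".
IMAGE_SUFFIXES = (".bin.png", ".nrm.png", ".png", ".tif")
TEXT_SUFFIX = ".gt.txt"


def find_unpaired(ground_truth_dir):
    """Return (images lacking a .gt.txt, .gt.txt files lacking an image).

    Both results are sorted lists of base names.
    """
    image_bases = set()
    text_bases = set()
    for path in Path(ground_truth_dir).iterdir():
        name = path.name
        if name.endswith(TEXT_SUFFIX):
            text_bases.add(name[: -len(TEXT_SUFFIX)])
            continue
        for suffix in IMAGE_SUFFIXES:
            if name.endswith(suffix):
                image_bases.add(name[: -len(suffix)])
                break
    return sorted(image_bases - text_bases), sorted(text_bases - image_bases)
```

Calling `find_unpaired("data/san_iast-ground-truth")` (a hypothetical model name) returns two lists; both should be empty before training starts.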
Many blog posts and tutorials about Tesseract training are written for Tesseract 3.0x. They will not work for Devanagari.

You can also look at the tesstutorial for 4.0. You can try plusminus training (fine-tuning for a few extra characters) or replace-top-layer training.

For good results you need a lot of training data, e.g. 50,000 text lines.

On Thu, Nov 12, 2020, 12:21 shreyansh dwivedi <advocates...@gmail.com> wrote:

> Hello Shree,
> Then what is the way to train Sanskrit along with Roman diacritics
> and still achieve good accuracy, or are there alternative ways to achieve this?
>
> Regards,
>
> On Thu, Nov 5, 2020 at 8:15 PM Shree Devi Kumar <shreesh...@gmail.com> wrote:
>
>> Legacy engine training won't work for Devanagari. The cube engine, which
>> was used in Tesseract for Hindi, has been removed.
>>
>> If you are only training for English and diacritics it may work for you.
>> But note that there are no fine-tuning options for it. You have to train a
>> model from scratch.
>>
>> ,.....
>>
>> shapetable, tr etc. are all files for the legacy engine, 3.0x and before.
>>
>> It is supported in tesseract4 with --oem 0
>>
>> On Thu, Nov 5, 2020, 17:14 Shree Devi Kumar <shreesh...@gmail.com> wrote:
>>
>>> Are you trying to train for the legacy Tesseract engine?
>>>
>>> On Thu, Nov 5, 2020, 16:46 shreyansh dwivedi <advocates...@gmail.com> wrote:
>>>
>>>> Hello Shree, I am attaching the image file, box file and the train.bash
>>>> script to this email, along with the error generated while running the
>>>> script. FYI, I am currently using Windows, so I run the bash script in an
>>>> MSYS2 terminal.
>>>>
>>>> font_properties
>>>> <https://drive.google.com/file/d/1s8RH4xjLwPjZ_go37F6CF07kT38vqG2s/view?usp=drive_web>
>>>> san_NKP_int.inttemp
>>>> <https://drive.google.com/file/d/18Plctl6Ia_dLhE-DMCI6zh0-1fCRy53q/view?usp=drive_web>
>>>> san_NKP_int.normproto
>>>> <https://drive.google.com/file/d/1Apbf1nrpXjGYD1-x4XfxFNLSWqMSCasb/view?usp=drive_web>
>>>> san_NKP_int.ocrb.exp0.box
>>>> <https://drive.google.com/file/d/1V4neOkxouYuoT0p4uSnp3RgqYmx0VfQK/view?usp=drive_web>
>>>> san_NKP_int.ocrb.exp0.png
>>>> <https://drive.google.com/file/d/1o-XZg3dZSwsFhrJtfuFOHlcpIvJ5ehlM/view?usp=drive_web>
>>>> san_NKP_int.ocrb.exp0.tr
>>>> <https://drive.google.com/file/d/1rgiQ8tWcYvxYS3MYgSZ19Wi-ulrudl7c/view?usp=drive_web>
>>>> san_NKP_int.ocrb.exp1.box
>>>> <https://drive.google.com/file/d/1CeTujdd_sFxgxPCj5ojkWc-riE0Jko0U/view?usp=drive_web>
>>>> san_NKP_int.ocrb.exp1.png
>>>> <https://drive.google.com/file/d/1S-NK7lG40r3aPsN9m8Fhg_JLgAcfZOeD/view?usp=drive_web>
>>>> san_NKP_int.ocrb.exp1.tr
>>>> <https://drive.google.com/file/d/1MzAaFkFOAGfBsdFVsvpQd9VuD9H9Srn7/view?usp=drive_web>
>>>> san_NKP_int.ocrb.exp2.box
>>>> <https://drive.google.com/file/d/1l2uVS73hFw6TjyCQeNFkQ8lYf-KBhjO9/view?usp=drive_web>
>>>> san_NKP_int.ocrb.exp2.png
>>>> <https://drive.google.com/file/d/1ywDR8j0K-ngGvj0WC0LAQYYkG6M64qDS/view?usp=drive_web>
>>>> san_NKP_int.ocrb.exp2.tr
>>>> <https://drive.google.com/file/d/1pcYoFkJvO0dFaY5OfuEaZwkyI5wjHobd/view?usp=drive_web>
>>>> san_NKP_int.ocrb.exp3.box
>>>> <https://drive.google.com/file/d/1zn4ZC4ueDryOW_oAslAIHH5di4zYlaWF/view?usp=drive_web>
>>>> san_NKP_int.ocrb.exp3.png
>>>> <https://drive.google.com/file/d/1j8hecGX9jVAchwpW5VMXCeIl0bvatMKG/view?usp=drive_web>
>>>> san_NKP_int.ocrb.exp3.tr
>>>> <https://drive.google.com/file/d/1LQJjrQtCRf3vbmPNpiJnwM_x1q0nWYoh/view?usp=drive_web>
>>>> san_NKP_int.ocrb.exp4.box
>>>> <https://drive.google.com/file/d/1WP3Oa5mxH0YsdM-HUZnBbh-OyEesWZy_/view?usp=drive_web>
>>>> san_NKP_int.ocrb.exp4.png
>>>> <https://drive.google.com/file/d/1TNkgDppOo3m5XAVb73evWLEFuH-mhtrW/view?usp=drive_web>
>>>> san_NKP_int.ocrb.exp4.tr
>>>> <https://drive.google.com/file/d/1hN2ORHCFo47wMw0BrkI77C0bW8ISFCzT/view?usp=drive_web>
>>>> san_NKP_int.pffmtable
>>>> <https://drive.google.com/file/d/1aIcJA4B-1yJzj54hcD6n-9eWZYBCCss2/view?usp=drive_web>
>>>> san_NKP_int.shapetable
>>>> <https://drive.google.com/file/d/1R4-yD_bMde_KJqGihH3-Uo9nVE6r-SqU/view?usp=drive_web>
>>>> san_NKP_int.traineddata
>>>> <https://drive.google.com/file/d/1nvyKsOVLhJs5uP1GcNHIOtqGkIe5Gt87/view?usp=drive_web>
>>>> san_NKP_int.unicharset
>>>> <https://drive.google.com/file/d/1BqMN29ZH8lTG9ZwkscmER8XkWQv9EQXm/view?usp=drive_web>
>>>> train.bash
>>>> <https://drive.google.com/file/d/1gUhDqGgjJCY5n4fc0ONNL943Qk-M3QeT/view?usp=drive_web>
>>>> unicharset
>>>> <https://drive.google.com/file/d/1ZhYZ663FXS2gqegIY2fDG-9IY8-du9Ud/view?usp=drive_web>
>>>>
>>>> Below is the error screenshot generated while running the bash script:
>>>> [image: image.png]
>>>>
>>>> [image: image.png]
>>>>
>>>> On Sat, Oct 31, 2020 at 4:20 PM Shree Devi Kumar <shreesh...@gmail.com> wrote:
>>>>
>>>>> > ṣ -> it recognises as ş
>>>>>
>>>>> I cannot reproduce the issue. I am getting the following:
>>>>>
>>>>> Line 120: praise of Viṣṇu. Lz. 1388.
>>>>> Line 147: lakṣmī XXXIX. 51.
>>>>>
>>>>> Complete output is attached.
>>>>> It uses
>>>>> https://github.com/Shreeshrii/tess5training-sanskrit-iast/blob/main/tessdata/fast/Sanskrit-1017-fast.traineddata
>>>>>
>>>>> Hello Shree,
>>>>> I have an image containing Sanskrit text and Roman text with the
>>>>> diacritics a, ā, ś, Ś, ṛ, ṇ, ṃ, ū, ī, ṭ, ṅ, ḍ, ṛ, ṣ. I am using the
>>>>> sanskrit_int.traineddata created by you. It recognises Sanskrit text quite
>>>>> well in properly scanned images, but in the diacritical part only a few
>>>>> characters are identified, namely ā and ū, while
>>>>> ṣ -> it recognises as ş
>>>>>
>>>>> Right now I am using QTBoxEditor to correct wrongly recognised
>>>>> characters like the one above.
>>>>>
>>>>> When training a new language model, some rules are defined, and one of
>>>>> them is the naming convention for the image. What is the font name here,
>>>>> and how do I identify which font is used in the image?
>>>>> [language name].[font name].exp[number].[file extension]
>>>>>
>>>>> For better understanding I am attaching the image file.
>>>>>
>>>>> On Mon, Oct 19, 2020 at 4:45 PM Shree Devi Kumar <shreesh...@gmail.com> wrote:
>>>>>
>>>>>> Please share the ground truth for the test images also.
>>>>>>
>>>>>> Yes, you can certainly try to train on the basis of these models.
>>>>>>
>>>>>> On Mon, Oct 19, 2020, 15:51 shreyansh dwivedi <advocates...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello Shree,
>>>>>>> Subh Navratri,
>>>>>>> I used the trained models built by you, but unfortunately they are not
>>>>>>> giving results. Please refer to the picture and the text inscribed in it.
>>>>>>> Could we build the model on the basis of it? PFA.
>>>>>>> >>>>>>> Regards, >>>>>>> Shreyansh Dwivedi >>>>>>> >>>>>>> ---------- Forwarded message --------- >>>>>>> From: Shree Devi Kumar <shreesh...@gmail.com> >>>>>>> Date: Thu, Oct 8, 2020 at 6:18 PM >>>>>>> Subject: Re: [tesseract-ocr] Diacriticals Training >>>>>>> To: tesseract-ocr <tesseract-ocr@googlegroups.com> >>>>>>> >>>>>>> >>>>>>> I have uploaded the results of various trainings for IAST (with >>>>>>> diacritics) and Devanagari for Sanskrit at >>>>>>> https://github.com/Shreeshrii/tess5training-sanskrit-iast/tree/main/tessdata/best >>>>>>> . The traineddata files and the corresponding lstm-unicharset has been >>>>>>> uploaded there. >>>>>>> >>>>>>> The training has been done mostly with line images of synthetic >>>>>>> training data in various fonts. On evaluation datasets of synthetic >>>>>>> training data, not seen during training, I get a CER of 2-3%. I am >>>>>>> curious >>>>>>> to know how well these perform with real life images. >>>>>>> >>>>>>> I will appreciate if those who are testing can send me a few of >>>>>>> their test images along with the ground truth text. >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> <http://www.avg.com/email-signature?utm_medium=email&utm_source=link&utm_campaign=sig-email&utm_content=webmail> >>>>>>> Virus-free. 
>>>>>>>
>>>>>>> On Mon, Sep 28, 2020 at 12:19 PM Shree Devi Kumar <shreesh...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I am currently running a training run based on synthetic training
>>>>>>>> data for Sanskrit, to support both Devanagari script with Vedic accents
>>>>>>>> and IAST (Roman with diacritics). I will share the traineddata
>>>>>>>> with you and others who are interested, to test how well it works with
>>>>>>>> real-life images.
>>>>>>>>
>>>>>>>> On Mon, Sep 28, 2020, 10:43 shreyansh dwivedi <advocates...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hello everyone,
>>>>>>>>> I want to train some diacritics that are not present in the
>>>>>>>>> Latin trained model. Apart from Latin I tried the Vietnamese and Latvian
>>>>>>>>> trained models, but some of the diacritics are missing from those models
>>>>>>>>> too. Some of the missing characters that I need to recognise are listed below.
>>>>>>>>> ṭ
>>>>>>>>> Ṭ
>>>>>>>>> ṅ
>>>>>>>>> ṭh
>>>>>>>>> ḍ
>>>>>>>>> ḍh
>>>>>>>>> ṇ
>>>>>>>>> ṃ
>>>>>>>>> ṣ
>>>>>>>>> Ḥ
>>>>>>>>> ḥ
>>>>>>>>> I want to train the above diacritics so that the Tesseract engine
>>>>>>>>> recognises these characters in the text image.
>>>>>>>>> Any help would be appreciated, and an explanation from scratch would be a
>>>>>>>>> great way to understand.
>>>>>>>>> Thank you!
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>>> send an email to tesseract-ocr+unsubscr...@googlegroups.com.
>>>>>>>>> To view this discussion on the web visit
>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/CAMREWd6R%2Bec5r%3D77%2BRWGM7PUKZPqqJT%2BkNX6r9zwijvW5sxykQ%40mail.gmail.com
>>>>>>>>> .
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> ____________________________________________________________
>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
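On the ṣ versus ş confusion reported earlier in the thread: the two letters look alike but are different code points, so it can help to print exactly which characters differ between the ground truth and the OCR output. The following is only a sketch using Python's standard `unicodedata` module; the helper names are made up, and it assumes both strings have the same length after NFC normalisation.

```python
import unicodedata


def code_point(ch):
    """Format a character as 'U+XXXX NAME' for unambiguous comparison."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}"


def char_diffs(expected, actual):
    """List (index, expected code point, actual code point) for mismatches."""
    exp = unicodedata.normalize("NFC", expected)
    act = unicodedata.normalize("NFC", actual)
    if len(exp) != len(act):
        # Keeps the sketch simple; aligning strings of different length
        # would need an edit-distance alignment instead of zip().
        raise ValueError("this sketch only compares strings of equal length")
    return [(i, code_point(e), code_point(a))
            for i, (e, a) in enumerate(zip(exp, act)) if e != a]


# Ground truth "Vi\u1e63\u1e47u" against an output that used s with cedilla.
for index, want, got in char_diffs("Vi\u1e63\u1e47u", "Vi\u015f\u1e47u"):
    print(index, want, "->", got)
```

If the unexpected character turns up in the output, it is worth checking whether it is present in the unicharset of the model being used and whether the expected one is missing.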