See https://github.com/tesseract-ocr/tesseract/pull/2710/commits/486928d1d6d88280227e923b89f4bc3051586d21

Make the change to your local tesstrain_utils.sh and then run training.

On Mon, Oct 14, 2019 at 6:56 PM Isurianuradha96 <[email protected]> wrote:

> I tried the sin.training_text inside the langdata_lstm (sin) folder, but
> the same problem is still there: it gives the warning message and the
> "normalization failed" message [1].
>
> On Mon, 14 Oct 2019, 18:34 Shree Devi Kumar, <[email protected]> wrote:
>
>> What about the text in langdata_lstm?
>>
>> On Mon, Oct 14, 2019 at 2:44 PM Isurianuradha96 <[email protected]> wrote:
>>
>>> Regarding the normalization issue in the training text: the
>>> sin.training_text provided by Tesseract (inside the langdata folder)
>>> raises the same issue. Have you sorted out that error?
>>>
>>> On Thu, Oct 10, 2019 at 4:43 PM Shree Devi Kumar <[email protected]> wrote:
>>>
>>>> See https://unicode.org/charts/PDF/U0D80.pdf
>>>>
>>>> 0DD0 $ැ SINHALA VOWEL SIGN KETTI AEDA-PILLA = sinhala vowel sign ae
>>>>
>>>> 0DCA $් SINHALA SIGN AL-LAKUNA = virama
>>>>
>>>> Your training text is not normalized. You have words beginning with
>>>> combining marks. Fix the text before training to reduce errors.
>>>>
>>>> On Thu, Oct 10, 2019 at 4:08 PM Isurianuradha96 <[email protected]> wrote:
>>>>
>>>>> I also want to know the reason for this kind of error appearing in
>>>>> the terminal during training [2].
>>>>>
>>>>> [2].
>>>>> [image: image.png]
>>>>>
>>>>> Thank you. Looking forward to your reply.
>>>>>
>>>>> On Thu, Oct 10, 2019 at 3:41 PM Isurianuradha96 <[email protected]> wrote:
>>>>>
>>>>>> Thanks a lot. But we have a doubt about creating the model, since here
>>>>>> each sentence is converted into box, tiff and lstmf files. How should
>>>>>> we continue the process to make the model?
>>>>>> Also, can't we add multiple fonts in the process of creating the
>>>>>> model? At the moment we are using it as in image [1]. If we want to
>>>>>> add more fonts to the --fontlist parameter, how should we proceed?
>>>>>> [1].
>>>>>>
>>>>>> [image: image.png]
>>>>>>
>>>>>> Looking forward to hearing from you.
>>>>>> Thank you.
>>>>>>
>>>>>> On Mon, Oct 7, 2019 at 11:08 AM Shree Devi Kumar <[email protected]> wrote:
>>>>>>
>>>>>>> You can assign any unique label to the prefix and create the required
>>>>>>> files in the input directory to match the names.
>>>>>>>
>>>>>>> How are you running the bash script? bash legacy.sh
>>>>>>>
>>>>>>> On Mon, Oct 7, 2019, 08:03 Isurianuradha96 <[email protected]> wrote:
>>>>>>>
>>>>>>>> I have a small doubt related to the 'PREFIX' parameter in
>>>>>>>> legacy.sh. For FM models you have entered FM as the value of the
>>>>>>>> prefix parameter. But for other non-Unicode fonts (which are not FM
>>>>>>>> models), how do we need to change that prefix value? I also tried
>>>>>>>> to execute the legacy.sh file, but it gave me an error like image
>>>>>>>> [1]. I changed the font dir too [2]. What is the reason for that,
>>>>>>>> and how can I fix it?
>>>>>>>>
>>>>>>>> [1].
>>>>>>>>
>>>>>>>> [image: image.png]
>>>>>>>>
>>>>>>>> [2].
>>>>>>>>
>>>>>>>> [image: image.png]
>>>>>>>>
>>>>>>>> On Mon, Oct 7, 2019 at 3:11 AM Isurianuradha96 <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks a lot. As mentioned above, can other non-Unicode fonts
>>>>>>>>> also be trained in a similar way?
>>>>>>>>> Looking forward to a reply. Thank you.
>>>>>>>>>
>>>>>>>>> On Sun, Oct 6, 2019 at 2:46 PM Shree Devi Kumar <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> See the attached zip file with sample input and output files.
>>>>>>>>>>
>>>>>>>>>> On Sun, Oct 6, 2019 at 12:44 PM Shree Devi Kumar <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> This requires you to create three input files:
>>>>>>>>>>> 1. A list of legacy fonts, e.g. the FM series, which all use the
>>>>>>>>>>> same mapping for Sinhala
>>>>>>>>>>> 2. Training text in the legacy font; it will usually show up as
>>>>>>>>>>> garbled English
>>>>>>>>>>> 3. The above legacy text converted to Unicode using an existing
>>>>>>>>>>> legacy-to-Unicode converter; these are available online
>>>>>>>>>>>
>>>>>>>>>>> Using these 3 files, this script will generate tif image files,
>>>>>>>>>>> wordstr box files and lstmf files; it will also create a unicharset
>>>>>>>>>>> and an all-lstmf file.
>>>>>>>>>>>
>>>>>>>>>>> You can use it in conjunction with the tesstrain repo. I plan to
>>>>>>>>>>> add a pull request to the repo with the script along with some
>>>>>>>>>>> documentation.
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Oct 6, 2019, 07:59 Isurianuradha96 <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> It seems this bash script (legacy.sh) is responsible for the
>>>>>>>>>>>> mapping of non-Unicode fonts with legacy mapping (as a
>>>>>>>>>>>> legacy-to-Unicode converter), and for the generation of the
>>>>>>>>>>>> box, tif and lstmf files. Am I right? Where should I place this
>>>>>>>>>>>> script file in tesseract? Or should I run it directly before the
>>>>>>>>>>>> generation of the box, tif and lstmf files? Please correct me if
>>>>>>>>>>>> my understanding is wrong.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Oct 5, 2019 at 10:55 PM Shree Devi Kumar <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> If you use Linux, you can try something similar to the attached
>>>>>>>>>>>>> bash script.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Oct 3, 2019 at 2:55 PM Shree Devi Kumar <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> There is no direct method for training from non-Unicode
>>>>>>>>>>>>>> fonts. Tesseract's output is also Unicode text only.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You can work from scanned images of text in non-Unicode fonts
>>>>>>>>>>>>>> and provide the Unicode transcription of it.
>>>>>>>>>>>>>> You could probably use a legacy-to-Unicode converter for the text.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> See https://github.com/tesseract-ocr/tesstrain for training
>>>>>>>>>>>>>> from single-line images and their ground truth transcriptions.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Oct 3, 2019 at 2:27 PM isuri anuradha <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As you mentioned, Tesseract 4.0 only supports Unicode fonts.
>>>>>>>>>>>>>>> What is the procedure if we want to train with non-Unicode
>>>>>>>>>>>>>>> fonts? Most documents written in Sri Lanka are in non-Unicode
>>>>>>>>>>>>>>> fonts, and there are lots of historical books available that
>>>>>>>>>>>>>>> are written in non-Unicode form.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails
>>>>>>>>>>>>>>> from it, send an email to [email protected].
>>>>>>>>>>>>>>> To view this discussion on the web visit
>>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/a280b31b-f2c3-494e-a69e-ac3e36f02382%40googlegroups.com
>>>>>>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/a280b31b-f2c3-494e-a69e-ac3e36f02382%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>>>>>>> .
--

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduVyjf89F_dffk5tBeqvrTNjv1wrnRWXbQeWN810wxbztQ%40mail.gmail.com.
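
A minimal sketch, not part of the original thread, of one way to find the problem described in the quoted replies: words in a training text that begin with a combining mark (such as U+0DD0 or U+0DCA). It assumes that "combining mark" means a character whose Unicode general category starts with M (Mn, Mc or Me); whether this matches Tesseract's own normalization check exactly is an assumption. The function name and the sample string are hypothetical and only for illustration.

```python
import unicodedata


def words_starting_with_mark(text):
    """Return (line_number, word) pairs for words whose first character
    is a Unicode combining mark (general category Mn, Mc or Me)."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for word in line.split():
            # Categories beginning with "M" are the combining marks.
            if unicodedata.category(word[0]).startswith("M"):
                hits.append((lineno, word))
    return hits


# Hypothetical sample, not taken from sin.training_text:
# line 1 has one well-formed word and one word starting with U+0DD0,
# line 2 has a word starting with U+0DCA.
sample = "\u0d9a\u0dd0\u0db8 \u0dd0\u0db8\n\u0dca\u0d9a"

for lineno, word in words_starting_with_mark(sample):
    # Print code points rather than glyphs so the leading mark is visible.
    print(lineno, " ".join(f"U+{ord(ch):04X}" for ch in word))
```

To run it on a real file, read the training text as UTF-8 and pass its contents to the function; the reported line numbers show where the text needs fixing before training.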

