See
https://github.com/tesseract-ocr/tesseract/pull/2710/commits/486928d1d6d88280227e923b89f4bc3051586d21

Make the change to your local tesstrain_utils.sh, then run training.

On Mon, Oct 14, 2019 at 6:56 PM Isurianuradha96 <[email protected]>
wrote:

> I tried the sin.training_text inside the langdata_lstm (sin) folder, but
> the same problem is still there: it gives the warning message and the
> normalization failed message [1]
>
>
>
> On Mon, 14 Oct 2019, 18:34 Shree Devi Kumar, <[email protected]> wrote:
>
>> What about text in langdata_lstm?
>>
>> On Mon, Oct 14, 2019 at 2:44 PM Isurianuradha96 <
>> [email protected]> wrote:
>>
>>> Regarding the normalization issue in the training text: the
>>> sin.training_text provided by Tesseract (inside the langdata folder)
>>> raises the same issue. Have you sorted out that error?
>>>
>>> On Thu, Oct 10, 2019 at 4:43 PM Shree Devi Kumar <[email protected]>
>>> wrote:
>>>
>>>> See https://unicode.org/charts/PDF/U0D80.pdf
>>>>
>>>>   0DD0 ◌ැ SINHALA VOWEL SIGN KETTI AEDA-PILLA = sinhala vowel sign ae
>>>>
>>>>   0DCA ◌් SINHALA SIGN AL-LAKUNA = virama
>>>>
>>>> Your training text is not normalized. You have words beginning with
>>>> combining marks. Fix the text before training to reduce errors.
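
One way to locate such words before training is sketched below. This is a minimal illustration, not part of Tesseract: it flags any whitespace-separated word whose first code point has a Unicode general category starting with M (Mn, Mc or Me), which covers both marks listed above. The function name and the sample string are made up for the example, and Tesseract's own normalization may reject text for other reasons that this check does not cover.

```python
import unicodedata


def words_starting_with_mark(text):
    """Return (line_number, word) pairs for words whose first code point
    is a Unicode combining mark (general category Mn, Mc or Me)."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for word in line.split():
            if unicodedata.category(word[0]).startswith("M"):
                hits.append((line_no, word))
    return hits


# Made-up sample: on each line one word is well formed and one starts
# with a combining mark (U+0DD0 on line 1, U+0DCA on line 2).
sample = "\u0d9a\u0dd0 \u0dd0\u0d9a\n\u0dca\u0d9a \u0d9a\u0dca"

for line_no, word in words_starting_with_mark(sample):
    # Print code points rather than glyphs so the problem is visible.
    print(line_no, " ".join(f"U+{ord(ch):04X}" for ch in word))
```

To check a real file, read it as UTF-8 and pass its contents to the function, for example the sin.training_text file discussed in this thread.
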
>>>>
>>>> On Thu, Oct 10, 2019 at 4:08 PM Isurianuradha96 <
>>>> [email protected]> wrote:
>>>>
>>>>> I would also like to know the reason for this kind of error appearing
>>>>> in the terminal during training. [2]
>>>>>
>>>>> [2].
>>>>> [image: image.png]
>>>>>
>>>>> Thank you. Looking forward to your reply.
>>>>>
>>>>>
>>>>> On Thu, Oct 10, 2019 at 3:41 PM Isurianuradha96 <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Thanks a lot. We have a doubt about creating the model, since here
>>>>>> each sentence is converted into box, tiff and lstmf files. How should
>>>>>> we continue the process to build the model?
>>>>>> Also, can we use multiple fonts when creating the model? At the moment
>>>>>> we are running it as shown in image [1]. If we want to add more fonts
>>>>>> to the --fontlist parameter, how should we proceed?
>>>>>> [1].
>>>>>>
>>>>>> [image: image.png]
>>>>>>
>>>>>> Looking forward to hearing from you.
>>>>>> Thank you.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Oct 7, 2019 at 11:08 AM Shree Devi Kumar <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> You can assign any unique label to the prefix and create the required
>>>>>>> files in the input directory to match the names.
>>>>>>>
>>>>>>> How are you running the bash script? bash legacy.sh
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Oct 7, 2019, 08:03 Isurianuradha96 <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> I have a small doubt related to the 'PREFIX' parameter in legacy.sh.
>>>>>>>> For FM models you have entered FM as the value of the prefix
>>>>>>>> parameter. For other non-Unicode fonts (which are not FM models), how
>>>>>>>> should we change that prefix value? I also tried to execute the
>>>>>>>> legacy.sh file, but it gave me an error like the one in image [1]. I
>>>>>>>> changed the font dir too [2]. What is the reason for that, and how can
>>>>>>>> I fix it?
>>>>>>>>
>>>>>>>> [1].
>>>>>>>>
>>>>>>>> [image: image.png]
>>>>>>>>
>>>>>>>> [2].
>>>>>>>>
>>>>>>>> [image: image.png]
>>>>>>>>
>>>>>>>> On Mon, Oct 7, 2019 at 3:11 AM Isurianuradha96 <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks a lot. Can other non-Unicode fonts also be trained in the
>>>>>>>>> same way as described above?
>>>>>>>>> Looking forward to your reply. Thank you.
>>>>>>>>>
>>>>>>>>> On Sun, Oct 6, 2019 at 2:46 PM Shree Devi Kumar <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> See attached zipfile with sample input and output files
>>>>>>>>>>
>>>>>>>>>> On Sun, Oct 6, 2019 at 12:44 PM Shree Devi Kumar <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> This requires you to create three input files:
>>>>>>>>>>> 1. A list of legacy fonts, e.g. the FM series, which all use the same
>>>>>>>>>>> mapping for Sinhala
>>>>>>>>>>> 2. The training text in the legacy font; it will usually show up as
>>>>>>>>>>> garbled English
>>>>>>>>>>> 3. The above legacy text converted to Unicode using an existing
>>>>>>>>>>> legacy-to-Unicode converter; these are available online
>>>>>>>>>>>
>>>>>>>>>>> Using these three files, the script will generate tif image files,
>>>>>>>>>>> wordstr box files and lstmf files. It will also create a unicharset
>>>>>>>>>>> and an all-lstmf file.
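
The relationship between the three inputs can be sketched as follows. This is only an illustration under stated assumptions, not the attached script: it assumes the legacy text and its Unicode conversion are line-aligned (line N of one corresponds to line N of the other), and the output naming scheme (prefix, font, line number) is hypothetical. The font names and text lines are placeholders, not real FM-encoded data.

```python
def pair_lines(legacy_text, unicode_text, fonts, prefix="FM"):
    """Pair each legacy-encoded line with its Unicode transcription, once
    per font, and return (basename, legacy_line, unicode_line) tuples.

    The legacy line is what would be rendered in the legacy font; the
    Unicode line is the ground truth for the resulting image.
    """
    legacy_lines = [ln for ln in legacy_text.splitlines() if ln.strip()]
    unicode_lines = [ln for ln in unicode_text.splitlines() if ln.strip()]
    if len(legacy_lines) != len(unicode_lines):
        raise ValueError(
            f"line count mismatch: {len(legacy_lines)} legacy lines, "
            f"{len(unicode_lines)} Unicode lines"
        )
    jobs = []
    for font in fonts:
        font_tag = font.replace(" ", "_")
        pairs = zip(legacy_lines, unicode_lines)
        for number, (legacy, uni) in enumerate(pairs, start=1):
            # Hypothetical naming scheme: PREFIX.Font_Name.0001
            jobs.append((f"{prefix}.{font_tag}.{number:04d}", legacy, uni))
    return jobs


# Placeholder inputs standing in for the three files.
fonts = ["Legacy Font A", "Legacy Font B"]
jobs = pair_lines("legacy line 1\nlegacy line 2",
                  "unicode line 1\nunicode line 2",
                  fonts)
for basename, legacy, uni in jobs:
    print(basename, "|", legacy, "|", uni)
```

Pairing the lines up front makes a mismatch between the legacy text and its conversion fail early, before any images are rendered. Adding a font to the list simply produces another full set of line pairs for that font.
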
>>>>>>>>>>>
>>>>>>>>>>> You can use it in conjunction with the tesstrain repo. I plan to
>>>>>>>>>>> add a pull request to the repo with the script, along with some
>>>>>>>>>>> documentation.
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Oct 6, 2019, 07:59 Isurianuradha96 <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> It seems this bash script (legacy.sh) is responsible for mapping
>>>>>>>>>>>> non-Unicode fonts using the legacy mapping (as a legacy-to-Unicode
>>>>>>>>>>>> converter), and also for generating the box, tif and lstmf files.
>>>>>>>>>>>> Am I right? If so, where should I place this script file in
>>>>>>>>>>>> Tesseract? Or should I run it directly before generating the box,
>>>>>>>>>>>> tif and lstmf files? Please correct me if my understanding is
>>>>>>>>>>>> wrong.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Oct 5, 2019 at 10:55 PM Shree Devi Kumar <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> If you use Linux, you can try something similar to the attached
>>>>>>>>>>>>> bash script.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Oct 3, 2019 at 2:55 PM Shree Devi Kumar <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> There is no direct method for training from non-unicode
>>>>>>>>>>>>>> fonts. Tesseract's output is also Unicode text only.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You can work from scanned images of text in non-Unicode fonts
>>>>>>>>>>>>>> and provide the Unicode transcription of it. You could probably
>>>>>>>>>>>>>> use a legacy-to-Unicode converter for the text.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> See https://github.com/tesseract-ocr/tesstrain for training
>>>>>>>>>>>>>> from single line images and their ground truth transcriptions.
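
As a rough sketch of what that layout looks like: per the tesstrain README, each line image (.tif or .png) sits next to a single-line transcription with the same base name and the extension .gt.txt. The helper below is hypothetical, not part of tesstrain. It only reports images without a transcription and transcriptions that are empty or span several lines, and the demo files are empty placeholders created in a temporary folder.

```python
import tempfile
from pathlib import Path


def check_ground_truth(folder):
    """Return a list of problems found in a tesstrain-style ground-truth
    folder of line images and matching .gt.txt transcriptions."""
    problems = []
    folder = Path(folder)
    images = sorted(p for p in folder.iterdir()
                    if p.suffix in {".tif", ".png"})
    for image in images:
        gt = image.with_name(image.stem + ".gt.txt")
        if not gt.exists():
            problems.append(f"{image.name}: missing {gt.name}")
            continue
        text = gt.read_text(encoding="utf-8")
        lines = [ln for ln in text.splitlines() if ln.strip()]
        if not lines:
            problems.append(f"{gt.name}: empty transcription")
        elif len(lines) > 1:
            problems.append(f"{gt.name}: more than one line")
    return problems


# Demo with placeholder files; the images are empty stand-ins.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.tif").write_bytes(b"")
    (root / "a.gt.txt").write_text("\u0d85\u0d82\u0d9a\n", encoding="utf-8")
    (root / "b.png").write_bytes(b"")
    (root / "c.tif").write_bytes(b"")
    (root / "c.gt.txt").write_text("one\ntwo\n", encoding="utf-8")
    problems = check_ground_truth(root)

for problem in problems:
    print(problem)
```

For text in non-Unicode fonts, the .gt.txt content would be the Unicode transcription of the line, for instance the output of a legacy-to-Unicode converter.
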
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Oct 3, 2019 at 2:27 PM isuri anuradha <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As you mentioned, Tesseract 4.0 only supports Unicode fonts.
>>>>>>>>>>>>>>> What is the procedure if we want to train with non-Unicode
>>>>>>>>>>>>>>> fonts? Most of the documents written in Sri Lanka are in
>>>>>>>>>>>>>>> non-Unicode fonts, and there are many historical books written
>>>>>>>>>>>>>>> in non-Unicode fonts.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>>>> To unsubscribe from this group and stop receiving emails
>>>>>>>>>>>>>>> from it, send an email to
>>>>>>>>>>>>>>> [email protected].
>>>>>>>>>>>>>>> To view this discussion on the web visit
>>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/a280b31b-f2c3-494e-a69e-ac3e36f02382%40googlegroups.com
>>>>>>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/a280b31b-f2c3-494e-a69e-ac3e36f02382%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ____________________________________________________________
>>>>>>>>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Kind Regards,
>>>>>>>>>>>> Isuri Anuradha.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>

