This requires you to create three input files:
1. A list of legacy fonts, e.g. the FM series, which all use the same
mapping for Sinhala.
2. Training text in the legacy font encoding. It will usually show up as
garbled English.
3. The above legacy text converted to Unicode with an existing legacy to
Unicode converter. These are available online.
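
Before running anything, it can help to check that the three inputs agree.
The sketch below is only an illustration, not part of the attached script.
It assumes the legacy text and the Unicode text are line-aligned (line N of
one is the transcription of line N of the other) and that the font list has
one font name per line. The function names and file layout are hypothetical.

```python
from pathlib import Path


def load_fonts(fonts_path):
    """Read one font name per line, ignoring blank lines."""
    lines = Path(fonts_path).read_text(encoding="utf-8").splitlines()
    return [name.strip() for name in lines if name.strip()]


def load_pairs(legacy_path, unicode_path):
    """Pair each legacy-encoded line with its Unicode transcription.

    Assumption: both files are line-aligned, so a differing line count
    means the converter merged or dropped a line somewhere.
    """
    legacy = Path(legacy_path).read_text(encoding="utf-8").splitlines()
    uni = Path(unicode_path).read_text(encoding="utf-8").splitlines()
    if len(legacy) != len(uni):
        raise ValueError(
            f"line count mismatch: {len(legacy)} legacy vs {len(uni)} Unicode"
        )
    # Drop pairs where either side is blank: nothing to render or train on.
    return [(l, u) for l, u in zip(legacy, uni) if l.strip() and u.strip()]
```

Each returned pair is (text to render in the legacy font, Unicode ground
truth for that rendered line).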

Using these three files, this script generates tif image files, WordStr box
files and lstmf files. It also creates a unicharset and an all-lstmf file.

You can use it in conjunction with the tesstrain repo. I plan to open a pull
request against the repo with the script, along with some documentation.
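
For context on how the all-lstmf file is used afterwards: as far as I
understand the tesstrain Makefile, it splits that list of lstmf paths into
a training list and an evaluation list. The sketch below only illustrates
that step. The function name is hypothetical, and the 0.9 ratio and fixed
seed are assumptions chosen for the example.

```python
import random


def split_lstmf_list(paths, ratio_train=0.9, seed=0):
    """Shuffle lstmf paths reproducibly and split into (train, eval)."""
    shuffled = sorted(paths)  # sort first so the shuffle is repeatable
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * ratio_train)
    return shuffled[:cut], shuffled[cut:]
```

With ten lstmf files and the default ratio, nine paths go to training and
one to evaluation.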

On Sun, Oct 6, 2019, 07:59 Isurianuradha96 <[email protected]>
wrote:

> It seems this bash script (legacy.sh) handles the mapping of non-Unicode
> fonts with a legacy mapping (acting as a legacy to Unicode converter), and
> that it also generates the box, tif and lstmf files. Am I right? If so,
> where should I place this script in tesseract? Or should I run it directly
> before the generation of the box, tif and lstmf files? Please correct me
> if my understanding is wrong.
>
> Thank you.
>
> On Sat, Oct 5, 2019 at 10:55 PM Shree Devi Kumar <[email protected]>
> wrote:
>
>> If you use Linux, you can try something similar to the attached bash script.
>>
>> On Thu, Oct 3, 2019 at 2:55 PM Shree Devi Kumar <[email protected]>
>> wrote:
>>
>>> There is no direct method for training from non-Unicode fonts.
>>> Tesseract's output is also Unicode text only.
>>>
>>> You can work from scanned images of text in non-Unicode fonts and
>>> provide the Unicode transcription of it. You could probably use a legacy
>>> to Unicode converter for the text.
>>>
>>> See https://github.com/tesseract-ocr/tesstrain for training from single
>>> line images and their ground truth transcriptions.
>>>
>>> On Thu, Oct 3, 2019 at 2:27 PM isuri anuradha <[email protected]>
>>> wrote:
>>>
>>>> As you mentioned, tesseract 4.0 only supports Unicode fonts. What is
>>>> the procedure if we want to train with non-Unicode fonts? Most of the
>>>> documents written in Sri Lanka are in non-Unicode fonts, and there are
>>>> lots of historical books available that were written in non-Unicode
>>>> form.
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/a280b31b-f2c3-494e-a69e-ac3e36f02382%40googlegroups.com
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/a280b31b-f2c3-494e-a69e-ac3e36f02382%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>
>>>
>>>
>>> --
>>>
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>
>>
>>
>
>
> --
> Kind Regards,
> Isuri Anuradha.
>
>

