Re: [tesseract-ocr] Documentation related to lang data

Pubudu Tharaka Viswakula Sat, 15 Sep 2018 09:16:56 -0700

Hi Shree,

Thank you very much for clarifying this. I have one more question.

When we create the training files, a traineddata file is produced. It's
better to include the word-dawg, punc-dawg and number-dawg in it. Are they
created from the same files we're talking about, i.e. sin.numbers, sin.punc
and sin.wordlist? As you explained in reply to an earlier question, they get
included in the traineddata file when we use tesstrain.sh, right?

Thanks

On Sat, 15 Sep 2018, 18:53 Shree Devi Kumar, <[email protected]> wrote:

> *desired_characters*
>
> This is used by Google internally when creating the training text.
>
> Should I enter all those compound character combinations to this file?
>
> No, since this is not used by tesstrain.sh, at least not in the open-source
> version on GitHub.
>
> *okfonts.txt*
>
> This lists the Unicode fonts used for the LSTM training.
>
> Can I include non Unicode fonts into this file?
>
> NO. Because the rendered text will be incorrect.
>
> *sin.numbers*
>
> This file includes all the number characters used in Sinhala.
>
> Unless something has changed in Google's internal training method, this
> should NOT have the number characters. Rather, it should have patterns of
> how numbers may be formatted when used in this language. It may help to look
> at the eng.numbers file for reference.
>
> *sin.punc*
> In lang data this contains punctuation combinations.
>
> Similar to the numbers file, this should have patterns of punctuation
> characters used in the language. Again, refer to eng.punc for reference.
>
>
> *sin.singles_text*
>
> Similar file to wordlist. Contains unique words followed by a new line
>
> In Devanagari it also has unique/rare syllables (compound character
> combinations). Without having the scripts used by Ray (Google) for
> training, it is difficult to say how this is used. I am guessing that these
> are used in addition to training_text to build the unicharset.
>
> *sin.training_text*
>
> The training_text in langdata_lstm seems to be random words, numbers and
> phrases (based on English and Devanagari). So this may be based on word
> frequencies in the language. While Ray's notes on training say to use text
> that is representative of the language or text to be recognized, the
> training_text does not seem to be full sentences. It's possible that this
> kind of training_text gives better results with LSTM for recognizing
> text/words not seen before. I do not really know.
>
> *sin.unicharset*
>
> This file will be created when creating training data
>
> Yes, please check the sin.lstm-unicharset in the sin.traineddata files to
> check that all required characters are there.
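
One way to run that check, as a rough sketch: unpack the traineddata (for
example with `combine_tessdata -u sin.traineddata sin.`, which should write
out sin.lstm-unicharset among the other components) and compare its entries
against the characters you need. The function names below are made up for
illustration. The parsing assumes only that the first line of a unicharset
is the entry count and that each following line begins with the unichar
itself; a required character counts as present if it occurs inside any
entry, since entries may hold more than one code point.

```python
def read_unichars(unicharset_text):
    """Return the set of unichars listed in the text of a unicharset file."""
    lines = unicharset_text.splitlines()
    unichars = set()
    for line in lines[1:]:  # the first line is assumed to be the entry count
        if not line.strip():
            continue
        token = line.split(" ", 1)[0]
        # Assumption: the space character is stored under the name NULL.
        unichars.add(" " if token == "NULL" else token)
    return unichars


def missing_characters(required, unicharset_text):
    """Return the required characters that occur in no unicharset entry."""
    entries = read_unichars(unicharset_text)
    return sorted(r for r in required if not any(r in e for e in entries))
```

To use it, read the unpacked file with `open(path, encoding="utf-8").read()`
and pass the text in together with your own list of required characters.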
>
> *sin.wordlist*
>
> Contains unique words followed by a new line
>
> This dictionary as well as punc and numbers are used to create dawg files
> which are stored in traineddata files and provide some improvement in
> recognition.
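
For the wordlist itself, a minimal sketch of producing "unique words, one
per line" from a text corpus might look like the following. It is only an
illustration: the function names are hypothetical, the tokenization
(whitespace split plus trimming punctuation at the edges of each token) is
an assumption, and nothing here is known to match how the langdata_lstm
wordlists were actually built. The dawg files would then be produced from
such a list by the Tesseract training tools (wordlist2dawg), which this
sketch does not invoke.

```python
import unicodedata


def build_wordlist(text):
    """Return the unique words in text, sorted, with edge punctuation removed."""
    words = set()
    for token in text.split():
        start, end = 0, len(token)
        # Trim punctuation (Unicode categories starting with "P") at both ends.
        while start < end and unicodedata.category(token[start]).startswith("P"):
            start += 1
        while end > start and unicodedata.category(token[end - 1]).startswith("P"):
            end -= 1
        if start < end:
            words.add(token[start:end])
    return sorted(words)


def format_wordlist(words):
    """Render the words one per line, each followed by a newline."""
    return "".join(word + "\n" for word in words)
```

Combining marks such as Sinhala vowel signs are not in the punctuation
categories, so they stay attached to their words.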
>
> -------------------------
>
> What you could do is create a file with all valid characters and syllables
> (compound character combinations) for Sinhala. Then use this file as input
> and grep the sin.training_text in langdata_lstm to make sure that all combos
> are included in your training text for fine tuning.
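
The grep-style check suggested above could be sketched in Python roughly as
follows. The function names are hypothetical, and the matching is plain
substring search, so a bare consonant will also match wherever it appears
as part of a longer compound.

```python
def syllable_counts(syllables, training_text):
    """Count how often each syllable occurs in the training text."""
    return {s: training_text.count(s) for s in syllables}


def missing_syllables(syllables, training_text):
    """Return the syllables that never occur in the training text."""
    counts = syllable_counts(syllables, training_text)
    return [s for s in syllables if counts[s] == 0]
```

In practice the syllables would come from your own file (one per line) and
the text from sin.training_text, both read as UTF-8. Anything returned by
`missing_syllables` is a candidate to add to the fine-tuning text.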
>
>
>
> On Sat, Sep 15, 2018 at 7:43 PM, Shandigutt <[email protected]> wrote:
>
>> Hi,
>>
>> I downloaded the latest LSTM langdata from the tesseract repository. I found
>> that it contains a lot of incorrect data for Sinhala. I'm trying to train
>> Tesseract for Sinhala. According to the Tesseract wiki guidelines, we need
>> to create the lang data before creating training data with the tesstrain.sh
>> script. I'm referring to the wiki guidelines below:
>>
>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>
>> I couldn't find proper wiki guidelines on creating lang data. When I
>> inspected the 'sin' folder in langdata_lstm I found the following files:
>>
>>
>>    - desired_characters
>>    - okfonts.txt
>>    - sin.numbers
>>    - sin.punc
>>    - sin.singles_text
>>    - sin.training_text
>>    - sin.unicharset
>>    - sin.wordlist
>>
>>
>> Please let me know if there is proper documentation that I can follow to
>> create these files on my own from scratch. Based on my observations, I have
>> the following understanding of these files. If there is no proper
>> documentation for them, please correct me if I state anything wrong here:
>>
>> *desired_characters*
>>
>> This file contains all the unique characters found in the language, each
>> character followed by a new line. My question is this: the Sinhala language
>> has many vowel characters that combine with Sinhala consonants to form
>> compound characters. Unlike in English, once a vowel character is attached
>> to a consonant it usually creates a single compound character, which I can
>> erase with a single press of the backspace key. Please refer to the examples
>> below:
>>
>> Example 1:
>>
>> Consonant : ද
>>
>> Vowel character : ො
>>
>> Compound character : දො
>>
>> Example 2:
>>
>> Consonant : බ
>>
>> Vowel character : ්
>>
>> Compound character : බ්
>>
>> So each consonant combined with the different vowel characters makes a lot
>> of compound characters. Should I enter all those compound character
>> combinations into this file?
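
To make the size of that combination space concrete, here is a small sketch
that enumerates consonant + sign pairs. It uses only the two consonants and
two signs from the examples above; the function name is hypothetical, and a
full list for Sinhala would have to be filled in by someone who knows which
combinations are actually valid. In Unicode text each such compound is a
sequence of two code points, even though it is displayed and edited as one
unit.

```python
from itertools import product

# Characters taken from the two examples above, written as escapes.
CONSONANTS = ["\u0daf", "\u0db6"]  # ද, බ
SIGNS = ["\u0ddc", "\u0dca"]       # ො, ්


def compound_characters(consonants, signs):
    """Return every consonant followed by every sign, in input order."""
    return [c + s for c, s in product(consonants, signs)]


if __name__ == "__main__":
    for compound in compound_characters(CONSONANTS, SIGNS):
        code_points = " ".join(f"U+{ord(ch):04X}" for ch in compound)
        print(compound, code_points)
```

A list generated this way could be written out one entry per line and used
as the input when checking a training text for coverage.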
>>
>>
>> *okfonts.txt*
>>
>> This file includes the fonts I use in my training_text. The format is a font
>> name followed by a new line. Can I include non-Unicode fonts in this file?
>>
>> *sin.numbers*
>>
>> This file includes all the number characters used in Sinhala, each number
>> character followed by a new line. Normally this contains only 10 characters.
>>
>> *sin.punc*
>>
>> This file contains all the punctuation characters that can be used in
>> Sinhala text. The format is a punctuation character followed by a new line.
>> In the lang data this contains punctuation combinations. Please explain why.
>>
>> *sin.singles_text*
>>
>> Similar file to wordlist. Contains unique words followed by a new line
>>
>> *sin.training_text*
>>
>> Training text to be used when creating training data. It should contain
>> around 40000 lines of text. Each line can have any number of characters.
>> It’s better if this document contains text in the multiple fonts that we
>> have defined in okfonts.txt. (These fonts can be passed as a command line
>> argument as well.)
>>
>> *sin.unicharset*
>>
>> This file will be created when creating training data
>>
>> *sin.wordlist*
>>
>> Contains unique words followed by a new line
>>
>> Appreciate your response on this.
>>
>> Thanks
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/9e8be0bf-b0d5-4408-98b7-283913ccf642%40googlegroups.com
>> <https://groups.google.com/d/msgid/tesseract-ocr/9e8be0bf-b0d5-4408-98b7-283913ccf642%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
>

