Also see the community-contributed Perl script for generating langdata in
https://github.com/tesseract-ocr/tesseract/tree/master/contrib

On Fri 6 Jul, 2018, 10:52 PM Shree Devi Kumar, <shreesh...@gmail.com> wrote:

> See Ray's comment on building training data at the following link:
>
>
> https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951
>
> On Fri 6 Jul, 2018, 10:38 PM James Q, <james.quitten...@taina.tech> wrote:
>
>> No tool that I can think of. What I would do is open the file in an editor
>> that handles large text files (such as EmEditor) and remove duplicate
>> words. You could do this by replacing all spaces with newlines, then
>> sorting and removing duplicates. After that you can randomize the unique
>> list of words, add an appropriate distribution of punctuation characters,
>> and re-edit to create a block of text wrapped at, say, 100 characters.
>> There are online tools to do the randomizing and wrapping.
>>
>> Having said this, I don't know how valuable it is to have training text
>> containing specific words. I have been struggling myself to train on
>> specific word lists without much success. I think training text is just
>> about a representative distribution of characters. Please let me know if
>> you have any insights on the wordlists in langdata, as I'm a bit hazy there.
>>
>> Thanks
>> James
>>
>>
>>
>> On Wednesday, July 4, 2018 at 9:02:13 AM UTC+1, Dd U wrote:
>>>
>>> Hello guys.
>>>
>>>
>>> I want to add a new language script to Tesseract OCR and am researching
>>> training data.
>>>
>>>
>>> I would like to know the following:
>>>
>>>    1. Is there any automatic tool that makes the langdata training_text
>>>    and wordlist files from a massive text?
>>>    2. Is there any documentation about preparing text data, with an
>>>    explanation of the text data files? I just saw the directory
>>>    langdata/jpn/ and there are some files, but I have no idea what these
>>>    files are or how to create files like them. What rules should I use to
>>>    create langdata files?
>>>
>>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-ocr+unsubscr...@googlegroups.com.
>> To post to this group, send email to tesseract-ocr@googlegroups.com.
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/ccc8505c-216f-450a-9627-d85b2c9e21a9%40googlegroups.com
>> <https://groups.google.com/d/msgid/tesseract-ocr/ccc8505c-216f-450a-9627-d85b2c9e21a9%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
>
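
The manual steps James describes in the quoted message (split on spaces, drop duplicates, randomize, sprinkle in punctuation, wrap at about 100 characters) could also be scripted. What follows is only a rough sketch under stated assumptions, not a Tesseract tool: the function name build_training_text, the punctuation set, and the 10% punctuation rate are all made up for illustration, and splitting on whitespace will not work for scripts that do not separate words with spaces.

```python
import random
import textwrap


def build_training_text(corpus, punctuation=".,;:!?", punct_rate=0.1,
                        width=100, seed=0):
    """Return wrapped text made of the unique words in `corpus`, shuffled.

    `punctuation` and `punct_rate` are illustrative guesses; tune them to
    match the punctuation distribution of the target language.
    """
    rng = random.Random(seed)  # fixed seed so the output is reproducible

    # Split on whitespace, drop duplicates, and sort so the order before
    # shuffling is deterministic.
    words = sorted(set(corpus.split()))
    rng.shuffle(words)

    # Append one punctuation character to roughly `punct_rate` of the words.
    tokens = [
        w + rng.choice(punctuation) if rng.random() < punct_rate else w
        for w in words
    ]

    # Wrap to `width` columns without splitting words; a single word longer
    # than `width` is left whole on its own line.
    return textwrap.fill(" ".join(tokens), width=width,
                         break_long_words=False, break_on_hyphens=False)


if __name__ == "__main__":
    sample = "the cat sat on the mat and the dog sat on the log"
    print(build_training_text(sample, width=40))
```

For a genuinely massive text it may be better to read the file in chunks and add the words to the set incrementally instead of holding the whole corpus in one string.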
