Also see the community-contributed Perl script for generating langdata at https://github.com/tesseract-ocr/tesseract/tree/master/contrib
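For the manual procedure James describes in the quoted message below (split the text into words, remove duplicates, shuffle, sprinkle in punctuation, and wrap at about 100 characters), a minimal Python sketch could look like this. It is only an illustration: the function name `build_training_text`, the punctuation set, and the 10% punctuation rate are assumptions made for the example, not part of any Tesseract tooling, and it assumes a script whose words are separated by whitespace.

```python
import random
import textwrap


def build_training_text(corpus, width=100, punct=",.;:!?", punct_rate=0.1, seed=0):
    """Return wrapped text built from the unique words of `corpus`, shuffled.

    All parameter names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)            # fixed seed so runs are repeatable
    words = sorted(set(corpus.split()))  # split on whitespace, drop duplicates
    rng.shuffle(words)                   # randomize the unique word list
    # Attach a punctuation mark to roughly `punct_rate` of the words.
    words = [w + rng.choice(punct) if rng.random() < punct_rate else w
             for w in words]
    # Re-flow into lines of at most `width` characters; a single word longer
    # than `width` is kept whole rather than split.
    return textwrap.fill(" ".join(words), width=width,
                         break_long_words=False, break_on_hyphens=False)


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog the fox"
    print(build_training_text(sample, width=40))
```

For a script without spaces between words (Japanese, for example), the `corpus.split()` step would have to be replaced by a proper word segmenter. The punctuation rate would also need tuning to match the real distribution in the language.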
On Fri 6 Jul, 2018, 10:52 PM Shree Devi Kumar, <shreesh...@gmail.com> wrote:

> See the following link to a comment by Ray regarding building the
> training data:
>
> https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951
>
> On Fri 6 Jul, 2018, 10:38 PM James Q, <james.quitten...@taina.tech> wrote:
>
>> No tool I can think of. What I would do is edit the file in an editor
>> that handles large text files (such as EmEditor) to remove duplicate
>> words. You could do this by replacing all spaces with newlines, then
>> sorting and removing duplicates. After that you can randomize the unique
>> list of words, add an appropriate distribution of punctuation
>> characters, and re-edit to create a block of text wrapped at, say, 100
>> characters. There are online tools to do the randomizing and wrapping.
>>
>> Having said this, I don't know how valuable it is to have training text
>> containing specific words. I have been struggling myself to train on
>> specific word lists without much success. I think training text is just
>> about a representative distribution of characters. Please let me know if
>> you have any insights on the wordlists in langdata, as I'm a bit hazy
>> there.
>>
>> Thanks
>> James
>>
>> On Wednesday, July 4, 2018 at 9:02:13 AM UTC+1, Dd U wrote:
>>>
>>> Hello guys.
>>>
>>> I want to add a new language script to Tesseract OCR and am
>>> researching training data.
>>>
>>> I would like to know the following:
>>>
>>> 1. Is there any automatic tool that makes the langdata training_text
>>>    and wordlist files from a massive text?
>>> 2. Is there any documentation about preparing the text data, with an
>>>    explanation of the text data files? I just saw the directory
>>>    langdata/jpn/ and there are some files, but I have no idea what
>>>    these files are or how to create files like them. What rules
>>>    should I follow to create langdata files?
>>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-ocr+unsubscr...@googlegroups.com.
>> To post to this group, send email to tesseract-ocr@googlegroups.com.
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/ccc8505c-216f-450a-9627-d85b2c9e21a9%40googlegroups.com
>> For more options, visit https://groups.google.com/d/optout.