Re: [tesseract-ocr] traning devanagari: »Encoding of string failed!«

2019-04-23 Thread Jochen Barth
Dear Shree, I've tried it with the format below and combined letter-and-sign-symbols (see attached file) and with WordStr-Format (see attached file), but still the same error... Kind regards, Jochen Am 18.04.19 um 17:40 schrieb Shree Devi Kumar: The following format (as in your box file) will

Re: [tesseract-ocr] traning devanagari: »Encoding of string failed!«

2019-04-23 Thread Shree Devi Kumar
Hello Jochen, I prefer the Wordstr format since it is easier to correct the text with ground truth, so I have not tested with the lstmbox file. A cursory glance at the file shows that the lstmbox file does not have lines with spaces between words. Another point to remember when training with ima

Re: [tesseract-ocr] traning devanagari: »Encoding of string failed!«

2019-04-23 Thread Shree Devi Kumar
zip file is too big. Let me do an alternative upload. Training runs ok for me - Warning: LSTMTrainer deserialized an LSTMRecognizer! Continuing from /home/ubuntu/tessdata_best/script/Devanagari.lstm Loaded 13/13 lines (1-13) of document NKP/dp10.lstmf Loaded 13/13 lines (1-13) of document NKP/dp1

Re: [tesseract-ocr] traning devanagari: »Encoding of string failed!«

2019-04-23 Thread Shree Devi Kumar
Uploaded the files at https://github.com/Shreeshrii/tessdata_sanskrit See NKP.sh and folder NKP The first part of the script loops through the images and creates Wordstr box files for them using tesseract. It then uses sed to replace the recognized text with the ground truth. This corrected box file
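
The text-replacement step described in this message can also be sketched in Python instead of sed. This is a minimal sketch, not the NKP.sh script itself. It assumes that a WordStr box line has the form `WordStr <left> <bottom> <right> <top> <page> #<text>` and that the line is passed in without its trailing newline; the function name `replace_wordstr_text` is hypothetical.

```python
def replace_wordstr_text(box_line: str, ground_truth: str) -> str:
    """Replace the recognized text after '#' in a WordStr box line.

    Assumption: the line looks like
    'WordStr <left> <bottom> <right> <top> <page> #<text>'
    and has no trailing newline.
    """
    if not box_line.startswith("WordStr "):
        # Leave every other line (for example the tab line that follows
        # each WordStr line) untouched.
        return box_line
    head, separator, _recognized = box_line.partition("#")
    if not separator:
        # No '#' found: nothing to replace.
        return box_line
    return head + "#" + ground_truth
```

Only the part after the first `#` is swapped, so the coordinates written by tesseract stay as they were.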

Re: [tesseract-ocr] traning devanagari: »Encoding of string failed!«

2019-04-23 Thread Jochen Barth
Thanks a lot. The error seems to be the missing space after the tab character in line below »WordStr«! Kind regards, Jochen Am 23.04.19 um 12:02 schrieb Shree Devi Kumar: Uploaded the files at https://github.com/Shreeshrii/tessdata_sanskrit See NKP.sh and folder NKP The first part of the
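
The cause reported here (a tab character that is not followed by a space) can be checked for mechanically. The following is a minimal sketch under the assumption that the affected lines are exactly those starting with a tab; the function name `lines_missing_space_after_tab` and the sample coordinates are hypothetical.

```python
def lines_missing_space_after_tab(box_text: str) -> list[int]:
    """Return 1-based numbers of lines that start with a tab
    which is not immediately followed by a space."""
    bad_lines = []
    for number, line in enumerate(box_text.split("\n"), start=1):
        if line.startswith("\t") and not line.startswith("\t "):
            bad_lines.append(number)
    return bad_lines


# Hypothetical box file content: the second tab line lacks the space.
sample = (
    "WordStr 0 0 10 10 0 #a\n"
    "\t 11 0 15 10 0\n"
    "WordStr 0 20 10 30 0 #b\n"
    "\t11 20 15 30 0"
)
print(lines_missing_space_after_tab(sample))
```

Running a check like this over each box file before training may help locate the offending line more quickly than reading the error message alone.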

Re: [tesseract-ocr] traning devanagari: »Encoding of string failed!«

2019-04-23 Thread Shree Devi Kumar
Glad you figured out the problem. Please consider sharing the improved traineddata file (when you complete training) for tessdata_contrib repo. On Tue, 23 Apr 2019, 16:24 Jochen Barth, wrote: > Thanks a lot. > > The error seems to be the missing space after the tab character in line > below »Wo

[tesseract-ocr] Why having three different forms for a word in eng.lstm-word-dawg?

2019-04-23 Thread Hongyu Zhou
It seems like there are three forms for a word stored in the eng.lstm-word-dawg. For example the word 'book' has three different forms: lower case (book), upper case (BOOK) and title case (Book). When we check whether a word is in the dictionary or not, do we really care about their forms? Whe

Re: [tesseract-ocr] unable to "make training"

2019-04-23 Thread Tairen Chen
Hi, Zdenko, My ".configure" log is following and I think I found the issue. Let me post my log file first: """ checking for g++... g++ checking whether the C++ compiler works... yes checking for C++ compiler default output file name... a.out checking for suffix of executables... checking wheth

Re: [tesseract-ocr] small image and OCR

2019-04-23 Thread alex kelly
Thanks for getting back to me. When I run it I get an error. Any ideas why and how to resolve it? pi@ShopFloorOCRReader:~ $ tesseract --psm 6 "test_images/cropped_image.jpg" Tesseract Open Source OCR Engine v3.04.01 with Leptonica read_params_file: parameter not found: On Sunday, 14 Ap

Re: [tesseract-ocr] small image and OCR

2019-04-23 Thread Zdenko Podobny
What about: tesseract --help ;-) Zdenko ut 23. 4. 2019 o 16:59 alex kelly napísal(a): > Thanks for getting back to me. When i run it i get an error, any ideas > why and how to resolve it? > > pi@ShopFloorOCRReader:~ $ tesseract --psm 6 > "test_images/cropped_image.jpg" > Tesseract Open Source

[tesseract-ocr] what is the size of fine tuned traineddata

2019-04-23 Thread Shanshan Wang
Hi, I used ocrd_train for fine-tune training; the start model is eng.traineddata from tessdata_best. The traineddata file I got after training is around 11-12MB, smaller than the original eng.traineddata, which is 15MB. Is this normal, or have I done something wrong during training? I have added

Re: [tesseract-ocr] small image and OCR

2019-04-23 Thread Lorenzo Bolzani
Hi, I suspect you did a cut and paste or some edits and now you have some non-printable characters in your command line (the question mark boxes). Write it again from scratch. And you are missing one parameter in the command line, the output file, you can use "-" for standard output. $ tesseract
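
One way to confirm the suspicion about pasted non-printable characters is to scan the command string before running it. This is a minimal sketch, not part of tesseract; the function name `find_suspect_characters` is hypothetical, and it flags every character that is non-ASCII or non-printable, so a file name with legitimate non-ASCII letters would be reported as well.

```python
def find_suspect_characters(command: str) -> list[tuple[int, str]]:
    """Return (index, code point) pairs for characters in a command line
    that are non-ASCII or non-printable."""
    suspects = []
    for index, char in enumerate(command):
        if not char.isascii() or not char.isprintable():
            suspects.append((index, f"U+{ord(char):04X}"))
    return suspects


# Hypothetical example: a non-breaking space pasted in place of a normal space.
print(find_suspect_characters("tesseract\u00a0--psm 6 image.jpg -"))
```

An empty result suggests the command contains only plain printable ASCII; any reported index points at a character worth retyping by hand.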

Re: [tesseract-ocr] unable to "make training"

2019-04-23 Thread Zdenko Podobny
please provide output of command: dpkg -l | cut -d " " -f 3 | grep "icu\|cairo\|pango\|pkg-config" and config.log file (you can compress it with e.g. gzip before sending) Zdenko ut 23. 4. 2019 o 16:30 Tairen Chen napísal(a): > Hi, Zdenko, > My ".configure" log is following and I think I

Re: [tesseract-ocr] unable to "make training"

2019-04-23 Thread Tairen Chen
Hi, Zdenko, Thank you for your reply. I uninstalled every package that I had installed and removed the unzipped packages. Then I followed the link to install again. Now the training is working. :-) Thank you for pointing me to the configure file and I understand what is "

Re: [tesseract-ocr] Re: Can I use this way for fine tuning?

2019-04-23 Thread 易鑫
>Why not just use ocrd for fine tune training? Just set up your START_MODEL as chi_sim. Because I have trained a chi_sim model from Tesseract-OCR, and I don't have too many sample images. Shanshan Wang 于2019年4月22日周一 下午8:34写道: > Why not just use ocrd for fine tune training? Just set up your STA

[tesseract-ocr] How to integrating Tesseract 4.0 to android platform

2019-04-23 Thread Ravneet Kaur
I am using the Tess-Two library to train Punjabi and Hindi language data for the Android platform (Android Studio 3.3.1) and am getting a spacing issue (recognized words are not separated by spaces). While searching for a solution I found that this issue is resolved in tesseract 4.0, but how can I use the same for