Re: [tesseract-ocr] what mean updatesubtrainer?

2019-06-18 Thread Pndaza
I mistakenly passed the old traineddata (mya-layer.traineddata) to lstmtraining --traineddata instead of the starter traineddata, so I am retraining. I will report the result. While extracting the unicharset, unicharset_extractor says: Two grapheme links in a row:0x103a 0x1039X > Invalid start of Myanmar s

[tesseract-ocr] Re: OCR pipeline with OpenCV

2019-06-18 Thread ElGato ElMago
Those images and fonts are obviously not suited to OCR. You need to improve the images and train on the fonts. Do you only need to read temperatures? Then a pattern recognition method in OpenCV might be easier to work with. On Wednesday, 19 June 2019 07:16:21 UTC+9, Mox Betex wrote: > Did you train Tesseract? > > Image is of poo

[tesseract-ocr] Re: Multiline tiff/txt

2019-06-18 Thread ElGato ElMago
To read with tesseract? Why not? On Tuesday, 18 June 2019 19:11:23 UTC+9, Mox Betex wrote: > > Can I use multiline tiff/txt files instead of single line tiff/txt? > -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receivi

[tesseract-ocr] Re: OCR pipeline with OpenCV

2019-06-18 Thread Mox Betex
Did you train Tesseract? The image is of poor quality for OCR; you have to improve it. Also check the resolution of the image.

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-18 Thread Jingjing Lin
Can you please test on arrows (↑ or ↓) instead of ±, if it's not inconvenient for you? On Tuesday, 18 June 2019 at 2:21:18 PM UTC-4, shree wrote: > > I will test tomorrow and let you know > > On Tu

[tesseract-ocr] Re: tesseract datasets

2019-06-18 Thread Jingjing Lin
I believe they are the .training_text files in langdata or langdata_lstm. https://github.com/tesseract-ocr On Tuesday, 18 June 2019 at 8:52:49 AM UTC-4, Lakshmikanth Chowdary wrote: > > tesseract is trained using which datasets ?can someone help me how to get > those datasets

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-18 Thread Jingjing Lin
Thanks a lot! On Tuesday, 18 June 2019 at 2:21:18 PM UTC-4, shree wrote: > > I will test tomorrow and let you know > > On Tue, 18 Jun 2019, 23:47 Jingjing Lin, > > wrote: > >> It still couldn't work after I increased the number of ± to about 100. >> And the error rate after 2000 iterations is about 11. This is a p

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-18 Thread Shree Devi Kumar
I will test tomorrow and let you know On Tue, 18 Jun 2019, 23:47 Jingjing Lin, wrote: > It still couldn't work after I increased the number of ± to about 100. And > the error rate after 2000 iterations is about 11. This is a pretty high > error rate compare to what we have for adding a few chara

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-18 Thread Jingjing Lin
It still didn't work after I increased the number of ± to about 100. And the error rate after 2000 iterations is about 11. This is a pretty high error rate compared to what we have for adding a few characters to eng. With such a high error rate, I would not be surprised that it couldn't recognize

Re: [tesseract-ocr] what mean updatesubtrainer?

2019-06-18 Thread Pndaza
*eval result.* At iteration 0, stage 0, Eval Char error rate=0.44462951, Word error rate=2.4380774 How can I tune it to get an error rate lower than 0.1%? On Tuesday, 18 June 2019 22:11:09 UTC+6:30, shree wrote: > > Convert a few checkpoints to trained data / run lstmeval on them. > > You don't wan
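
To compare several checkpoints, the error rates can be pulled out of eval lines like the one quoted above. This is only a rough sketch: the regular expression is written against that single quoted line, and the function name parse_eval_line is made up for illustration, so other Tesseract versions may print a different format.

```python
import re

# Pattern written against the one eval line quoted in the message above.
# Assumption: other lines printed by the tool follow the same wording.
EVAL_LINE = re.compile(
    r"At iteration (?P<iteration>\d+), stage (?P<stage>\d+), "
    r"Eval Char error rate=(?P<char_error>\d+(?:\.\d+)?), "
    r"Word error rate=(?P<word_error>\d+(?:\.\d+)?)"
)


def parse_eval_line(line):
    """Return iteration, stage and both error rates as a dict, or None if no match."""
    match = EVAL_LINE.search(line)
    if match is None:
        return None
    return {
        "iteration": int(match["iteration"]),
        "stage": int(match["stage"]),
        "char_error": float(match["char_error"]),
        "word_error": float(match["word_error"]),
    }


sample = (
    "At iteration 0, stage 0, Eval Char error rate=0.44462951, "
    "Word error rate=2.4380774"
)
result = parse_eval_line(sample)
print(result)
```

Running this over the output for each checkpoint gives a small table of character and word error rates to pick the best checkpoint from.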

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

2019-06-18 Thread fady taher
It seems the problem was caused by copying *langdata* from Windows to Linux. I re-downloaded the files on Linux and it worked; I will retry. Thanks a lot, shree, for your support. On Tue, Jun 18, 2019 at 5:38 PM Shree Devi Kumar wrote: > Have you modified any word lists, training_text etc? > > What is your

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

2019-06-18 Thread fady taher
It seems the problem was caused by copying *langdata* from Windows to Linux. I re-downloaded the files on Linux and it worked; I will retry. On Tue, 18 Jun 2019, 5:21 pm fady taher, wrote: > the output of > > *src/training/tesstrain.sh --fontlist "Times New Roman" --lang eng > --linedata_only --noextr

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

2019-06-18 Thread Shree Devi Kumar
Have you modified any word lists, training_text etc? What is your tesseract version? Which o/s? On Tue, 18 Jun 2019, 20:51 fady taher, wrote: > the output of > > *src/training/tesstrain.sh --fontlist "Times New Roman" --lang eng > --linedata_only --noextract_font_properties --langdata_dir

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

2019-06-18 Thread fady taher
the output of *src/training/tesstrain.sh --fontlist "Times New Roman" --lang eng --linedata_only --noextract_font_properties --langdata_dir /home/sw/repo/langdata --tessdata_dir /home/sw/repo/tessdata --output_dir ~/tesstutorial/trainplusminus* is *[Tue Jun 18 17:19:46

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

2019-06-18 Thread Shree Devi Kumar
That means src/training/tesstrain.sh --fontlist "Times New Roman" --lang eng --linedata_only --noextract_font_properties --langdata_dir /home/sw/repo/langdata --tessdata_dir /home/sw/repo/tessdata --output_dir ~/tesstutorial/trainplusminus did not complete correctly. On Tue, Jun 18, 2019 at

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

2019-06-18 Thread fady taher
No, this file doesn't exist yet; the directory only contains *eng.charset_size=110.txt* and *eng.unicharset*. On Tue, Jun 18, 2019 at 4:46 PM Shree Devi Kumar wrote: > Check ~/tesstutorial/trainplusminus > Did your earlier training complete correctly? Does > ~/tesstutorial/trainplusminus/eng/eng.traineddata exis

Re: [tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

2019-06-18 Thread Shree Devi Kumar
Check ~/tesstutorial/trainplusminus. Did your earlier training complete correctly? Does ~/tesstutorial/trainplusminus/eng/eng.traineddata exist? On Tue, Jun 18, 2019 at 8:11 PM fady taher wrote: > Am trying to fine tune tesseract > > but I keep getting the error *mgr_.Init(traineddata_path.c_str(
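
The check suggested above can be scripted before starting lstmtraining. A minimal sketch, assuming the output directory layout from the message (<output_dir>/<lang>/<lang>.traineddata); the function name starter_traineddata_ready is made up for illustration.

```python
from pathlib import Path


def starter_traineddata_ready(output_dir, lang="eng"):
    """Return True if <output_dir>/<lang>/<lang>.traineddata exists and is non-empty.

    The layout is taken from the path quoted in the message above; it is an
    assumption that other setups use the same layout.
    """
    path = Path(output_dir).expanduser() / lang / f"{lang}.traineddata"
    return path.is_file() and path.stat().st_size > 0


if __name__ == "__main__":
    # Directory from the thread; adjust to your own output directory.
    print(starter_traineddata_ready("~/tesstutorial/trainplusminus"))
```

If this prints False, the earlier tesstrain.sh run most likely did not finish, and its log is the place to look first.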

[tesseract-ocr] mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110

2019-06-18 Thread fady taher
I am trying to fine-tune Tesseract, but I keep getting the error *mgr_.Init(traineddata_path.c_str()):Error:Assert failed:in file ../../src/lstm/lstmtrainer.h, line 110* on the training statement. My script looks as follows: cd /home/sw/repo/tesseract-ocr mkdir -p ~/tesstutorial/ mkdir -p ~/te

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-18 Thread Shree Devi Kumar
Increase the number of ± to about 100. On Tue, Jun 18, 2019 at 7:39 PM Jingjing Lin wrote: > Sorry to bother you again and again. > I reduced the training text to about 450 lines, with like 30 ± in it. I > used two fonts and iteration of 1000. But it looks like ± is still not > picked up by the
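
Before retraining, it may help to confirm how often the new character really occurs in the training text. A small sketch; the function name count_symbol is hypothetical, and the text would normally be read from your own training text file.

```python
def count_symbol(text, symbol="±"):
    """Return (total occurrences, lines containing the symbol, total lines)."""
    lines = text.splitlines()
    total = text.count(symbol)
    lines_with_symbol = sum(1 for line in lines if symbol in line)
    return total, lines_with_symbol, len(lines)


# Tiny inline sample so the sketch runs on its own.
sample_text = "5 ± 0.1 mm\nno symbol here\n± 2 and ± 3"
total, lines_with_symbol, line_count = count_symbol(sample_text)
print(f"{total} occurrences on {lines_with_symbol} of {line_count} lines")
```

Counting lines as well as occurrences shows whether the character is spread across the text or bunched into a few lines.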

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-18 Thread Jingjing Lin
Sorry to bother you again and again. I reduced the training text to about 450 lines, with about 30 ± in it. I used two fonts and 1000 iterations. But it looks like ± is still not picked up by the BEST OCR TEXT at all; it is always recognized as something else. What is happening here? Should I in

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-18 Thread Jingjing Lin
Thanks for your advice. I'll try reducing the training text size. On Tuesday, 18 June 2019 at 12:28:25 AM UTC-4, shree wrote: > > If you increase the iterations then the plus type of training will not > give good result, i.e. the other letters will lose accuracy. > > You can try to reduce the training text size while

[tesseract-ocr] tesseract datasets

2019-06-18 Thread Lakshmikanth Chowdary
Which datasets is Tesseract trained on? Can someone help me get those datasets?

[tesseract-ocr] Multiline tiff/txt

2019-06-18 Thread Mox Betex
Can I use multiline tiff/txt files instead of single line tiff/txt?

Re: [tesseract-ocr] Custom Tiff/Box pairs support in tesstrain.sh

2019-06-18 Thread Shree Devi Kumar
It should work if your files follow a similar naming convention: lang.xxxnnn.exp0.tif and lang.xxxnnn.exp0.box, where lang is your language code (e.g. eng) and xxxnnn is any unique random string (the font name in files generated by text2image). On Tue, Jun 18, 2019 at 2:54 PM hrishikesh kaulwar wrote: > Greeti
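
The naming convention above can be checked with a short script before running tesstrain.sh. This is a sketch under assumptions: the regular expression generalises exp0 to any exp number and allows only letters and underscores in the language code, neither of which is stated in the message, and the function name check_pairs is made up for illustration.

```python
import re

# lang.xxxnnn.expN.tif / lang.xxxnnn.expN.box
# Assumptions: any exp number is accepted, and xxxnnn contains no dots.
NAME = re.compile(
    r"^(?P<lang>[A-Za-z_]+)\.(?P<tag>[^.]+)\.exp(?P<exp>\d+)\.(?P<ext>tif|box)$"
)


def check_pairs(filenames):
    """Return (stems missing their tif or box partner, names that break the convention)."""
    tif_stems, box_stems, bad_names = set(), set(), []
    for name in filenames:
        match = NAME.match(name)
        if match is None:
            bad_names.append(name)
            continue
        stem = name.rsplit(".", 1)[0]
        if match["ext"] == "tif":
            tif_stems.add(stem)
        else:
            box_stems.add(stem)
    # Symmetric difference: stems present as tif only or as box only.
    return sorted(tif_stems ^ box_stems), bad_names


files = [
    "eng.Arial.exp0.tif",
    "eng.Arial.exp0.box",
    "eng.Times_New_Roman.exp0.tif",
    "scan1.tif",
]
unpaired, bad = check_pairs(files)
print("unpaired:", unpaired)
print("bad names:", bad)
```

In practice the list of names would come from the tiff/box directory, for example via os.listdir.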

[tesseract-ocr] Custom Tiff/Box pairs support in tesstrain.sh

2019-06-18 Thread hrishikesh kaulwar
Greetings, I just learned that tesstrain.sh has been modified to support user-provided box/tiff pairs by adding a tiff/box directory flag. I used that version of the Tesseract source with my own tiff/box pairs. But when I ran tesstrain.sh, I found that it just copies the tiff/box pairs provided

Re: [tesseract-ocr] Re: FontAwesome and Tesseract

2019-06-18 Thread Lorenzo Bolzani
How many different chars do you need to detect? What is the size range (in pixels)? What kind of images: scans, smartphone pictures, screenshots? If you just want to locate the symbols, something like OpenCV matchTemplate may