Please also search for existing MICR traineddata files.

On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <elmagoelg...@gmail.com> wrote:
> So I did several tests from scratch. In the last attempt, I made a
> training text with 4,000 lines in the following format,
>
> 110004310510< <02 :4002=0181:801= 0008752 <00039 ;0000001000;
>
> and combined it with eng.digits.training_text in which symbols are
> converted to E13B symbols. This makes about 12,000 lines of training
> text. It's amazing that this thing generates a good reader out of
> nowhere. But then it is not very good. For example:
>
> <01 :1901=1386:021= 1111001<10001< ;0000090134;
>
> is a result on the image attached. It's close but the last '<' in the
> result text doesn't exist on the image. It's a small failure but it
> causes greater trouble in parsing.
>
> What would you suggest from here to increase accuracy?
>
>    - Increase the number of lines in the training text
>    - Mix up more variations in the training text
>    - Increase the number of iterations
>    - Investigate wrong reads one by one
>    - Or something else?
>
> Also, I referred to engrestrict*.* and could generate a similar result
> with the fine-tuning-from-full method. It seems a bit faster to get to
> the same level but it also stops at a 'good' level. I can go either way
> if it takes me to the bright future.
>
> Regards,
> ElMagoElGato
>
> On Thursday, May 30, 2019 at 15:56:02 UTC+9, ElGato ElMago wrote:
>>
>> Thanks a lot, Shree. I'll look into it.
>>
>> On Thursday, May 30, 2019 at 14:39:52 UTC+9, shree wrote:
>>>
>>> See https://github.com/Shreeshrii/tessdata_shreetest
>>>
>>> Look at the files engrestrict*.* and also
>>> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>
>>> Create training text of about 100 lines and finetune for 400 lines
>>>
>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <elmago...@gmail.com>
>>> wrote:
>>>
>>>> I had about 14 lines as attached. How many lines would you recommend?
>>>>
>>>> Fine tuning gives a much better result but it tends to pick
>>>> characters other than those in E13B, which only has 14 characters,
>>>> 0 through 9 and 4 symbols. I thought training from scratch would
>>>> eliminate such confusion.
>>>>
>>>> On Thursday, May 30, 2019 at 10:43:08 UTC+9, shree wrote:
>>>>>
>>>>> For training from scratch a large training text and hundreds of
>>>>> thousands of iterations are recommended.
>>>>>
>>>>> If you are just fine tuning for a font try to follow instructions for
>>>>> training for impact, with your font.
>>>>>
>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago, <elmago...@gmail.com> wrote:
>>>>>
>>>>>> Thanks, Shree.
>>>>>>
>>>>>> Yes, I saw the instructions. The steps I made are as follows:
>>>>>>
>>>>>> Using tesstrain.sh:
>>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>>>>>>   --noextract_font_properties --langdata_dir ../langdata \
>>>>>>   --tessdata_dir ./tessdata \
>>>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>>>
>>>>>> Training from scratch:
>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>>>   --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
>>>>>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>   --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log
>>>>>>
>>>>>> Test with base_checkpoint:
>>>>>> src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>>>>>
>>>>>> Combining output files:
>>>>>> src/training/lstmtraining --stop_training \
>>>>>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>>>>>
>>>>>> Test with eng.traineddata:
>>>>>> tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput
>>>>>>
>>>>>> The training from scratch ended as:
>>>>>>
>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%,
>>>>>> word train=0%, skip ratio=0%, New best char error = 0 wrote best
>>>>>> model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint wrote
>>>>>> checkpoint.
>>>>>>
>>>>>> The test with base_checkpoint returns nothing as:
>>>>>>
>>>>>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>>>>>
>>>>>> The test with eng.traineddata and e13b.png returns out.txt. Both
>>>>>> files are attached.
>>>>>>
>>>>>> Training seems to have worked fine. I don't know how to interpret
>>>>>> the test result from base_checkpoint. The generated eng.traineddata
>>>>>> obviously doesn't work well. I suspect the choice of --traineddata in
>>>>>> combining output files is bad but I have no clue.
>>>>>>
>>>>>> Regards,
>>>>>> ElMagoElGato
>>>>>>
>>>>>> BTW, I referred to your tess4training in the process. It helped a
>>>>>> lot.
>>>>>>
>>>>>> On Wednesday, May 29, 2019 at 19:14:08 UTC+9, shree wrote:
>>>>>>>
>>>>>>> see
>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files
>>>>>>>
>>>>>>> On Wed, May 29, 2019 at 3:18 PM ElGato ElMago <elmago...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I wish to make trained data for the E13B font.
>>>>>>>>
>>>>>>>> I read the training tutorial and made a base_checkpoint file
>>>>>>>> according to the method in Training From Scratch. Now, how can I
>>>>>>>> make trained data from the base_checkpoint file?
>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>> send an email to tesser...@googlegroups.com.
>>>>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/4848cfa5-ae2b-4be3-a771-686aa0fec702%40googlegroups.com
>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/4848cfa5-ae2b-4be3-a771-686aa0fec702%40googlegroups.com?utm_medium=email&utm_source=footer>.
>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>
>>>>>>> --
>>>>>>> ____________________________________________________________
>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
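
Addendum: the training text discussed in this thread uses only the 14 E13B characters, written as the digits 0 through 9 plus the ASCII stand-ins '<', ':', '=' and ';'. Two of the suggestions raised (more variation in the training text, and checking wrong reads one by one) could be approached with a small script. The Python below is only a sketch under stated assumptions: the function names (make_training_lines, unexpected_chars), the field styles, and the length ranges are hypothetical and are not taken from the thread or from Tesseract, and real MICR lines follow layouts that this random generator does not model.

```python
import random

# ASCII stand-ins for the four E13B symbols as they appear in the thread.
# Which stand-in corresponds to which MICR symbol is not stated there.
DIGITS = "0123456789"
SYMBOLS = "<:=;"
ALPHABET = set(DIGITS + SYMBOLS)


def random_field(rng, min_len=2, max_len=12):
    """Return a run of random digits, optionally bracketed by one symbol.

    The four styles (plain, prefix, suffix, wrapped) are an assumption made
    for variety; they are not a description of any real MICR layout.
    """
    length = rng.randint(min_len, max_len)
    digits = "".join(rng.choice(DIGITS) for _ in range(length))
    style = rng.choice(["plain", "prefix", "suffix", "wrapped"])
    symbol = rng.choice(SYMBOLS)
    if style == "prefix":
        return symbol + digits
    if style == "suffix":
        return digits + symbol
    if style == "wrapped":
        return symbol + digits + symbol
    return digits


def make_training_lines(count, seed=0, min_fields=3, max_fields=6):
    """Generate `count` space-separated lines using only the 14 characters."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    lines = []
    for _ in range(count):
        n_fields = rng.randint(min_fields, max_fields)
        lines.append(" ".join(random_field(rng) for _ in range(n_fields)))
    return lines


def unexpected_chars(text):
    """Return the sorted characters in `text` outside the 14-character set.

    Whitespace is ignored. A non-empty result means the recognizer picked a
    character that E13B does not contain.
    """
    return sorted({ch for ch in text if not ch.isspace() and ch not in ALPHABET})


if __name__ == "__main__":
    for line in make_training_lines(5, seed=42):
        print(line)
    # The sample result quoted in the thread stays inside the alphabet,
    # so this check alone would not catch the stray trailing '<'.
    print(unexpected_chars("<01 :1901=1386:021= 1111001<10001< ;0000090134;"))
```

The generated lines could be appended to an existing training text before running tesstrain.sh. The character check only catches out-of-alphabet reads; an inserted in-alphabet symbol such as the extra '<' reported above would still need a layout-aware comparison against ground truth.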