It would be nice if MICR traineddata were already out there, but I didn't find any. I found free fonts and commercial OCR software, but no traineddata. The tessdata repository obviously doesn't have one either.
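For reference, here is a minimal sketch of how training text in the format quoted below could be generated. It is not the script actually used for the tests. The function names and the idea of reusing one sample line as a template are assumptions made for illustration. The four non-digit characters are the ASCII stand-ins that appear in the sample line.

```python
import random

# Sample line copied from the quoted message. The characters '<', ':', '='
# and ';' are the ASCII stand-ins for the four E13B symbols in that line.
TEMPLATE = "110004310510< <02 :4002=0181:801= 0008752 <00039 ;0000001000;"


def random_line(rng, template=TEMPLATE):
    """Return a copy of template with every digit replaced by a random digit.

    Symbols and spaces stay where they are, so the field layout is unchanged.
    """
    return "".join(
        rng.choice("0123456789") if ch.isdigit() else ch for ch in template
    )


def make_training_text(n_lines, seed=0):
    """Build n_lines random lines; a fixed seed keeps the output reproducible."""
    rng = random.Random(seed)
    return [random_line(rng) for _ in range(n_lines)]


if __name__ == "__main__":
    # Print a handful of lines; a real training text would use far more.
    for line in make_training_text(5):
        print(line)
```

Mixing several templates with different field lengths would be one way to add the variation asked about in the quoted question, though whether that improves accuracy is untested here.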
On Saturday, June 8, 2019 at 1:52:10 UTC+9, shree wrote:
>
> Please also search for existing MICR traineddata files.
>
> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <elmago...@gmail.com> wrote:
>
>> So I did several tests from scratch. In the last attempt, I made a
>> training text with 4,000 lines in the following format,
>>
>> 110004310510< <02 :4002=0181:801= 0008752 <00039 ;0000001000;
>>
>> and combined it with eng.digits.training_text in which symbols are
>> converted to E13B symbols. This makes about 12,000 lines of training
>> text. It's amazing that this thing generates a good reader out of
>> nowhere. But then it is not very good. For example:
>>
>> <01 :1901=1386:021= 1111001<10001< ;0000090134;
>>
>> is a result on the image attached. It's close, but the last '<' in the
>> result text doesn't exist on the image. It's a small failure, but it
>> causes greater trouble in parsing.
>>
>> What would you suggest from here to increase accuracy?
>>
>> - Increase the number of lines in the training text
>> - Mix up more variations in the training text
>> - Increase the number of iterations
>> - Investigate wrong reads one by one
>> - Or else?
>>
>> Also, I referred to engrestrict*.* and could get a similar result with
>> the fine-tuning-from-full method. It seems a bit faster to get to the same
>> level, but it also stops at a 'good' level. I can go either way if it
>> takes me to the bright future.
>>
>> Regards,
>> ElMagoElGato
>>
>> On Thursday, May 30, 2019 at 15:56:02 UTC+9, ElGato ElMago wrote:
>>>
>>> Thanks a lot, Shree. I'll look into it.
>>>
>>> On Thursday, May 30, 2019 at 14:39:52 UTC+9, shree wrote:
>>>>
>>>> See https://github.com/Shreeshrii/tessdata_shreetest
>>>>
>>>> Look at the files engrestrict*.* and also
>>>> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>>
>>>> Create training text of about 100 lines and finetune for 400 lines
>>>>
>>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>
>>>>> I had about 14 lines as attached. How many lines would you recommend?
>>>>>
>>>>> Fine tuning gives a much better result, but it tends to pick characters
>>>>> outside E13B, which has only 14 characters: 0 through 9 and 4
>>>>> symbols. I thought training from scratch would eliminate such confusion.
>>>>>
>>>>> On Thursday, May 30, 2019 at 10:43:08 UTC+9, shree wrote:
>>>>>>
>>>>>> For training from scratch a large training text and hundreds of
>>>>>> thousands of iterations are recommended.
>>>>>>
>>>>>> If you are just fine tuning for a font, try to follow the instructions
>>>>>> for training for Impact, with your font.
>>>>>>
>>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago, <elmago...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks, Shree.
>>>>>>>
>>>>>>> Yes, I saw the instruction.
>>>>>>> The steps I took are as follows:
>>>>>>>
>>>>>>> Using tesstrain.sh:
>>>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng \
>>>>>>>   --linedata_only \
>>>>>>>   --noextract_font_properties --langdata_dir ../langdata \
>>>>>>>   --tessdata_dir ./tessdata \
>>>>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>>>>
>>>>>>> Training from scratch:
>>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>>>>   --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
>>>>>>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>   --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log
>>>>>>>
>>>>>>> Test with base_checkpoint:
>>>>>>> src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>>>>>>
>>>>>>> Combining output files:
>>>>>>> src/training/lstmtraining --stop_training \
>>>>>>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>>>>>>
>>>>>>> Test with eng.traineddata:
>>>>>>> tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput
>>>>>>>
>>>>>>> The training from scratch ended as:
>>>>>>>
>>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, word train=0%, skip ratio=0%, New best char error = 0 wrote best model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint wrote checkpoint.
>>>>>>>
>>>>>>> The test with base_checkpoint returns nothing, as follows:
>>>>>>>
>>>>>>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>>>>>>
>>>>>>> The test with eng.traineddata and e13b.png returns out.txt. Both
>>>>>>> files are attached.
>>>>>>>
>>>>>>> Training seems to have worked fine. I don't know how to interpret
>>>>>>> the test result from base_checkpoint. The generated eng.traineddata
>>>>>>> obviously doesn't work well. I suspect the choice of --traineddata in
>>>>>>> combining output files is bad, but I have no clue.
>>>>>>>
>>>>>>> Regards,
>>>>>>> ElMagoElGato
>>>>>>>
>>>>>>> BTW, I referred to your tess4training in the process. It helped a
>>>>>>> lot.
>>>>>>>
>>>>>>> On Wednesday, May 29, 2019 at 19:14:08 UTC+9, shree wrote:
>>>>>>>>
>>>>>>>> see
>>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files
>>>>>>>>
>>>>>>>> On Wed, May 29, 2019 at 3:18 PM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I wish to make traineddata for the E13B font.
>>>>>>>>>
>>>>>>>>> I read the training tutorial and made a base_checkpoint file
>>>>>>>>> according to the method in Training From Scratch. Now, how can I
>>>>>>>>> make traineddata from the base_checkpoint file?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>>> send an email to tesser...@googlegroups.com.
>>>>>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>>>> To view this discussion on the web visit
>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/4848cfa5-ae2b-4be3-a771-686aa0fec702%40googlegroups.com.
>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>
>>>>>>>> --
>>>>>>>> ____________________________________________________________
>>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

--
To view this discussion on the web visit
https://groups.google.com/d/msgid/tesseract-ocr/09d3119c-d093-4269-bf3a-3ddb467ed0ed%40googlegroups.com.