See:

http://www.devscope.net/Content/ocrchecks.aspx
https://github.com/BigPino67/Tesseract-MICR-OCR
https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ
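On the stray '<' reported in the quoted message below: a cheap safeguard is to check each OCR line before parsing it. What follows is only a sketch in plain shell. The function names, the E13B_LAYOUT pattern and the field layout it encodes are assumptions inferred from the sample lines in this thread, not taken from a MICR specification or from any Tesseract feature.

```shell
#!/bin/sh
# ASCII stand-ins used in this thread for the 14 E13B glyphs:
# the digits 0-9 plus the four symbols < : = ;

# Print the characters of $1 that are outside that set (spaces are allowed).
# Returns 0 when the line is clean and 1 when stray characters were found.
stray_chars() {
    stray=$(printf '%s' "$1" | tr -d '0-9<:=; ')
    if [ -n "$stray" ]; then
        printf '%s\n' "$stray"
        return 1
    fi
    return 0
}

# Assumed field layout, inferred from the sample result line in this thread.
E13B_LAYOUT='^<[0-9]+ :[0-9]+=[0-9]+:[0-9]+= [0-9]+<[0-9]+ ;[0-9]+;$'

# Succeed only if line $1 matches the extended regex $2.
matches_layout() {
    printf '%s\n' "$1" | grep -Eq -- "$2"
}

# Example: the reported result has an extra '<' after 10001 and is rejected.
line='<01 :1901=1386:021= 1111001<10001< ;0000090134;'
if stray_chars "$line" >/dev/null && matches_layout "$line" "$E13B_LAYOUT"; then
    echo "accept: $line"
else
    echo "reject: $line"
fi
```

A line that fails either check can be re-read or set aside for review instead of being passed to the parser.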
On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <elmagoelg...@gmail.com> wrote:

> It would be nice if there were traineddata out there, but I didn't find any.
> I see free fonts and commercial OCR software but not traineddata. The
> tessdata repository obviously doesn't have one, either.
>
> On Saturday, June 8, 2019 at 1:52:10 AM UTC+9, shree wrote:
>>
>> Please also search for existing MICR traineddata files.
>>
>> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <elmago...@gmail.com> wrote:
>>
>>> So I did several tests from scratch. In the last attempt, I made a
>>> training text with 4,000 lines in the following format,
>>>
>>> 110004310510< <02 :4002=0181:801= 0008752 <00039 ;0000001000;
>>>
>>>
>>> and combined it with eng.digits.training_text in which symbols are
>>> converted to E13B symbols. This makes about 12,000 lines of training
>>> text. It's amazing that this thing generates a good reader out of
>>> nowhere. But then it is not very good. For example:
>>>
>>> <01 :1901=1386:021= 1111001<10001< ;0000090134;
>>>
>>> is a result on the image attached. It's close, but the last '<' in the
>>> result text doesn't exist on the image. It's a small failure, but it
>>> causes greater trouble in parsing.
>>>
>>> What would you suggest from here to increase accuracy?
>>>
>>> - Increase the number of lines in the training text
>>> - Mix up more variations in the training text
>>> - Increase the number of iterations
>>> - Investigate wrong reads one by one
>>> - Something else?
>>>
>>> Also, I referred to engrestrict*.* and could generate a similar result
>>> with the fine-tuning-from-full method. It seems a bit faster to get to the
>>> same level, but it also stops at a 'good' level. I can go either way
>>> if it takes me to a bright future.
>>>
>>> Regards,
>>> ElMagoElGato
>>>
>>> On Thursday, May 30, 2019 at 3:56:02 PM UTC+9, ElGato ElMago wrote:
>>>>
>>>> Thanks a lot, Shree. I'll look into it.
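For the option "mix up more variations in the training text" listed above, synthetic lines can be generated rather than typed by hand. The following is a minimal sketch, assuming the field layout of the sample training line quoted above and the same ASCII stand-ins for the four E13B symbols. gen_e13b_lines is a hypothetical helper, not part of Tesseract, and the fixed field widths are simply copied from that one sample.

```shell
#!/bin/sh
# Emit synthetic training lines shaped like the sample line in this thread.
# $1 = number of lines, $2 = random seed (optional, for repeatable output).
# The field widths are an assumption copied from that single sample.
gen_e13b_lines() {
    awk -v n="$1" -v seed="${2:-1}" '
    # Return a string of len random decimal digits.
    function digits(len,    s, i) {
        s = ""
        for (i = 0; i < len; i++) {
            s = s "" int(rand() * 10)
        }
        return s
    }
    BEGIN {
        srand(seed)
        for (k = 0; k < n; k++) {
            printf "%s< <%s :%s=%s:%s= %s <%s ;%s;\n",
                digits(12), digits(2), digits(4), digits(4),
                digits(3), digits(7), digits(5), digits(10)
        }
    }'
}

# Example: print three lines using seed 1.
gen_e13b_lines 3 1
```

Redirecting the output, for example gen_e13b_lines 4000 > eng.training_e13b_text, would give a file of the size mentioned above. Varying the field widths in the printf call is one way to add more variety.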
>>>>
>>>> On Thursday, May 30, 2019 at 2:39:52 PM UTC+9, shree wrote:
>>>>>
>>>>> See https://github.com/Shreeshrii/tessdata_shreetest
>>>>>
>>>>> Look at the files engrestrict*.* and also
>>>>> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>>>
>>>>> Create training text of about 100 lines and finetune for 400 lines
>>>>>
>>>>>
>>>>>
>>>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <elmago...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I had about 14 lines as attached. How many lines would you recommend?
>>>>>>
>>>>>> Fine-tuning gives a much better result, but it tends to pick
>>>>>> characters outside E13B, which has only 14 characters: 0 through 9
>>>>>> and 4 symbols. I thought training from scratch would eliminate such
>>>>>> confusion.
>>>>>>
>>>>>> On Thursday, May 30, 2019 at 10:43:08 AM UTC+9, shree wrote:
>>>>>>>
>>>>>>> For training from scratch, a large training text and hundreds of
>>>>>>> thousands of iterations are recommended.
>>>>>>>
>>>>>>> If you are just fine-tuning for a font, try to follow the instructions
>>>>>>> for training for Impact, with your font.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago, <elmago...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks, Shree.
>>>>>>>>
>>>>>>>> Yes, I saw the instruction.
>>>>>>>> The steps I took are as follows:
>>>>>>>>
>>>>>>>> Using tesstrain.sh:
>>>>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>>>>>>>>   --noextract_font_properties --langdata_dir ../langdata \
>>>>>>>>   --tessdata_dir ./tessdata \
>>>>>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>>>>>
>>>>>>>> Training from scratch:
>>>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>>>>>   --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
>>>>>>>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>   --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log
>>>>>>>>
>>>>>>>> Test with base_checkpoint:
>>>>>>>> src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>>>>>>>
>>>>>>>> Combining output files:
>>>>>>>> src/training/lstmtraining --stop_training \
>>>>>>>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>>>>>>>
>>>>>>>> Test with eng.traineddata:
>>>>>>>> tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput
>>>>>>>>
>>>>>>>>
>>>>>>>> The training from scratch ended as:
>>>>>>>>
>>>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char
>>>>>>>> train=0%, word train=0%, skip ratio=0%, New best char error = 0 wrote
>>>>>>>> best model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint
>>>>>>>> wrote checkpoint.
>>>>>>>>
>>>>>>>>
>>>>>>>> The test with base_checkpoint returns nothing, as follows:
>>>>>>>>
>>>>>>>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>>>>>>>
>>>>>>>>
>>>>>>>> The test with eng.traineddata and e13b.png returns out.txt. Both
>>>>>>>> files are attached.
>>>>>>>>
>>>>>>>> Training seems to have worked fine. I don't know how to interpret
>>>>>>>> the test result from base_checkpoint. The generated eng.traineddata
>>>>>>>> obviously doesn't work well. I suspect the choice of --traineddata in
>>>>>>>> combining output files is bad, but I have no clue.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> ElMagoElGato
>>>>>>>>
>>>>>>>> BTW, I referred to your tess4training in the process. It helped a
>>>>>>>> lot.
>>>>>>>>
>>>>>>>> On Wednesday, May 29, 2019 at 7:14:08 PM UTC+9, shree wrote:
>>>>>>>>>
>>>>>>>>> see
>>>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files
>>>>>>>>>
>>>>>>>>> On Wed, May 29, 2019 at 3:18 PM ElGato ElMago <elmago...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I wish to make traineddata for the E13B font.
>>>>>>>>>>
>>>>>>>>>> I read the training tutorial and made a base_checkpoint file
>>>>>>>>>> according to the method in Training From Scratch. Now, how can I
>>>>>>>>>> make traineddata from the base_checkpoint file?
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> You received this message because you are subscribed to the
>>>>>>>>>> Google Groups "tesseract-ocr" group.
>>>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>>>> send an email to tesser...@googlegroups.com.
>>>>>>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>>>>> To view this discussion on the web visit
>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/4848cfa5-ae2b-4be3-a771-686aa0fec702%40googlegroups.com
>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/4848cfa5-ae2b-4be3-a771-686aa0fec702%40googlegroups.com?utm_medium=email&utm_source=footer>.
>>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> ____________________________________________________________
>>>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
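To follow how training progresses, the progress lines that lstmtraining writes to its log (such as the "At iteration 561/2500/2500 ..." line quoted earlier in this thread) can be reduced to two columns. This is a small sketch, assuming each progress entry sits on a single line of the log in the format quoted. char_train_from_log is a hypothetical helper name.

```shell
#!/bin/sh
# Print "iteration char_train_percent" for every progress line found in the
# given log files, or on standard input when no file is given.
# Lines that do not match the quoted progress format are skipped.
char_train_from_log() {
    sed -n 's|^At iteration \([0-9][0-9]*\)/.*char train=\([0-9.][0-9.]*\)%.*|\1 \2|p' "$@"
}

# Example using the progress line quoted in this thread.
printf '%s\n' 'At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, word train=0%, skip ratio=0%, New best char error = 0' |
    char_train_from_log
```

Run against the log written by the training command above, this would be char_train_from_log ~/tesstutorial/e13boutput/basetrain.log.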