Hi, I'm thinking of sharing it, of course. What is the best way to do it? After all this, my only contribution is how I prepared the training text, and even that consists of Shree's text and mine. The instructions and tools I used already exist.
ElMagoElGato

On Wednesday, August 7, 2019 at 8:20:02 AM UTC+9, Mamadou wrote:

Hello,
Are you planning to release the dataset or models? I'm working on the same subject and planning to share both under BSD terms.

On Tuesday, August 6, 2019 at 10:11:40 AM UTC+2, ElGato ElMago wrote:

Hi,

FWIW, I got to the point where I can feel happy with the accuracy. As the images in the previous post show, the symbols, especially the on-us symbol and the amount symbol, were causing mix-ups with each other or with other characters. I added many more symbols to the training text and formed words that start with a symbol. One example is as follows:

9;:;=;<;< <0<1<3<4;6;8;9;:;=;

I randomly made 8,000 lines like this. In fine-tuning from eng, 5,000 iterations were almost good. The amount symbol is still confused a little when it's followed by 0. Fine-tuning tends to get dragged off by small details. I'll have to think of something to make further improvements.

Training from scratch produced a slightly more stable traineddata. It doesn't get confused by symbols so often but tends to generate extra spaces. By 10,000 iterations, those spaces are gone and recognition became very solid.

I thought I might have to do image and box file training, but I guess it's not needed this time.

ElMagoElGato

On Friday, July 26, 2019 at 2:08:06 PM UTC+9, ElGato ElMago wrote:

Hi,

Well, I read the description of ScrollView (https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging) and it says:

"To show the characters, deselect DISPLAY/Bounding Boxes, select DISPLAY/Polygonal Approx and then select OTHER/Uniform display."

It basically works. But for some reason, it doesn't work on my e13b image and ends up with a blue screen. Anyway, it shows each box separately when a character consists of multiple boxes. I'd like to show the box for the whole character. ScrollView doesn't do that, at least not yet. I'll do it on my own.

ElMagoElGato

On Wednesday, July 24, 2019 at 2:10:46 PM UTC+9, ElGato ElMago wrote:

Hi,

I got this result from hocr. This is where one of the phantom characters comes from.

<span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 98.864532'><</span>
<span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; x_conf 99.018097'>;</span>

The first character is the phantom. It starts at the same x position as the second character, which actually exists, and is only 3 pixels wide. I attach ScrollView screenshots that visualize this.

[image: 2019-07-24-132643_854x707_scrot.png] [image: 2019-07-24-132800_854x707_scrot.png]

There seem to be some more cases that cause phantom characters. I'll look into them. But I have a trivial question now. I made ScrollView show these displays by accidentally clicking the Display->Blamer menu. There is a Bounding Boxes menu below it, but it ends up showing a blue screen, though it briefly shows boxes on the way. Can I use this menu at all? It would be very useful.

[image: 2019-07-24-140739_854x707_scrot.png]

On Tuesday, July 23, 2019 at 5:10:36 PM UTC+9, ElGato ElMago wrote:

It's great! Perfect! Thanks a lot!
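Regarding the 3-pixel-wide phantom box in the hOCR snippet above, a minimal sketch of flagging such narrow character boxes as phantom candidates; the 5-pixel threshold and the out.hocr file name are assumptions, not from the thread:

  # Pull the per-character x_bboxes spans (format shown above) out of the hOCR
  # output and report any box whose width is suspiciously small.
  grep -o "x_bboxes [0-9]* [0-9]* [0-9]* [0-9]*" out.hocr |
    awk '{ w = $4 - $2; if (w < 5) print "phantom candidate:", $0, "(width " w ")" }'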
On Tuesday, July 23, 2019 at 10:56:58 AM UTC+9, shree wrote:

See https://github.com/tesseract-ocr/tesseract/issues/2580

On Tue, 23 Jul 2019, 06:23 ElGato ElMago, <elmago...@gmail.com> wrote:

Hi,

I read the output of hocr with lstm_choice_mode = 4, as described in pull request 2554. It shows the candidates for each character but doesn't show the bounding box of each character. It only shows the box for a whole word.

I see bounding boxes for each character in the comments of pull request 2576. How can I do that? Do I have to look into the source code and produce such output on my own?

On Friday, July 19, 2019 at 6:40:49 PM UTC+9, ElGato ElMago wrote:

Lorenzo,

I haven't been checking psm too much. I will turn to those options after I see how it goes with bounding boxes.

Shree,

I see the merges in the git log and also see that the new option lstm_choice_amount works now. I guess my executable is the latest, though I still see the phantom character. Hocr produces huge and complex output. I'll take some time to read it.

On Friday, July 19, 2019 at 6:20:55 PM UTC+9, Claudiu wrote:

Is there any way to pass bounding boxes to the LSTM? We have an algorithm that cleanly gets bounding boxes of MRZ characters. However, the results using psm 10 are worse than passing the whole line in. Yet when we pass the whole line in, we get these phantom characters.

Should PSM 10 mode work? It often returns "no character" where there clearly is one. I can supply a test case if it is expected to work well.

On Fri, Jul 19, 2019 at 11:06 AM ElGato ElMago <elmago...@gmail.com> wrote:

Lorenzo,

We have both got the same case. It seems a solution to this problem would save a lot of people.

Shree,

I pulled the current head of the master branch, but it doesn't seem to contain the merges you pointed to, which were merged 3 to 4 days ago. How can I get them?

ElMagoElGato
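For reference, the kind of hOCR run discussed above can be produced like this; the file names are placeholders, lstm_choice_mode=4 is the value used in this thread (newer releases may use different values), and --psm 7 treats the image as a single text line:

  # Single text line, hOCR plus plain text output with per-character choices.
  tesseract e13b.png out --psm 7 -c lstm_choice_mode=4 hocr txt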
On Friday, July 19, 2019 at 5:02:53 PM UTC+9, Lorenzo Blz wrote:

PSM 7 was a partial solution for my specific case: it improved the situation but did not solve it. Also, I could not use it in some other cases.

The proper solution is very likely doing more training with more data; some data augmentation might help if data is scarce. Also, doing less training might help if the training is not done correctly.

There are also similar issues on GitHub:

https://github.com/tesseract-ocr/tesseract/issues/1465
...

The LSTM engine works like this: it scans the image and for each "pixel column" emits something like:

M M M M N M M M [BLANK] F F F F

(here I report only the highest-probability characters)

In the example above, an M is partially seen as an N. This is normal, and another step of the algorithm (beam search, I think) tries to aggregate the correct characters back together.

I think cases like this:

M M M N N N M M

are what give the phantom characters. More training should reduce the source of the problem, or a painful analysis of the bounding boxes might fix some cases.

I used the attached script for the boxes.

Lorenzo

On Fri, Jul 19, 2019 at 07:25, ElGato ElMago <elmago...@gmail.com> wrote:

Hi,

Let's call them phantom characters then.

Was psm 7 the solution for issue 1778? None of the psm options solved my problem, though I do see different output.

I use tesseract 5.0-alpha mostly, but 4.1 showed the same results anyway. How did you get the bounding box for each character? Alto and lstmbox only show a bbox for a group of characters.

ElMagoElGato

On Wednesday, July 17, 2019 at 6:58:31 PM UTC+9, Lorenzo Blz wrote:

Phantom characters here for me too:

https://github.com/tesseract-ocr/tesseract/issues/1778

Are you using 4.1? Bounding boxes were fixed in 4.1, so maybe this was also improved.

I wrote some code that uses the symbol iterator to discard symbols that are clearly duplicated: too small, overlapping, etc. But it was not easy to make it work decently, and it is not 100% reliable, with false negatives and positives. I cannot share the code, and it is quite ugly anyway.

Here is another MRZ model with training data:

https://github.com/DoubangoTelecom/tesseractMRZ

Lorenzo

On Wed, Jul 17, 2019 at 11:26, Claudiu <csaf...@gmail.com> wrote:

I'm getting the "phantom character" issue as well, using the OCRB that Shree trained on MRZ lines. For example, for a 0 it will sometimes add both a 0 and an O to the output, thus outputting 45 characters total instead of 44. I haven't looked at the bounding box output yet, but I suspect a phantom thin character is added somewhere that I can discard... or maybe two chars will have the same bounding box. If anyone else has fixed this issue further up (e.g. so the output doesn't contain the phantom characters in the first place), I'd be interested.
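A rough illustration of the per-column decoding Lorenzo describes above, using his example sequence; the collapse rule here (drop [BLANK], merge adjacent repeats) is a simplification of the real beam search:

  # "M M M N N N M M" collapses to "MNM" instead of "M": the brief flip to N in
  # the middle of one character is what shows up as an extra, phantom character.
  echo "M M M N N N M M [BLANK] F F F F" |
    awk '{ prev=""; out=""; for (i=1; i<=NF; i++) { if ($i != prev && $i != "[BLANK]") out = out $i; prev = $i } print out }'
  # prints: MNMF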
On Wed, Jul 17, 2019 at 10:01 AM ElGato ElMago <elmago...@gmail.com> wrote:

Hi,

I'll go back to more training later. Before doing so, I'd like to investigate the results a little bit. The hocr and lstmbox options give some details of the positions of characters. The results show positions that correspond perfectly to the letters in the image. But the text output contains a character that obviously does not exist.

Then I found a config file, 'lstmdebug', that generates far more information. I hope it explains what happened with each character. I'm yet to read the debug output, but I'd appreciate it if someone could tell me how to read it, because it's really complex.

Regards,
ElMagoElGato

On Friday, June 14, 2019 at 7:58:49 PM UTC+9, shree wrote:

See https://github.com/Shreeshrii/tessdata_MICR

I have uploaded my files there.

https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh is the bash script that runs the training.

You can modify it as needed. Please note this is for legacy/base tesseract --oem 0.

On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago <elmago...@gmail.com> wrote:

Thanks a lot, shree. It seems you know everything.

I tried the MICR0.traineddata and the first two mcr.traineddata. The last one was blocked by the browser. Each of the traineddata files had mixed results. All of them get the symbols fairly well but insert spaces randomly and read some numbers wrong.

MICR0 seems the best among them. Did you suggest that you'd be able to update it? It gets a triple D very often where there's only one, and so on.

Also, I tried to fine-tune from MICR0, but I found that I need to change language-specific.sh. It specifies some parameters for each language. Do you have any guidance for it?

On Friday, June 14, 2019 at 1:48:40 AM UTC+9, shree wrote:

See
http://www.devscope.net/Content/ocrchecks.aspx
https://github.com/BigPino67/Tesseract-MICR-OCR
https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ

On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <elmago...@gmail.com> wrote:

It would be nice if there were traineddata out there, but I didn't find any. I see free fonts and commercial OCR software but no traineddata. The tessdata repository obviously doesn't have one, either.
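For reference, a hedged sketch of trying one of the existing traineddata files mentioned above with the legacy engine; the directory layout, the MICR0 language name, and the choice of --psm 7 (single text line) are assumptions:

  # Assumes MICR0.traineddata was downloaded to the current directory and was
  # built for the legacy engine, hence --oem 0.
  mkdir -p ./micr_tessdata
  cp MICR0.traineddata ./micr_tessdata/
  tesseract check_line.png out --tessdata-dir ./micr_tessdata -l MICR0 --oem 0 --psm 7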
On Saturday, June 8, 2019 at 1:52:10 AM UTC+9, shree wrote:

Please also search for existing MICR traineddata files.

On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <elmago...@gmail.com> wrote:

So I did several tests from scratch. In the last attempt, I made a training text with 4,000 lines in the following format,

110004310510< <02 :4002=0181:801= 0008752 <00039 ;0000001000;

and combined it with eng.digits.training_text, in which the symbols are converted to E13B symbols. This makes about 12,000 lines of training text. It's amazing that this thing generates a good reader out of nowhere. But then it is not very good. For example:

<01 :1901=1386:021= 1111001<10001< ;0000090134;

is the result on the attached image. It's close, but the last '<' in the result text doesn't exist in the image. It's a small failure, but it causes greater trouble in parsing.

What would you suggest from here to increase accuracy?

- Increase the number of lines in the training text
- Mix more variations into the training text
- Increase the number of iterations
- Investigate wrong reads one by one
- Or something else?

Also, I referred to engrestrict*.* and could generate a similar result with the fine-tuning-from-full method. It seems a bit faster to get to the same level, but it also stops at a 'good' level. I can go with either way if it takes me to the bright future.

Regards,
ElMagoElGato

On Thursday, May 30, 2019 at 3:56:02 PM UTC+9, ElGato ElMago wrote:

Thanks a lot, Shree. I'll look into it.

On Thursday, May 30, 2019 at 2:39:52 PM UTC+9, shree wrote:

See https://github.com/Shreeshrii/tessdata_shreetest

Look at the files engrestrict*.* and also
https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text

Create a training text of about 100 lines and fine-tune for 400 lines.
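As a rough sketch (not the actual preparation script), lines in the format of the 4,000-line training text above could be generated like this; the field widths and the ; : < = symbol characters follow the samples in this thread, and real check layouts will vary:

  # Emit N random MICR-like lines: digits plus the four E13B symbol characters,
  # with symbol-delimited fields shaped like the sample line above.
  N=4000
  rand_digits() { tr -dc '0123456789' </dev/urandom | head -c "$1"; }
  for i in $(seq "$N"); do
    printf '%s< <%s :%s=%s:%s= %s <%s ;%s;\n' \
      "$(rand_digits 12)" "$(rand_digits 2)" "$(rand_digits 4)" "$(rand_digits 4)" \
      "$(rand_digits 3)" "$(rand_digits 7)" "$(rand_digits 5)" "$(rand_digits 10)"
  done > e13b.training_text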
On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <elmago...@gmail.com> wrote:

I had about 14 lines, as attached. How many lines would you recommend?

Fine-tuning gives a much better result, but it tends to pick characters other than those in E13B, which only has 14 characters: 0 through 9 and 4 symbols. I thought training from scratch would eliminate such confusion.

On Thursday, May 30, 2019 at 10:43:08 AM UTC+9, shree wrote:

For training from scratch, a large training text and hundreds of thousands of iterations are recommended.

If you are just fine-tuning for a font, try to follow the instructions for training for Impact, with your font.

On Thu, 30 May 2019, 06:05 ElGato ElMago, <elmago...@gmail.com> wrote:

Thanks, Shree.

Yes, I saw the instructions. The steps I took are as follows:

Using tesstrain.sh:

  src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
    --noextract_font_properties --langdata_dir ../langdata \
    --tessdata_dir ./tessdata \
    --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
    --training_text ../langdata/eng/eng.training_e13b_text

Training from scratch:

  mkdir -p ~/tesstutorial/e13boutput
  src/training/lstmtraining --debug_interval 100 \
    --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
    --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
    --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
    --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
    --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
    --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log

Test with base_checkpoint:

  src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
    --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
    --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt

Combining output files:

  src/training/lstmtraining --stop_training \
    --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
    --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
    --model_output ~/tesstutorial/e13boutput/eng.traineddata

Test with eng.traineddata:

  tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput

The training from scratch ended with:

  At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, word train=0%, skip ratio=0%, New best char error = 0 wrote best model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint wrote checkpoint.

The test with base_checkpoint returns nothing but:

  At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0

The test with eng.traineddata and e13b.png returns out.txt. Both files are attached.

Training seems to have worked fine. I don't know how to interpret the test result from base_checkpoint. The generated eng.traineddata obviously doesn't work well. I suspect the choice of --traineddata in the combining step is bad, but I have no clue.

Regards,
ElMagoElGato

BTW, I referred to your tess4training in the process. It helped a lot.

On Wednesday, May 29, 2019 at 7:14:08 PM UTC+9, shree wrote:

See https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files

On Wed, May 29, 2019 at 3:18 PM ElGato ElMago <elmago...@gmail.com> wrote:

Hi,

I wish to make a traineddata for the E13B font.

I read the training tutorial and made a base_checkpoint file according to the method in Training From Scratch. Now, how can I make a traineddata from the base_checkpoint file?
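Coming back to the suspicion about the combining step above: one hedged way to check it is to evaluate the combined traineddata itself against the same list file; the training wiki suggests lstmeval also accepts a full traineddata file as --model, though that is worth verifying for your version:

  # If this reports a low error rate while plain `tesseract` does not, the
  # problem is likely elsewhere than in the combining step.
  src/training/lstmeval --model ~/tesstutorial/e13boutput/eng.traineddata \
    --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt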