Hi,

FWIW, I've reached the point where I'm happy with the accuracy. As the images in the previous post show, the symbols, especially the on-us symbol and the amount symbol, were being confused with each other or with other characters. I added many more symbols to the training text and formed words that start with a symbol. One example is as follows:

9;:;=;<;< <0<1<3<4;6;8;9;:;=;

I randomly generated 8,000 lines like this. When fine-tuning from eng, 5,000 iterations was almost good enough. The amount symbol is still confused a little when it's followed by 0. Fine-tuning tends to be thrown off by small particles. I'll have to think of something to improve it further.

Training from scratch produced somewhat more stable traineddata. It doesn't confuse the symbols as often but tends to generate extra spaces. By 10,000 iterations, those spaces were gone and recognition became very solid.

I thought I might have to do image and box file training, but I guess it's not needed this time.

ElMagoElGato

On Friday, July 26, 2019 at 14:08:06 UTC+9, ElGato ElMago wrote:
>
> Hi,
>
> Well, I read the description of ScrollView (https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging) and it says:
>
> To show the characters, deselect DISPLAY/Bounding Boxes, select DISPLAY/Polygonal Approx and then select OTHER/Uniform display.
>
> It basically works. But for some reason, it doesn't work on my e13b image and ends up with a blue screen. Anyway, it shows each box separately when a character consists of multiple boxes. I'd like to show the box for the whole character. ScrollView doesn't do that, at least not yet. I'll do it on my own.
>
> ElMagoElGato
>
> On Wednesday, July 24, 2019 at 14:10:46 UTC+9, ElGato ElMago wrote:
>>
>> Hi,
>>
>> I got this result from hocr. This is where one of the phantom characters comes from.
>>
>> <span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 98.864532'><</span>
>> <span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; x_conf 99.018097'>;</span>
>>
>> The first character is the phantom. It starts at the same x position as the second character, which does exist. The first character is only 3 points wide. I attach ScrollView screenshots that visualize this.
>>
>> [image: 2019-07-24-132643_854x707_scrot.png][image: 2019-07-24-132800_854x707_scrot.png]
>>
>> There seem to be some more cases that cause phantom characters. I'll look into them. But I have a trivial question now. I made ScrollView show these displays by accidentally clicking the Display->Blamer menu. There is a Bounding Boxes menu below it, but it ends up showing a blue screen, though it briefly shows boxes on the way. Can I use this menu at all? It would be very useful.
>>
>> [image: 2019-07-24-140739_854x707_scrot.png]
>>
>> On Tuesday, July 23, 2019 at 17:10:36 UTC+9, ElGato ElMago wrote:
>>>
>>> It's great! Perfect! Thanks a lot!
>>>
>>> On Tuesday, July 23, 2019 at 10:56:58 UTC+9, shree wrote:
>>>>
>>>> See https://github.com/tesseract-ocr/tesseract/issues/2580
>>>>
>>>> On Tue, 23 Jul 2019, 06:23 ElGato ElMago, <elmago...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I read the hocr output with lstm_choice_mode = 4, as in pull request 2554. It shows the candidates for each character but doesn't show the bounding box of each character. It only shows the box for a whole word.
>>>>>
>>>>> I see bounding boxes for each character in the comments of pull request 2576. How can I do that? Do I have to look in the source code and produce such output on my own?
>>>>>
>>>>> On Friday, July 19, 2019 at 18:40:49 UTC+9, ElGato ElMago wrote:
>>>>>
>>>>>> Lorenzo,
>>>>>>
>>>>>> I haven't been checking psm too much. I'll turn to those options after I see how it goes with bounding boxes.
>>>>>>
>>>>>> Shree,
>>>>>>
>>>>>> I see the merges in the git log and also see that the new option lstm_choice_amount works now. I guess my executable is the latest, though I still see the phantom character. Hocr makes huge and complex output. I'll take some time to read it.
>>>>>>
>>>>>> On Friday, July 19, 2019 at 18:20:55 UTC+9, Claudiu wrote:
>>>>>>>
>>>>>>> Is there any way to pass bounding boxes to use to the LSTM? We have an algorithm that cleanly gets bounding boxes of MRZ characters.
>>>>>>> However, the results using psm 10 are worse than passing the whole line in. Yet when we pass the whole line in we get these phantom characters.
>>>>>>>
>>>>>>> Should PSM 10 mode work? It often returns “no character” where there clearly is one. I can supply a test case if it is expected to work well.
>>>>>>>
>>>>>>> On Fri, Jul 19, 2019 at 11:06 AM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Lorenzo,
>>>>>>>>
>>>>>>>> We both have the same case. It seems a solution to this problem would save a lot of people.
>>>>>>>>
>>>>>>>> Shree,
>>>>>>>>
>>>>>>>> I pulled the current head of the master branch but it doesn't seem to contain the merges you pointed to, which were merged 3 to 4 days ago. How can I get them?
>>>>>>>>
>>>>>>>> ElMagoElGato
>>>>>>>>
>>>>>>>> On Friday, July 19, 2019 at 17:02:53 UTC+9, Lorenzo Blz wrote:
>>>>>>>>>
>>>>>>>>> PSM 7 was a partial solution for my specific case; it improved the situation but did not solve it. Also, I could not use it in some other cases.
>>>>>>>>>
>>>>>>>>> The proper solution is very likely doing more training with more data; some data augmentation might help if data is scarce. Also, doing less training might help if the training is not done correctly.
>>>>>>>>>
>>>>>>>>> There are also similar issues on github:
>>>>>>>>>
>>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1465
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>> The LSTM engine works like this: it scans the image and for each "pixel column" does this:
>>>>>>>>>
>>>>>>>>> M M M M N M M M [BLANK] F F F F
>>>>>>>>>
>>>>>>>>> (here I report only the highest probability characters)
>>>>>>>>>
>>>>>>>>> In the example above an M is partially seen as an N; this is normal, and another step of the algorithm (beam search, I think) tries to aggregate back the correct characters.
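To make the per-column picture concrete, here is a minimal sketch of a naive greedy collapse (merge consecutive repeats, then drop blanks). It is not Tesseract's actual decoder, which the message above suggests is a beam search; the `BLANK` token and the `collapse` function are illustrative names only. It shows how a single stray column label can split one character into three if nothing aggregates it back.

```python
from itertools import groupby

BLANK = "[BLANK]"  # illustrative blank token, matching the notation used above


def collapse(columns):
    """Naive greedy collapse: merge consecutive repeats, then drop blanks."""
    return "".join(label for label, _ in groupby(columns) if label != BLANK)


# Per-column best labels: a clean read, and the read quoted above with one stray N.
clean = "M M M M M M M M [BLANK] F F F F".split()
noisy = "M M M M N M M M [BLANK] F F F F".split()

print(collapse(clean))  # MF
print(collapse(noisy))  # MNMF
```

With the naive collapse, the stray `N` yields `MNMF` instead of `MF`, which is the kind of extra output a smarter decoding step has to suppress.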
>>>>>>>>>
>>>>>>>>> I think cases like this:
>>>>>>>>>
>>>>>>>>> M M M N N N M M
>>>>>>>>>
>>>>>>>>> are what give the phantom characters. More training should reduce the source of the problem, or a painful analysis of the bounding boxes might fix some cases.
>>>>>>>>>
>>>>>>>>> I used the attached script for the boxes.
>>>>>>>>>
>>>>>>>>> Lorenzo
>>>>>>>>>
>>>>>>>>> On Fri, 19 Jul 2019 at 07:25, ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Let's call them phantom characters then.
>>>>>>>>>>
>>>>>>>>>> Was psm 7 the solution for issue 1778? None of the psm options solved my problem, though I see different output.
>>>>>>>>>>
>>>>>>>>>> I mostly use tesseract 5.0-alpha, but 4.1 showed the same results anyway. How did you get the bounding box for each character? Alto and lstmbox only show the bbox for a group of characters.
>>>>>>>>>>
>>>>>>>>>> ElMagoElGato
>>>>>>>>>>
>>>>>>>>>> On Wednesday, July 17, 2019 at 18:58:31 UTC+9, Lorenzo Blz wrote:
>>>>>>>>>>
>>>>>>>>>>> Phantom characters here for me too:
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1778
>>>>>>>>>>>
>>>>>>>>>>> Are you using 4.1? Bounding boxes were fixed in 4.1; maybe this was also improved.
>>>>>>>>>>>
>>>>>>>>>>> I wrote some code that uses the symbol iterator to discard symbols that are clearly duplicated: too small, overlapping, etc. But it was not easy to make it work decently and it is not 100% reliable, with false negatives and positives. I cannot share the code and it is quite ugly anyway.
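Since that code was not shared, the following is only a rough sketch of the general idea (discarding symbols that are too narrow or that heavily overlap a neighbour), not a reconstruction of it. The `Symbol` type, `drop_phantoms`, and the `min_width` and `max_overlap` thresholds are all hypothetical; the sample boxes are the two `x_bboxes` values from the hocr output quoted elsewhere in this thread.

```python
from typing import List, NamedTuple


class Symbol(NamedTuple):
    text: str
    x0: int
    y0: int
    x1: int
    y1: int
    conf: float


def drop_phantoms(symbols: List[Symbol], min_width: int = 5,
                  max_overlap: float = 0.5) -> List[Symbol]:
    """Drop symbols that are too narrow or mostly overlap the previously kept one.

    Both thresholds are guesses and would need tuning per font and resolution.
    """
    kept: List[Symbol] = []
    for s in symbols:
        width = s.x1 - s.x0
        if width < min_width:
            continue  # too thin to be a real character
        if kept:
            prev = kept[-1]
            overlap = min(prev.x1, s.x1) - max(prev.x0, s.x0)
            if overlap > max_overlap * min(width, prev.x1 - prev.x0):
                # Two boxes cover nearly the same area: keep the more confident one.
                if s.conf > prev.conf:
                    kept[-1] = s
                continue
        kept.append(s)
    return kept


# The two character boxes reported in the hocr output in this thread.
symbols = [
    Symbol("<", 1259, 902, 1262, 933, 98.864532),
    Symbol(";", 1259, 904, 1281, 933, 99.018097),
]
print([s.text for s in drop_phantoms(symbols)])  # [';']
```

As the message above warns, a filter like this can produce both false positives and false negatives, so it is a workaround rather than a fix for the underlying recognition problem.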
>>>>>>>>>>>
>>>>>>>>>>> Here is another MRZ model with training data:
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/DoubangoTelecom/tesseractMRZ
>>>>>>>>>>>
>>>>>>>>>>> Lorenzo
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 17 Jul 2019 at 11:26, Claudiu <csaf...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I’m getting the “phantom character” issue as well using the OCRB that Shree trained on MRZ lines. For example, for a 0 it will sometimes add both a 0 and an O to the output, thus outputting 45 characters total instead of 44. I haven’t looked at the bounding box output yet, but I suspect a phantom thin character is added somewhere that I can discard, or maybe two chars will have the same bounding box. If anyone else has fixed this issue further up (e.g. so the output doesn’t contain the phantom characters in the first place), I’d be interested.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:01 AM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'll go back to more training later. Before doing so, I'd like to investigate the results a little bit. The hocr and lstmbox options give some details of the positions of characters. The results show positions that perfectly correspond to letters in the image. But the text output contains a character that obviously does not exist.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Then I found a config file 'lstmdebug' that generates far more information. I hope it explains what happened with each character.
>>>>>>>>>>>>> I've yet to read the debug output, but I'd appreciate it if someone could tell me how to read it because it's really complex.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Friday, June 14, 2019 at 19:58:49 UTC+9, shree wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> See https://github.com/Shreeshrii/tessdata_MICR
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have uploaded my files there.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh is the bash script that runs the training.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You can modify as needed. Please note this is for legacy/base tesseract --oem 0.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks a lot, shree. It seems you know everything.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I tried the MICR0.traineddata and the first two mcr.traineddata. The last one was blocked by the browser. Each of the traineddata files had mixed results. All of them get the symbols fairly well but insert spaces randomly and read some numbers wrong.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> MICR0 seems the best among them. Did you suggest that you'd be able to update it? It very often gets a triple D where there's only one, and so on.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also, I tried to fine-tune from MICR0 but I found that I need to change language-specific.sh. It specifies some parameters for each language. Do you have any guidance for it?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Friday, June 14, 2019 at 1:48:40 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> see
>>>>>>>>>>>>>>>> http://www.devscope.net/Content/ocrchecks.aspx
>>>>>>>>>>>>>>>> https://github.com/BigPino67/Tesseract-MICR-OCR
>>>>>>>>>>>>>>>> https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It would be nice if there were traineddata out there, but I didn't find any. I see free fonts and commercial OCR software but not traineddata. The tessdata repository obviously doesn't have one, either.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Saturday, June 8, 2019 at 1:52:10 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Please also search for existing MICR traineddata files.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> So I did several tests from scratch. In the last attempt, I made a training text with 4,000 lines in the following format,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 110004310510< <02 :4002=0181:801= 0008752 <00039 ;0000001000;
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> and combined it with eng.digits.training_text in which symbols are converted to E13B symbols. This makes about 12,000 lines of training text. It's amazing that this thing generates a good reader out of nowhere. But then it is not very good. For example:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> <01 :1901=1386:021= 1111001<10001< ;0000090134;
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> is a result on the image attached.
>>>>>>>>>>>>>>>>>>> It's close, but the last '<' in the result text doesn't exist in the image. It's a small failure but it causes greater trouble in parsing.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> What would you suggest from here to increase accuracy?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> - Increase the number of lines in the training text
>>>>>>>>>>>>>>>>>>> - Mix up more variations in the training text
>>>>>>>>>>>>>>>>>>> - Increase the number of iterations
>>>>>>>>>>>>>>>>>>> - Investigate wrong reads one by one
>>>>>>>>>>>>>>>>>>> - Or something else?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Also, I referred to engrestrict*.* and could generate a similar result with the fine-tuning-from-full method. It seems a bit faster to get to the same level but it also stops at a 'good' level. I can go either way if it takes me to the bright future.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 15:56:02 UTC+9, ElGato ElMago wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks a lot, Shree. I'll look into it.
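The thread mentions randomly generated training lines made of words that start with a symbol. A minimal sketch of one way to produce such lines follows. It assumes, as the quoted samples suggest, that the digits 0-9 plus the four characters `:`, `;`, `<`, `=` stand in for the 14 E13B characters; the word lengths, words per line, seed, and function names are hypothetical choices, not the poster's actual script.

```python
import random

DIGITS = "0123456789"
SYMBOLS = ":;<="  # assumed stand-ins for the four E13B symbols, as in the quoted samples


def make_word(rng, min_len=2, max_len=12):
    """Build one word that starts with a symbol, followed by random digits and symbols."""
    length = rng.randint(min_len, max_len)
    rest = "".join(rng.choice(DIGITS + SYMBOLS) for _ in range(length - 1))
    return rng.choice(SYMBOLS) + rest


def make_lines(n_lines, words_per_line=4, seed=0):
    """Return n_lines lines of space-separated random words (seeded for repeatability)."""
    rng = random.Random(seed)
    return [
        " ".join(make_word(rng) for _ in range(words_per_line))
        for _ in range(n_lines)
    ]


lines = make_lines(8000)
print(lines[0])
```

Writing `lines` to a file, one per line, would give a training text in the same spirit as the ones described in this thread; the mix of digits and symbols would likely need adjusting to match real MICR lines.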
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 14:39:52 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> See https://github.com/Shreeshrii/tessdata_shreetest
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Look at the files engrestrict*.* and also https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Create training text of about 100 lines and finetune for 400 lines
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I had about 14 lines as attached. How many lines would you recommend?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Fine-tuning gives a much better result, but it tends to pick characters outside E13B, which has only 14 characters: 0 through 9 and 4 symbols. I thought training from scratch would eliminate such confusion.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 10:43:08 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> For training from scratch, a large training text and hundreds of thousands of iterations are recommended.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> If you are just fine-tuning for a font, try to follow the instructions for training for Impact, with your font.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago, <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks, Shree.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Yes, I saw the instruction.
>>>>>>>>>>>>>>>>>>>>>>>> The steps I made are as follows:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Using tesstrain.sh:
>>>>>>>>>>>>>>>>>>>>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>>>>>>>>>>>>>>>>>>>>>>>>   --noextract_font_properties --langdata_dir ../langdata \
>>>>>>>>>>>>>>>>>>>>>>>>   --tessdata_dir ./tessdata \
>>>>>>>>>>>>>>>>>>>>>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>>>>>>>>>>>>>>>>>>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Training from scratch:
>>>>>>>>>>>>>>>>>>>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>>>>>>>>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
>>>>>>>>>>>>>>>>>>>>>>>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>>>>>>>>>>>   --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Test with base_checkpoint:
>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Combining output files:
>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --stop_training \
>>>>>>>>>>>>>>>>>>>>>>>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Test with eng.traineddata:
>>>>>>>>>>>>>>>>>>>>>>>> tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> The training from scratch ended as:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, word train=0%, skip ratio=0%, New best char error = 0 wrote best model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint wrote checkpoint.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> The test with base_checkpoint returns only:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> The test with eng.traineddata and e13b.png returns out.txt. Both files are attached.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Training seems to have worked fine. I don't know how to interpret the test result from base_checkpoint. The generated eng.traineddata obviously doesn't work well.
>>>>>>>>>>>>>>>>>>>>>>>> I suspect the choice of --traineddata in combining output files is bad but I have no clue.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> BTW, I referred to your tess4training in the process. It helped a lot.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday, May 29, 2019 at 19:14:08 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, May 29, 2019 at 3:18 PM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> I wish to make traineddata for the E13B font.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> I read the training tutorial and made a base_checkpoint file according to the method in Training From Scratch. Now, how can I make traineddata from the base_checkpoint file?

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/3cecc106-fbb9-4a4a-bd98-e992ec034cef%40googlegroups.com.