Hello, are you planning to release the dataset or the models? I'm working on the same subject and planning to share both under BSD terms.
On Tuesday, August 6, 2019 at 10:11:40 AM UTC+2, ElGato ElMago wrote:

> Hi,
>
> FWIW, I got to the point where I can feel happy with the accuracy. As the images of the previous post show, the symbols, especially the on-us symbol and the amount symbol, were getting mixed up with each other or with other characters. I added many more symbols to the training text and formed words that start with a symbol. One example is as follows:
>
> 9;:;=;<;< <0<1<3<4;6;8;9;:;=;
>
> I randomly made 8,000 lines like this. In fine-tuning from eng, 5,000 iterations were almost enough. The amount symbol is still confused a little when it's followed by a 0. Fine tuning tends to be dragged by small particles. I'll have to think of something to make further improvements.
>
> Training from scratch produced a bit more stable traineddata. It doesn't get confused by the symbols so often but tends to generate extra spaces. By 10,000 iterations, those spaces are gone and recognition became very solid.
>
> I thought I might have to do image and box file training, but I guess it's not needed this time.
>
> ElMagoElGato
>
> On Friday, July 26, 2019 at 14:08:06 UTC+9, ElGato ElMago wrote:
>>
>> Hi,
>>
>> Well, I read the description of ScrollView (https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging) and it says:
>>
>> To show the characters, deselect DISPLAY/Bounding Boxes, select DISPLAY/Polygonal Approx and then select OTHER/Uniform display.
>>
>> It basically works, but for some reason it doesn't work on my e13b image and ends up with a blue screen. Anyway, it shows each box separately when a character consists of multiple boxes. I'd like to show the box for the whole character. ScrollView doesn't do it, at least not yet. I'll do it on my own.
>>
>> ElMagoElGato
>>
>> On Wednesday, July 24, 2019 at 14:10:46 UTC+9, ElGato ElMago wrote:
>>>
>>> Hi,
>>>
>>> I got this result from hocr. This is where one of the phantom characters comes from.
>>>
>>> <span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 98.864532'><</span>
>>> <span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; x_conf 99.018097'>;</span>
>>>
>>> The first character is the phantom. It starts at the same x position as the second character, which does exist, but it is only 3 pixels wide. I attach ScrollView screenshots that visualize this.
>>>
>>> [image: 2019-07-24-132643_854x707_scrot.png][image: 2019-07-24-132800_854x707_scrot.png]
>>>
>>> There seem to be some more cases that cause phantom characters. I'll look into them. But I have a trivial question now. I made ScrollView show these displays by accidentally clicking the Display->Blamer menu. There is a Bounding Boxes menu below it, but it ends up showing a blue screen, though it briefly shows boxes on the way. Can I use this menu at all? It would be very useful.
>>>
>>> [image: 2019-07-24-140739_854x707_scrot.png]
>>>
>>> On Tuesday, July 23, 2019 at 17:10:36 UTC+9, ElGato ElMago wrote:
>>>>
>>>> It's great! Perfect! Thanks a lot!
>>>>
>>>> On Tuesday, July 23, 2019 at 10:56:58 UTC+9, shree wrote:
>>>>>
>>>>> See https://github.com/tesseract-ocr/tesseract/issues/2580
>>>>>
>>>>> On Tue, 23 Jul 2019, 06:23 ElGato ElMago, <elmago...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I read the output of hocr with lstm_choice_mode = 4 as per pull request 2554. It shows the candidates for each character but doesn't show the bounding box of each character. It only shows the box for a whole word.
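A side note on the per-character boxes discussed here: the ocrx_cinfo spans quoted further up (the 3-pixel-wide phantom '<') can be pulled apart with a few lines of Python. This is a hypothetical sketch, not code from the thread; the regex is tied to the exact title format shown in that snippet, and the 5-pixel width threshold is an arbitrary guess.

    import re

    # Extract (char, bbox, confidence) from hOCR 'ocrx_cinfo' spans and flag
    # suspiciously narrow boxes, which is one way to spot phantom characters.
    CINFO = re.compile(
        r"<span class='ocrx_cinfo' title='x_bboxes (\d+) (\d+) (\d+) (\d+); "
        r"x_conf ([\d.]+)'>(.*?)</span>")

    def char_boxes(hocr_text, min_width=5):
        for m in CINFO.finditer(hocr_text):
            x0, y0, x1, y1 = (int(v) for v in m.group(1, 2, 3, 4))
            conf = float(m.group(5))
            ch = m.group(6)
            suspect = (x1 - x0) < min_width  # phantom boxes tend to be very narrow
            yield ch, (x0, y0, x1, y1), conf, suspect

    sample = ("<span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; "
              "x_conf 98.864532'><</span>"
              "<span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; "
              "x_conf 99.018097'>;</span>")
    for ch, box, conf, suspect in char_boxes(sample):
        print(ch, box, conf, "SUSPECT" if suspect else "ok")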
>>>>>> I see bounding boxes of each character in the comments of pull request 2576. How can I do that? Do I have to look into the source code and build such an output on my own?
>>>>>>
>>>>>> On Friday, July 19, 2019 at 18:40:49 UTC+9, ElGato ElMago wrote:
>>>>>>>
>>>>>>> Lorenzo,
>>>>>>>
>>>>>>> I haven't been checking psm too much. I will turn to those options after I see how it goes with bounding boxes.
>>>>>>>
>>>>>>> Shree,
>>>>>>>
>>>>>>> I see the merges in the git log and also see that the new option lstm_choice_amount works now. I guess my executable is the latest, though I still see the phantom character. Hocr makes huge and complex output; I'll take some time to read it.
>>>>>>>
>>>>>>> On Friday, July 19, 2019 at 18:20:55 UTC+9, Claudiu wrote:
>>>>>>>>
>>>>>>>> Is there any way to pass bounding boxes to the LSTM? We have an algorithm that cleanly gets bounding boxes of MRZ characters. However, the results using psm 10 are worse than passing the whole line in. Yet when we pass the whole line in, we get these phantom characters.
>>>>>>>>
>>>>>>>> Should PSM 10 mode work? It often returns "no character" where there clearly is one. I can supply a test case if it is expected to work well.
>>>>>>>>
>>>>>>>> On Fri, Jul 19, 2019 at 11:06 AM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Lorenzo,
>>>>>>>>>
>>>>>>>>> We both have the same case. It seems a solution to this problem would help a lot of people.
>>>>>>>>>
>>>>>>>>> Shree,
>>>>>>>>>
>>>>>>>>> I pulled the current head of the master branch, but it doesn't seem to contain the merges you pointed to that were merged 3 to 4 days ago. How can I get them?
>>>>>>>>>
>>>>>>>>> ElMagoElGato
>>>>>>>>>
>>>>>>>>> On Friday, July 19, 2019 at 17:02:53 UTC+9, Lorenzo Blz wrote:
>>>>>>>>>>
>>>>>>>>>> PSM 7 was a partial solution for my specific case: it improved the situation but did not solve it. Also, I could not use it in some other cases.
>>>>>>>>>>
>>>>>>>>>> The proper solution is very likely doing more training with more data; some data augmentation might help if data is scarce. Also, doing less training might help if the training is not done correctly.
>>>>>>>>>>
>>>>>>>>>> There are also similar issues on github:
>>>>>>>>>>
>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1465
>>>>>>>>>> ...
>>>>>>>>>>
>>>>>>>>>> The LSTM engine works like this: it scans the image and for each "pixel column" emits something like
>>>>>>>>>>
>>>>>>>>>> M M M M N M M M [BLANK] F F F F
>>>>>>>>>>
>>>>>>>>>> (here I report only the highest-probability characters).
>>>>>>>>>>
>>>>>>>>>> In the example above an M is partially seen as an N. This is normal, and another step of the algorithm (beam search, I think) tries to aggregate the correct characters back together.
>>>>>>>>>>
>>>>>>>>>> I think cases like this:
>>>>>>>>>>
>>>>>>>>>> M M M N N N M M
>>>>>>>>>>
>>>>>>>>>> are what give the phantom characters. More training should reduce the source of the problem, or a painful analysis of the bounding boxes might fix some cases.
>>>>>>>>>>
>>>>>>>>>> I used the attached script for the boxes.
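The per-frame picture Lorenzo describes can be made concrete with a toy decode. This sketch is mine, not the attached script: a naive best-path collapse that drops [BLANK] frames and merges repeated labels. The real decoder runs a beam search over the full per-frame probabilities, so a single stray 'N' frame can still be absorbed, but a longer run like 'N N N' splitting the run of 'M's tends to survive as a phantom character.

    # Collapse repeated labels and drop blank frames, roughly what a
    # CTC-style best-path decode does with the per-column outputs above.
    def collapse(frames, blank="[BLANK]"):
        out = []
        prev = None
        for label in frames:
            if label != blank and label != prev:
                out.append(label)
            prev = label
        return "".join(out)

    print(collapse("M M M N N N M M".split()))  # -> 'MNM': the N run survives as a phantom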
>>>>>>>>>> Lorenzo
>>>>>>>>>>
>>>>>>>>>> On Fri, Jul 19, 2019 at 07:25, ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Let's call them phantom characters then.
>>>>>>>>>>>
>>>>>>>>>>> Was psm 7 the solution for issue 1778? None of the psm options solved my problem, though I do see different output.
>>>>>>>>>>>
>>>>>>>>>>> I use tesseract 5.0-alpha mostly, but 4.1 showed the same results anyway. How did you get a bounding box for each character? Alto and lstmbox only show a bbox for a group of characters.
>>>>>>>>>>>
>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>
>>>>>>>>>>> On Wednesday, July 17, 2019 at 18:58:31 UTC+9, Lorenzo Blz wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Phantom characters here for me too:
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1778
>>>>>>>>>>>>
>>>>>>>>>>>> Are you using 4.1? Bounding boxes were fixed in 4.1, so maybe this was also improved.
>>>>>>>>>>>>
>>>>>>>>>>>> I wrote some code that uses the symbol iterator to discard symbols that are clearly duplicated: too small, overlapping, etc. But it was not easy to make it work decently and it is not 100% reliable, with false negatives and positives. I cannot share the code, and it is quite ugly anyway.
>>>>>>>>>>>>
>>>>>>>>>>>> Here is another MRZ model with training data:
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/DoubangoTelecom/tesseractMRZ
>>>>>>>>>>>>
>>>>>>>>>>>> Lorenzo
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26, Claudiu <csaf...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I'm getting the "phantom character" issue as well, using the OCRB model that Shree trained on MRZ lines. For example, for a 0 it will sometimes add both a 0 and an O to the output, thus outputting 45 characters total instead of 44. I haven't looked at the bounding box output yet, but I suspect a phantom thin character is added somewhere that I can discard, or maybe two chars will have the same bounding box. If anyone else has fixed this issue further up (e.g. so the output doesn't contain the phantom characters in the first place), I'd be interested.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:01 AM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'll go back to more training later. Before doing so, I'd like to investigate the results a little bit. The hocr and lstmbox options give some details of the positions of characters. The results show positions that perfectly correspond to the letters in the image, but the text output contains a character that obviously does not exist.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Then I found a config file 'lstmdebug' that generates far more information. I hope it explains what happened with each character. I'm yet to read the debug output, but I'd appreciate it if someone could tell me how to read it, because it's really complex.
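Lorenzo's filter is not shared, but a rough idea of its shape might look like the following, assuming per-symbol (character, box, confidence) tuples taken from the hOCR output above or from a symbol-level iterator; the width and overlap thresholds are invented for illustration only.

    # Hypothetical duplicate-symbol filter: drop very thin boxes and boxes
    # that mostly overlap an already-kept, higher-confidence box (the
    # "0 plus O at the same place" case described in the thread).
    def overlap_ratio(a, b):
        """Horizontal overlap of box a with box b, as a fraction of a's width."""
        ax0, _, ax1, _ = a
        bx0, _, bx1, _ = b
        inter = max(0, min(ax1, bx1) - max(ax0, bx0))
        return inter / max(1, ax1 - ax0)

    def drop_phantoms(symbols, min_width=5, max_overlap=0.8):
        kept = []
        for ch, box, conf in symbols:
            too_thin = (box[2] - box[0]) < min_width
            dup = any(overlap_ratio(box, kbox) > max_overlap and conf <= kconf
                      for _, kbox, kconf in kept)
            if not (too_thin or dup):
                kept.append((ch, box, conf))
        return kept

As Lorenzo notes, this kind of post-filtering is fragile (false positives and negatives); more training data is the more robust fix.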
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Friday, June 14, 2019 at 19:58:49 UTC+9, shree wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> See https://github.com/Shreeshrii/tessdata_MICR
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have uploaded my files there.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh is the bash script that runs the training.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You can modify it as needed. Please note this is for legacy/base tesseract --oem 0.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks a lot, shree. It seems you know everything.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I tried the MICR0.traineddata and the first two mcr.traineddata; the last one was blocked by the browser. Each of the traineddata had mixed results. All of them get the symbols fairly well but insert spaces randomly and read some numbers wrong.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> MICR0 seems the best among them. Did you suggest that you'd be able to update it? It gets triple D very often where there's only one, and so on.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Also, I tried to fine tune from MICR0, but I found that I need to change language-specific.sh. It specifies some parameters for each language. Do you have any guidance for it?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Friday, June 14, 2019 at 1:48:40 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> See
>>>>>>>>>>>>>>>>> http://www.devscope.net/Content/ocrchecks.aspx
>>>>>>>>>>>>>>>>> https://github.com/BigPino67/Tesseract-MICR-OCR
>>>>>>>>>>>>>>>>> https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> That would be nice if there were traineddata out there, but I didn't find any. I see free fonts and commercial OCR software but no traineddata. The tessdata repository obviously doesn't have one, either.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Saturday, June 8, 2019 at 1:52:10 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Please also search for existing MICR traineddata files.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> So I did several tests from scratch. In the last attempt, I made a training text with 4,000 lines in the following format,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 110004310510< <02 :4002=0181:801= 0008752 <00039 ;0000001000;
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> and combined it with eng.digits.training_text, in which the symbols are converted to E13B symbols. This makes about 12,000 lines of training text. It's amazing that this thing generates a good reader out of nowhere. But then, it is not very good.
>>>>>>>>>>>>>>>>>>>> For example:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> <01 :1901=1386:021= 1111001<10001< ;0000090134;
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> is the result on the attached image. It's close, but the last '<' in the result text doesn't exist in the image. It's a small failure, but it causes greater trouble in parsing.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> What would you suggest from here to increase accuracy?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> - Increase the number of lines in the training text
>>>>>>>>>>>>>>>>>>>> - Mix up more variations in the training text
>>>>>>>>>>>>>>>>>>>> - Increase the number of iterations
>>>>>>>>>>>>>>>>>>>> - Investigate wrong reads one by one
>>>>>>>>>>>>>>>>>>>> - Or something else?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Also, I referred to engrestrict*.* and could generate a similar result with the fine-tuning-from-full method. It seems a bit faster to get to the same level, but it also stops at a 'good' level. I can go with either way if it takes me to the bright future.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 15:56:02 UTC+9, ElGato ElMago wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks a lot, Shree. I'll look into it.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 14:39:52 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> See https://github.com/Shreeshrii/tessdata_shreetest
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Look at the files engrestrict*.* and also https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Create a training text of about 100 lines and finetune for 400 lines.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I had about 14 lines, as attached. How many lines would you recommend?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Fine tuning gives a much better result, but it tends to pick characters that are not in E13B, which only has 14 characters: 0 through 9 and 4 symbols. I thought training from scratch would eliminate such confusion.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 10:43:08 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> For training from scratch, a large training text and hundreds of thousands of iterations are recommended.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> If you are just fine tuning for a font, try to follow the instructions for training for Impact, with your font.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago, <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks, Shree.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I saw the instructions.
>>>>>>>>>>>>>>>>>>>>>>>>> The steps I made are as follows.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Using tesstrain.sh:
>>>>>>>>>>>>>>>>>>>>>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>>>>>>>>>>>>>>>>>>>>>>>>>   --noextract_font_properties --langdata_dir ../langdata \
>>>>>>>>>>>>>>>>>>>>>>>>>   --tessdata_dir ./tessdata \
>>>>>>>>>>>>>>>>>>>>>>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>>>>>>>>>>>>>>>>>>>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Training from scratch:
>>>>>>>>>>>>>>>>>>>>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>>>>>>>>>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
>>>>>>>>>>>>>>>>>>>>>>>>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>>>>>>>>>>>>   --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Test with base_checkpoint:
>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Combining output files:
>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --stop_training \
>>>>>>>>>>>>>>>>>>>>>>>>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Test with eng.traineddata:
>>>>>>>>>>>>>>>>>>>>>>>>> tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> The training from scratch ended as:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, word train=0%, skip ratio=0%, New best char error = 0 wrote best model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint wrote checkpoint.
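For reference, the random E13B lines described earlier (the 8,000 lines such as '9;:;=;<;< <0<1<3<4;6;8;9;:;=;' and the 4,000 check-like lines) could be produced by a small generator along these lines. This is a hypothetical sketch, not the script actually used; the field lengths and counts are made up, and the output file name simply mirrors the --training_text argument above.

    import random

    # E13B has only 14 glyphs: the digits plus 4 symbols, which this thread's
    # training text encodes as ';', ':', '=' and '<'.
    DIGITS = "0123456789"
    SYMBOLS = ";:=<"

    def random_field(min_len=2, max_len=12):
        """A digit run wrapped in symbols, loosely imitating a MICR field."""
        body = "".join(random.choice(DIGITS)
                       for _ in range(random.randint(min_len, max_len)))
        return random.choice(SYMBOLS) + body + random.choice(SYMBOLS)

    def random_line(n_fields=4):
        return " ".join(random_field() for _ in range(n_fields))

    # e.g. 8,000 lines, as in the fine-tuning experiment described above
    with open("eng.training_e13b_text", "w") as f:
        for _ in range(8000):
            f.write(random_line() + "\n")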
>>>>>>>>>>>>>>>>>>>>>>>>> The test with base_checkpoint returns nothing, just:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> The test with eng.traineddata and e13b.png returns out.txt. Both files are attached.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Training seems to have worked fine. I don't know how to interpret the test result from base_checkpoint. The generated eng.traineddata obviously doesn't work well. I suspect the choice of --traineddata when combining the output files is bad, but I have no clue.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> BTW, I referred to your tess4training in the process. It helped a lot.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday, May 29, 2019 at 19:14:08 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> See https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, May 29, 2019 at 3:18 PM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I wish to make a traineddata for the E13B font.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I read the training tutorial and made a base_checkpoint file according to the method in Training From Scratch. Now, how can I make a traineddata from the base_checkpoint file?