On Wednesday, August 7, 2019 at 2:36:52 AM UTC+2, ElGato ElMago wrote:

Hi,

I'm thinking of sharing it, of course. What is the best way to do it? After all this, my own contribution is only how I prepared the training text, and even that consists of Shree's text and mine. The instructions and tools I used already exist.

ElMagoElGato

If you have a GitHub account, just create a repo and publish the data and instructions.
On Wednesday, August 7, 2019 at 8:20:02 AM UTC+9, Mamadou wrote:

Hello,

Are you planning to release the dataset or models? I'm working on the same subject and planning to share both under BSD terms.

On Tuesday, August 6, 2019 at 10:11:40 AM UTC+2, ElGato ElMago wrote:

Hi,

FWIW, I got to the point where I can feel happy with the accuracy. As the images in the previous post show, the symbols, especially the on-us symbol and the amount symbol, were getting mixed up with each other or with other characters. I added many more symbols to the training text and formed words that start with a symbol. One example is as follows:

9;:;=;<;< <0<1<3<4;6;8;9;:;=;

I randomly made 8,000 lines like this. When fine-tuning from eng, 5,000 iterations were almost good. The amount symbol is still confused a little when it's followed by a 0. Fine-tuning tends to get dragged around by small details. I'll have to think of something to make further improvements.

Training from scratch produced a somewhat more stable traineddata. It doesn't get confused by symbols so often, but it tends to generate extra spaces. By 10,000 iterations, those spaces are gone and recognition became very solid.

I thought I might have to do image and box file training, but I guess it's not needed this time.

ElMagoElGato

On Friday, July 26, 2019 at 2:08:06 PM UTC+9, ElGato ElMago wrote:

Hi,

Well, I read the description of ScrollView (https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging) and it says:

    To show the characters, deselect DISPLAY/Bounding Boxes, select DISPLAY/Polygonal Approx and then select OTHER/Uniform display.

It basically works, but for some reason it doesn't work on my e13b image and ends up with a blue screen. Anyway, it shows each box separately when a character consists of multiple boxes. I'd like to show the box for the whole character. ScrollView doesn't do that, at least not yet. I'll do it on my own.

ElMagoElGato

On Wednesday, July 24, 2019 at 2:10:46 PM UTC+9, ElGato ElMago wrote:

Hi,

I got this result from hocr. This is where one of the phantom characters comes from.

<span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 98.864532'><</span>
<span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; x_conf 99.018097'>;</span>

The first character is the phantom. It starts at the same x position as the second character, which really exists, and its box is only 3 pixels wide. I attach ScrollView screenshots that visualize this.

[image: 2019-07-24-132643_854x707_scrot.png]
[image: 2019-07-24-132800_854x707_scrot.png]

There seem to be more cases that cause phantom characters. I'll look into them. But I have a trivial question now. I made ScrollView show these displays by accidentally clicking the Display->Blamer menu. There is a Bounding Boxes menu below it, but it ends up showing a blue screen, though it briefly shows the boxes on the way. Can I use this menu at all? It would be very useful.

[image: 2019-07-24-140739_854x707_scrot.png]

On Tuesday, July 23, 2019 at 5:10:36 PM UTC+9, ElGato ElMago wrote:

It's great! Perfect! Thanks a lot!
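A minimal sketch of how training lines like the ones in the August 6 message above could be generated, assuming a simple random mix of digits and the four E13B symbol stand-ins (':', ';', '<', '='); the word lengths, proportions, and output file name here are assumptions, not the exact recipe used in the thread:

import random

# ASCII stand-ins for the four E13B symbols as they appear in the thread's
# sample lines (':', ';', '<', '='); digits 0-9 complete the 14-character set.
DIGITS = "0123456789"
SYMBOLS = ":;<="

def make_word(max_len=8):
    # Start every word with a symbol so symbols are seen in word-initial
    # position, then follow with a random mix of digits and symbols.
    chars = [random.choice(SYMBOLS)]
    chars += random.choices(DIGITS + SYMBOLS, k=random.randint(1, max_len - 1))
    return "".join(chars)

def make_line(num_words):
    return " ".join(make_word() for _ in range(num_words))

if __name__ == "__main__":
    random.seed(0)
    with open("e13b.symbols.training_text", "w") as f:
        for _ in range(8000):  # 8,000 lines, as in the August 6 message
            f.write(make_line(random.randint(2, 4)) + "\n")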
On Tuesday, July 23, 2019 at 10:56:58 AM UTC+9, shree wrote:

See https://github.com/tesseract-ocr/tesseract/issues/2580

On Tue, 23 Jul 2019, 06:23 ElGato ElMago, <elmago...@gmail.com> wrote:

Hi,

I read the output of hocr with lstm_choice_mode = 4, as in pull request 2554. It shows the candidates for each character but doesn't show the bounding box of each character. It only shows the box for a whole word.

I see bounding boxes of each character in the comments of pull request 2576. How can I do that? Do I have to look into the source code and produce such output on my own?

On Friday, July 19, 2019 at 6:40:49 PM UTC+9, ElGato ElMago wrote:

Lorenzo,

I haven't been checking psm too much. I'll turn to those options after I see how it goes with bounding boxes.

Shree,

I see the merges in the git log and also see that the new option lstm_choice_amount works now. I guess my executable is the latest, though I still see the phantom character. Hocr makes huge and complex output. I'll take some time to read it.

On Friday, July 19, 2019 at 6:20:55 PM UTC+9, Claudiu wrote:

Is there any way to pass bounding boxes for the LSTM to use? We have an algorithm that cleanly gets bounding boxes of MRZ characters. However, the results using psm 10 are worse than passing the whole line in. Yet when we pass the whole line in, we get these phantom characters.

Should PSM 10 mode work? It often returns "no character" where there clearly is one. I can supply a test case if it is expected to work well.

On Fri, Jul 19, 2019 at 11:06 AM ElGato ElMago <elmago...@gmail.com> wrote:

Lorenzo,

We both have the same case. It seems a solution to this problem would help a lot of people.

Shree,

I pulled the current head of the master branch, but it doesn't seem to contain the merges you pointed to that were merged 3 to 4 days ago. How can I get them?

ElMagoElGato

On Friday, July 19, 2019 at 5:02:53 PM UTC+9, Lorenzo Blz wrote:

PSM 7 was a partial solution for my specific case; it improved the situation but did not solve it. Also, I could not use it in some other cases.

The proper solution is very likely doing more training with more data; some data augmentation might help if data is scarce. Also, doing less training might help if the training is not done correctly.

There are also similar issues on GitHub:

https://github.com/tesseract-ocr/tesseract/issues/1465
...

The LSTM engine works like this: it scans the image and for each "pixel column" does this:

M M M M N M M M [BLANK] F F F F

(here I report only the highest-probability characters)

In the example above an M is partially seen as an N; this is normal, and another step of the algorithm (beam search, I think) tries to aggregate the correct characters back. I think cases like this:

M M M N N N M M

are what give the phantom characters. More training should reduce the source of the problem, or a painful analysis of the bounding boxes might fix some cases.

I used the attached script for the boxes.

Lorenzo
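Lorenzo's per-column example can be made concrete with a toy decode. In CTC-style decoding, consecutive repeats of the top label are merged and blanks are dropped; a single stray column can often be aggregated back by the later beam-search step, but a sustained run of wrong labels survives as an extra "phantom" character. The sketch below only illustrates the greedy collapse step, not Tesseract's actual beam search:

# Toy greedy collapse of per-column top labels: merge consecutive repeats,
# then drop blanks. This only illustrates how a run of misread columns turns
# into an extra character; Tesseract's real decoder is a beam search over
# the full per-column probability distributions.
BLANK = "[BLANK]"

def collapse(columns):
    out, prev = [], None
    for label in columns:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)

# A single stray 'N' column inside a run of 'M's: greedy collapse yields
# "MNMF"; the later aggregation step (beam search) tries to fold the stray
# column back into the surrounding character.
print(collapse("M M M M N M M M [BLANK] F F F F".split()))

# Several consecutive misread columns: the extra label survives collapsing
# ("MNM"), which is the kind of case that shows up as a phantom character.
print(collapse("M M M N N N M M".split()))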
On Fri, Jul 19, 2019 at 07:25, ElGato ElMago <elmago...@gmail.com> wrote:

Hi,

Let's call them phantom characters, then.

Was psm 7 the solution for issue 1778? None of the psm options solved my problem, though I see different output.

I mostly use tesseract 5.0-alpha, but 4.1 showed the same results anyway. How did you get the bounding box for each character? Alto and lstmbox only show a bbox for a group of characters.

ElMagoElGato

On Wednesday, July 17, 2019 at 6:58:31 PM UTC+9, Lorenzo Blz wrote:

Phantom characters here for me too:

https://github.com/tesseract-ocr/tesseract/issues/1778

Are you using 4.1? Bounding boxes were fixed in 4.1; maybe this was also improved.

I wrote some code that uses the symbol iterator to discard symbols that are clearly duplicated: too small, overlapping, etc. But it was not easy to make it work decently, and it is not 100% reliable, with false negatives and positives. I cannot share the code, and it is quite ugly anyway.

Here is another MRZ model with training data:

https://github.com/DoubangoTelecom/tesseractMRZ

Lorenzo

On Wed, Jul 17, 2019 at 11:26, Claudiu <csaf...@gmail.com> wrote:

I'm getting the "phantom character" issue as well, using the OCRB that Shree trained on MRZ lines. For example, for a 0 it will sometimes add both a 0 and an O to the output, thus outputting 45 characters total instead of 44. I haven't looked at the bounding box output yet, but I suspect a phantom thin character is added somewhere that I can discard, or maybe two characters will have the same bounding box. If anyone else has fixed this issue further up (e.g. so the output doesn't contain the phantom characters in the first place), I'd be interested.
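In the same spirit as discarding too-small or overlapping symbols, here is a minimal sketch of post-filtering the per-character hOCR output shown earlier in the thread (the ocrx_cinfo spans produced with lstm_choice_mode = 4): it flags boxes only a few pixels wide as likely phantoms. The regular expression, the 3-pixel threshold, and the assumption that the character appears unescaped inside the span (real hOCR escapes '<' as '&lt;') are all simplifications, not anyone's production code:

import re
import sys

# Per-character spans in the hOCR output look like:
#   <span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 98.86'>...</span>
CINFO = re.compile(
    r"class='ocrx_cinfo' title='x_bboxes (\d+) (\d+) (\d+) (\d+); "
    r"x_conf ([0-9.]+)'>(.*?)</span>")

def suspicious_chars(hocr_text, min_width=3):
    # Yield characters whose bounding box is at most min_width pixels wide;
    # in the July 24 example the phantom '<' box is only 3 pixels wide.
    for m in CINFO.finditer(hocr_text):
        x0, y0, x1, y1 = (int(m.group(i)) for i in range(1, 5))
        conf, ch = float(m.group(5)), m.group(6)
        if x1 - x0 <= min_width:
            yield ch, (x0, y0, x1, y1), conf

if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8") as f:
        for ch, box, conf in suspicious_chars(f.read()):
            print("possible phantom %r at %s, conf %.2f" % (ch, box, conf))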
On Wed, Jul 17, 2019 at 10:01 AM ElGato ElMago <elmago...@gmail.com> wrote:

Hi,

I'll go back to more training later. Before doing so, I'd like to investigate the results a little bit. The hocr and lstmbox options give some details of the positions of characters. The results show positions that perfectly correspond to the letters in the image, but the text output contains a character that obviously does not exist.

Then I found a config file 'lstmdebug' that generates far more information. I hope it explains what happened with each character. I have yet to read the debug output, but I'd appreciate it if someone could tell me how to read it, because it's really complex.

Regards,
ElMagoElGato

On Friday, June 14, 2019 at 7:58:49 PM UTC+9, shree wrote:

See https://github.com/Shreeshrii/tessdata_MICR

I have uploaded my files there.

https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh is the bash script that runs the training.

You can modify it as needed. Please note this is for legacy/base tesseract --oem 0.

On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago <elmago...@gmail.com> wrote:

Thanks a lot, shree. It seems you know everything.

I tried the MICR0.traineddata and the first two mcr.traineddata. The last one was blocked by the browser. Each of the traineddata files had mixed results. All of them get the symbols fairly well but insert spaces randomly and read some numbers wrong.

MICR0 seems the best among them. Did you suggest that you'd be able to update it? It gets triple D very often where there's only one, and so on.

Also, I tried to fine-tune from MICR0, but I found that I need to change language-specific.sh. It specifies some parameters for each language. Do you have any guidance for it?
On Friday, June 14, 2019 at 1:48:40 AM UTC+9, shree wrote:

See
http://www.devscope.net/Content/ocrchecks.aspx
https://github.com/BigPino67/Tesseract-MICR-OCR
https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ

On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <elmago...@gmail.com> wrote:

It would be nice if there were traineddata out there, but I didn't find any. I see free fonts and commercial OCR software but no traineddata. The tessdata repository obviously doesn't have one, either.

On Saturday, June 8, 2019 at 1:52:10 AM UTC+9, shree wrote:

Please also search for existing MICR traineddata files.

On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <elmago...@gmail.com> wrote:

So I did several tests from scratch. In the last attempt, I made a training text with 4,000 lines in the following format:

110004310510< <02 :4002=0181:801= 0008752 <00039 ;0000001000;

and combined it with eng.digits.training_text, in which the symbols are converted to E13B symbols. This makes about 12,000 lines of training text. It's amazing that this thing generates a good reader out of nowhere. But then it is not very good. For example:

<01 :1901=1386:021= 1111001<10001< ;0000090134;

is a result on the attached image. It's close, but the last '<' in the result text doesn't exist in the image. It's a small failure, but it causes greater trouble in parsing.

What would you suggest from here to increase accuracy?

- Increase the number of lines in the training text
- Mix more variations into the training text
- Increase the number of iterations
- Investigate wrong reads one by one
- Or something else?

Also, I referred to engrestrict*.* and could generate a similar result with the fine-tuning-from-full method. It seems a bit faster to get to the same level, but it also stops at a 'good' level. I can go either way if it takes me to the bright future.

Regards,
ElMagoElGato

On Thursday, May 30, 2019 at 3:56:02 PM UTC+9, ElGato ElMago wrote:

Thanks a lot, Shree. I'll look into it.
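As an aside on the June 6 message above: combining with eng.digits.training_text means mapping the punctuation in that file onto the four E13B symbol stand-ins. A minimal sketch of how that substitution might be scripted follows; the particular punctuation-to-symbol mapping and the file names are assumptions, since the thread doesn't show the actual conversion:

# Hypothetical mapping from punctuation in eng.digits.training_text to the
# ASCII stand-ins used for the four E13B symbols; the real mapping used in
# the experiment is not shown in the thread.
E13B_MAP = str.maketrans({
    ".": ":",   # transit
    ",": ";",   # amount
    "-": "=",   # dash
    "/": "<",   # on-us
})

with open("eng.digits.training_text", encoding="utf-8") as src, \
     open("eng.e13b.training_text", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line.translate(E13B_MAP))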
On Thursday, May 30, 2019 at 2:39:52 PM UTC+9, shree wrote:

See https://github.com/Shreeshrii/tessdata_shreetest

Look at the files engrestrict*.* and also
https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text

Create a training text of about 100 lines and finetune for 400 lines.

On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <elmago...@gmail.com> wrote:

I had about 14 lines, as attached. How many lines would you recommend?

Fine tuning gives a much better result, but it tends to pick characters other than those in E13B, which only has 14 characters: 0 through 9 and 4 symbols. I thought training from scratch would eliminate such confusion.

On Thursday, May 30, 2019 at 10:43:08 AM UTC+9, shree wrote:

For training from scratch, a large training text and hundreds of thousands of iterations are recommended.

If you are just fine tuning for a font, try to follow the instructions for training for impact, with your font.

On Thu, 30 May 2019, 06:05 ElGato ElMago, <elmago...@gmail.com> wrote:

Thanks, Shree.

Yes, I saw the instructions.
The steps I made >>>>>>>>>>>>>>>>>>>>>>>>>>> are as follows: >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Using tesstrain.sh: >>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/tesstrain.sh --fonts_dir >>>>>>>>>>>>>>>>>>>>>>>>>>> /usr/share/fonts --lang eng --linedata_only \ >>>>>>>>>>>>>>>>>>>>>>>>>>> --noextract_font_properties --langdata_dir >>>>>>>>>>>>>>>>>>>>>>>>>>> ../langdata \ >>>>>>>>>>>>>>>>>>>>>>>>>>> --tessdata_dir ./tessdata \ >>>>>>>>>>>>>>>>>>>>>>>>>>> --fontlist "E13Bnsd" --output_dir >>>>>>>>>>>>>>>>>>>>>>>>>>> ~/tesstutorial/e13beval \ >>>>>>>>>>>>>>>>>>>>>>>>>>> --training_text >>>>>>>>>>>>>>>>>>>>>>>>>>> ../langdata/eng/eng.training_e13b_text >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Training from scratch: >>>>>>>>>>>>>>>>>>>>>>>>>>> mkdir -p ~/tesstutorial/e13boutput >>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --debug_interval 100 \ >>>>>>>>>>>>>>>>>>>>>>>>>>> --traineddata >>>>>>>>>>>>>>>>>>>>>>>>>>> ~/tesstutorial/e13beval/eng/eng.traineddata \ >>>>>>>>>>>>>>>>>>>>>>>>>>> --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 >>>>>>>>>>>>>>>>>>>>>>>>>>> Lfx96 Lrx96 Lfx256 O1c111]' \ >>>>>>>>>>>>>>>>>>>>>>>>>>> --model_output ~/tesstutorial/e13boutput/base >>>>>>>>>>>>>>>>>>>>>>>>>>> --learning_rate 20e-4 \ >>>>>>>>>>>>>>>>>>>>>>>>>>> --train_listfile >>>>>>>>>>>>>>>>>>>>>>>>>>> ~/tesstutorial/e13beval/eng.training_files.txt \ >>>>>>>>>>>>>>>>>>>>>>>>>>> --eval_listfile >>>>>>>>>>>>>>>>>>>>>>>>>>> ~/tesstutorial/e13beval/eng.training_files.txt \ >>>>>>>>>>>>>>>>>>>>>>>>>>> --max_iterations 5000 >>>>>>>>>>>>>>>>>>>>>>>>>>> &>~/tesstutorial/e13boutput/basetrain.log >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Test with base_checkpoint: >>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmeval --model >>>>>>>>>>>>>>>>>>>>>>>>>>> ~/tesstutorial/e13boutput/base_checkpoint \ >>>>>>>>>>>>>>>>>>>>>>>>>>> --traineddata >>>>>>>>>>>>>>>>>>>>>>>>>>> ~/tesstutorial/e13beval/eng/eng.traineddata \ >>>>>>>>>>>>>>>>>>>>>>>>>>> --eval_listfile >>>>>>>>>>>>>>>>>>>>>>>>>>> ~/tesstutorial/e13beval/eng.training_files.txt >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Combining output files: >>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --stop_training \ >>>>>>>>>>>>>>>>>>>>>>>>>>> --continue_from >>>>>>>>>>>>>>>>>>>>>>>>>>> ~/tesstutorial/e13boutput/base_checkpoint \ >>>>>>>>>>>>>>>>>>>>>>>>>>> --traineddata >>>>>>>>>>>>>>>>>>>>>>>>>>> ~/tesstutorial/e13beval/eng/eng.traineddata \ >>>>>>>>>>>>>>>>>>>>>>>>>>> --model_output >>>>>>>>>>>>>>>>>>>>>>>>>>> ~/tesstutorial/e13boutput/eng.traineddata >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Test with eng.traineddata: >>>>>>>>>>>>>>>>>>>>>>>>>>> tesseract e13b.png out --tessdata-dir >>>>>>>>>>>>>>>>>>>>>>>>>>> /home/koichi/tesstutorial/e13boutput >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> The training from scratch ended as: >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, >>>>>>>>>>>>>>>>>>>>>>>>>>> delta=0%, char train=0%, word train=0%, skip >>>>>>>>>>>>>>>>>>>>>>>>>>> ratio=0%, New best char error >>>>>>>>>>>>>>>>>>>>>>>>>>> = 0 wrote best >>>>>>>>>>>>>>>>>>>>>>>>>>> model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint >>>>>>>>>>>>>>>>>>>>>>>>>>> wrote >>>>>>>>>>>>>>>>>>>>>>>>>>> checkpoint. 
The test with base_checkpoint returns nothing, just:

At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0

The test with eng.traineddata and e13b.png produces out.txt. Both files are attached.

Training seems to have worked fine. I don't know how to interpret the test result from base_checkpoint. The generated eng.traineddata obviously doesn't work well. I suspect the choice of --traineddata when combining the output files is bad, but I have no clue.

Regards,
ElMagoElGato

BTW, I referred to your tess4training in the process. It helped a lot.

On Wednesday, May 29, 2019 at 7:14:08 PM UTC+9, shree wrote:

See https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files

On Wed, May 29, 2019 at 3:18 PM ElGato ElMago <elmago...@gmail.com> wrote:

Hi,

I wish to make a traineddata for the E13B font.

I read the training tutorial and made a base_checkpoint file according to the method in Training From Scratch. Now, how can I make a traineddata from the base_checkpoint file?