Here's what I've shared on GitHub. I hope it's of use to somebody. https://github.com/ElMagoElGato/tess_e13b_training
On Thursday, August 8, 2019 at 9:35:17 AM UTC+9, ElGato ElMago wrote:
>
> OK, I'll do so. I need to reorganize the naming and so on a little bit. It will be out there soon.
>
> On Wednesday, August 7, 2019 at 9:11:01 PM UTC+9, Mamadou wrote:
>>
>> On Wednesday, August 7, 2019 at 2:36:52 AM UTC+2, ElGato ElMago wrote:
>>>
>>> Hi,
>>>
>>> I'm thinking of sharing it, of course. What is the best way to do it? After all this, my only contribution is how I prepared the training text. Even that consists of Shree's text and mine. The instructions and tools I used already exist.
>>>
>> If you have a GitHub account, just create a repo and publish the data and instructions.
>>
>>>
>>> ElMagoElGato
>>>
>>> On Wednesday, August 7, 2019 at 8:20:02 AM UTC+9, Mamadou wrote:
>>>
>>>> Hello,
>>>> Are you planning to release the dataset or models?
>>>> I'm working on the same subject and planning to share both under BSD terms.
>>>>
>>>> On Tuesday, August 6, 2019 at 10:11:40 AM UTC+2, ElGato ElMago wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> FWIW, I got to the point where I'm happy with the accuracy. As the images in the previous post show, the symbols, especially the on-us symbol and the amount symbol, were being confused with each other or with another character. I added many more symbols to the training text and formed words that start with a symbol. One example is as follows:
>>>>>
>>>>> 9;:;=;<;< <0<1<3<4;6;8;9;:;=;
>>>>>
>>>>> I randomly generated 8,000 lines like this. In fine-tuning from eng, 5,000 iterations were almost good enough. The amount symbol is still confused a little when it's followed by 0. Fine-tuning tends to be dragged by small particles. I'll have to think of something to improve it further.
>>>>>
>>>>> Training from scratch produced a somewhat more stable traineddata. It doesn't get confused by symbols as often but tends to generate extra spaces. By 10,000 iterations, those spaces were gone and recognition became very solid.
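The random training lines described above could be generated along these lines. This is only a minimal sketch: the thread does not say how the 8,000 lines were actually produced, so the function names, the word and line lengths, and the `symbol_ratio` weighting are all assumptions made for illustration. The characters `;`, `:`, `=` and `<` are the stand-ins for the four E13B symbols that appear in the example line.

```python
import random

DIGITS = "0123456789"
SYMBOLS = ";:=<"  # stand-in characters for the four E13B symbols, as in the thread


def make_word(rng, min_len=2, max_len=20, symbol_ratio=0.5):
    """Build one word that starts with a symbol and mixes digits and symbols.

    The length range and symbol_ratio are illustrative guesses, not values
    taken from the thread.
    """
    length = rng.randint(min_len, max_len)
    chars = [rng.choice(SYMBOLS)]  # every word starts with a symbol
    for _ in range(length - 1):
        pool = SYMBOLS if rng.random() < symbol_ratio else DIGITS
        chars.append(rng.choice(pool))
    return "".join(chars)


def make_line(rng, min_words=1, max_words=4):
    """Join a random number of words with single spaces."""
    count = rng.randint(min_words, max_words)
    return " ".join(make_word(rng) for _ in range(count))


def make_training_text(n_lines=8000, seed=0):
    """Return n_lines random lines; the seed makes the output reproducible."""
    rng = random.Random(seed)
    return [make_line(rng) for _ in range(n_lines)]


if __name__ == "__main__":
    for line in make_training_text(n_lines=5):
        print(line)
```

Writing the returned lines to a file, one per line, gives a text that can be passed to tesstrain.sh through --training_text in the same way as the hand-made text.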
>>>>> >>>>> I thought I might have to do image and box file training but I guess >>>>> it's not needed this time. >>>>> >>>>> ElMagoElGato >>>>> >>>>> 2019年7月26日金曜日 14時08分06秒 UTC+9 ElGato ElMago: >>>>>> >>>>>> HI, >>>>>> >>>>>> Well, I read the description of ScrollView ( >>>>>> https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging) and >>>>>> it says: >>>>>> >>>>>> To show the characters, deselect DISPLAY/Bounding Boxes, select >>>>>> DISPLAY/Polygonal Approx and then select OTHER/Uniform display. >>>>>> >>>>>> >>>>>> It basically works. But for some reason, it doesn't work on my e13b >>>>>> image and ends up with a blue screen. Anyway, it shows each box >>>>>> separately >>>>>> when a character is consist of multiple boxes. I'd like to show the box >>>>>> for the whole character. ScrollView doesn't do it, at least, yet. I'll >>>>>> do >>>>>> it on my own. >>>>>> >>>>>> ElMagoElGato >>>>>> >>>>>> 2019年7月24日水曜日 14時10分46秒 UTC+9 ElGato ElMago: >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> >>>>>>> I got this result from hocr. This is where one of the phantom >>>>>>> characters comes from. >>>>>>> >>>>>>> <span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf >>>>>>> 98.864532'><</span> >>>>>>> <span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; x_conf >>>>>>> 99.018097'>;</span> >>>>>>> >>>>>>> >>>>>>> The firs character is the phantom. It starts with the second >>>>>>> character that exists on x axis. The first character only has 3 points >>>>>>> width. I attach ScrollView screen shots that visualize this. >>>>>>> >>>>>>> [image: 2019-07-24-132643_854x707_scrot.png][image: >>>>>>> 2019-07-24-132800_854x707_scrot.png] >>>>>>> >>>>>>> >>>>>>> There seem to be some more cases to cause phantom characters. I'll >>>>>>> look them in. But I have a trivial question now. I made ScrollView >>>>>>> show >>>>>>> these displays by accidentally clicking Display->Blamer menu. 
There is >>>>>>> Bounding Boxes menu below but it ends up showing a blue screen though >>>>>>> it >>>>>>> briefly shows boxes on the way. Can I use this menu at all? It'll be >>>>>>> very >>>>>>> useful. >>>>>>> >>>>>>> [image: 2019-07-24-140739_854x707_scrot.png] >>>>>>> >>>>>>> >>>>>>> 2019年7月23日火曜日 17時10分36秒 UTC+9 ElGato ElMago: >>>>>>>> >>>>>>>> It's great! Perfect! Thanks a lot! >>>>>>>> >>>>>>>> 2019年7月23日火曜日 10時56分58秒 UTC+9 shree: >>>>>>>>> >>>>>>>>> See https://github.com/tesseract-ocr/tesseract/issues/2580 >>>>>>>>> >>>>>>>>> On Tue, 23 Jul 2019, 06:23 ElGato ElMago, <elmago...@gmail.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> I read the output of hocr with lstm_choice_mode = 4 as to the >>>>>>>>>> pull request 2554. It shows the candidates for each character but >>>>>>>>>> doesn't >>>>>>>>>> show bounding box of each character. I only shows the box for a >>>>>>>>>> whole word. >>>>>>>>>> >>>>>>>>>> I see bounding boxes of each character in comments of the pull >>>>>>>>>> request 2576. How can I do that? Do I have to look in the source >>>>>>>>>> code and >>>>>>>>>> manipulate such an output on my own? >>>>>>>>>> >>>>>>>>>> 2019年7月19日金曜日 18時40分49秒 UTC+9 ElGato ElMago: >>>>>>>>>> >>>>>>>>>>> Lorenzo, >>>>>>>>>>> >>>>>>>>>>> I haven't been checking psm too much. Will turn to those >>>>>>>>>>> options after I see how it goes with bounding boxes. >>>>>>>>>>> >>>>>>>>>>> Shree, >>>>>>>>>>> >>>>>>>>>>> I see the merges in the git log and also see that new >>>>>>>>>>> option lstm_choice_amount works now. I guess my executable is >>>>>>>>>>> latest >>>>>>>>>>> though I still see the phantom character. Hocr makes huge and >>>>>>>>>>> complex >>>>>>>>>>> output. I'll take some to read it. >>>>>>>>>>> >>>>>>>>>>> 2019年7月19日金曜日 18時20分55秒 UTC+9 Claudiu: >>>>>>>>>>>> >>>>>>>>>>>> Is there any way to pass bounding boxes to use to the LSTM? We >>>>>>>>>>>> have an algorithm that cleanly gets bounding boxes of MRZ >>>>>>>>>>>> characters. 
>>>>>>>>>>>> However the results using psm 10 are worse than passing the whole >>>>>>>>>>>> line in. >>>>>>>>>>>> Yet when we pass the whole line in we get these phantom >>>>>>>>>>>> characters. >>>>>>>>>>>> >>>>>>>>>>>> Should PSM 10 mode work? It often returns “no character” where >>>>>>>>>>>> there clearly is one. I can supply a test case if it is expected >>>>>>>>>>>> to work >>>>>>>>>>>> well. >>>>>>>>>>>> >>>>>>>>>>>> On Fri, Jul 19, 2019 at 11:06 AM ElGato ElMago < >>>>>>>>>>>> elmago...@gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Lorenzo, >>>>>>>>>>>>> >>>>>>>>>>>>> We both have got the same case. It seems a solution to this >>>>>>>>>>>>> problem would save a lot of people. >>>>>>>>>>>>> >>>>>>>>>>>>> Shree, >>>>>>>>>>>>> >>>>>>>>>>>>> I pulled the current head of master branch but it doesn't seem >>>>>>>>>>>>> to contain the merges you pointed that have been merged 3 to 4 >>>>>>>>>>>>> days ago. >>>>>>>>>>>>> How can I get them? >>>>>>>>>>>>> >>>>>>>>>>>>> ElMagoElGato >>>>>>>>>>>>> >>>>>>>>>>>>> 2019年7月19日金曜日 17時02分53秒 UTC+9 Lorenzo Blz: >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> PSM 7 was a partial solution for my specific case, it >>>>>>>>>>>>>> improved the situation but did not solve it. Also I could not >>>>>>>>>>>>>> use it in >>>>>>>>>>>>>> some other cases. >>>>>>>>>>>>>> >>>>>>>>>>>>>> The proper solution is very likely doing more training with >>>>>>>>>>>>>> more data, some data augmentation might probably help if data is >>>>>>>>>>>>>> scarce. >>>>>>>>>>>>>> Also doing less training might help is the training is not >>>>>>>>>>>>>> done correctly. >>>>>>>>>>>>>> >>>>>>>>>>>>>> There are also similar issues on github: >>>>>>>>>>>>>> >>>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1465 >>>>>>>>>>>>>> ... 
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The LSTM engine works like this: it scans the image and, for each "pixel column", outputs something like this:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> M M M M N M M M [BLANK] F F F F
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (here I report only the highest-probability characters)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In the example above an M is partially seen as an N. This is normal, and another step of the algorithm (beam search, I think) tries to aggregate the columns back into the correct characters.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think cases like this:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> M M M N N N M M
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> are what give the phantom characters. More training should reduce the source of the problem, or a painful analysis of the bounding boxes might fix some cases.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I used the attached script for the boxes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Lorenzo
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 07:25, ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Let's call them phantom characters then.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Was psm 7 the solution for issue 1778? None of the psm options solved my problem, though I see different output.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I mostly use tesseract 5.0-alpha, but 4.1 showed the same results anyway. How did you get the bounding box for each character? Alto and lstmbox only show the bbox for a group of characters.
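Lorenzo's per-column picture can be made concrete with a small sketch. It applies the simplest possible collapse rule (merge consecutive repeats, then drop blanks) to the two column sequences quoted above. This is the textbook greedy rule for this kind of output, used here purely for illustration; it is not Tesseract's actual decoder, which, as the message says, seems to use a beam search over the probabilities. The function name is hypothetical.

```python
from itertools import groupby

BLANK = "[BLANK]"


def greedy_collapse(columns, blank=BLANK):
    """Merge runs of identical labels, then remove the blank label."""
    merged = [label for label, _ in groupby(columns)]
    return [label for label in merged if label != blank]


if __name__ == "__main__":
    first = "M M M M N M M M [BLANK] F F F F".split()
    second = "M M M N N N M M".split()
    print(greedy_collapse(first))   # ['M', 'N', 'M', 'F']
    print(greedy_collapse(second))  # ['M', 'N', 'M']
```

Under this naive rule even a single stray N column splits one M into "M N M", which is why a smarter aggregation step is needed, and why a longer run of N columns is harder for that step to explain away.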
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> ElMagoElGato >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2019年7月17日水曜日 18時58分31秒 UTC+9 Lorenzo Blz: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Phantom characters here for me too: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1778 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Are you using 4.1? Bounding boxes were fixed in 4.1 maybe >>>>>>>>>>>>>>>> this was also improved. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I wrote some code that uses symbols iterator to discard >>>>>>>>>>>>>>>> symbols that are clearly duplicated: too small, overlapping, >>>>>>>>>>>>>>>> etc. But it >>>>>>>>>>>>>>>> was not easy to make it work decently and it is not 100% >>>>>>>>>>>>>>>> reliable with >>>>>>>>>>>>>>>> false negatives and positives. I cannot share the code and it >>>>>>>>>>>>>>>> is quite ugly >>>>>>>>>>>>>>>> anyway. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Here there is another MRZ model with training data: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> https://github.com/DoubangoTelecom/tesseractMRZ >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Lorenzo >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Il giorno mer 17 lug 2019 alle ore 11:26 Claudiu < >>>>>>>>>>>>>>>> csaf...@gmail.com> ha scritto: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I’m getting the “phantom character” issue as well using >>>>>>>>>>>>>>>>> the OCRB that Shree trained on MRZ lines. For example for a 0 >>>>>>>>>>>>>>>>> it will >>>>>>>>>>>>>>>>> sometimes add both a 0 and an O to the output , thus >>>>>>>>>>>>>>>>> outputting 45 >>>>>>>>>>>>>>>>> characters total instead of 44. I haven’t looked at the >>>>>>>>>>>>>>>>> bounding box output >>>>>>>>>>>>>>>>> yet but I suspect a phantom thin character is added somewhere >>>>>>>>>>>>>>>>> that I can >>>>>>>>>>>>>>>>> discard .. or maybe two chars will have the same bounding >>>>>>>>>>>>>>>>> box. 
If anyone >>>>>>>>>>>>>>>>> else has fixed this issue further up (eg so the output >>>>>>>>>>>>>>>>> doesn’t contain the >>>>>>>>>>>>>>>>> phantom characters in the first place) id be interested. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:01 AM ElGato ElMago < >>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I'll go back to more of training later. Before doing so, >>>>>>>>>>>>>>>>>> I'd like to investigate results a little bit. The hocr and >>>>>>>>>>>>>>>>>> lstmbox options >>>>>>>>>>>>>>>>>> give some details of positions of characters. The results >>>>>>>>>>>>>>>>>> show positions >>>>>>>>>>>>>>>>>> that perfectly correspond to letters in the image. But the >>>>>>>>>>>>>>>>>> text output >>>>>>>>>>>>>>>>>> contains a character that obviously does not exist. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Then I found a config file 'lstmdebug' that generates far >>>>>>>>>>>>>>>>>> more information. I hope it explains what happened with >>>>>>>>>>>>>>>>>> each character. >>>>>>>>>>>>>>>>>> I'm yet to read the debug output but I'd appreciate it if >>>>>>>>>>>>>>>>>> someone could >>>>>>>>>>>>>>>>>> tell me how to read it because it's really complex. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>>>>>> ElMagoElGato >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> 2019年6月14日金曜日 19時58分49秒 UTC+9 shree: >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> See https://github.com/Shreeshrii/tessdata_MICR >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> I have uploaded my files there. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh >>>>>>>>>>>>>>>>>>> is the bash script that runs the training. >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> You can modify as needed. Please note this is for >>>>>>>>>>>>>>>>>>> legacy/base tesseract --oem 0. 
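The idea raised in this thread of discarding a phantom thin character by its bounding box can be sketched as follows. The snippet reads `ocrx_cinfo` spans in the form quoted elsewhere in the thread (`x_bboxes x0 y0 x1 y1; x_conf ...`) and flags boxes narrower than a threshold. The regular expression, the function names and the 5-pixel `min_width` are assumptions for illustration; a real filter would probably also need the overlap checks Lorenzo mentions, and would need tuning per image resolution.

```python
import html
import re

# Matches per-character spans in the form quoted in this thread.
CINFO = re.compile(
    r"<span class='ocrx_cinfo' title='x_bboxes\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+);"
    r"\s*x_conf\s+([\d.]+)'>(.*?)</span>"
)


def parse_cinfo(hocr_text):
    """Return a list of dicts with char, bbox (x0, y0, x1, y1) and conf."""
    chars = []
    for m in CINFO.finditer(hocr_text):
        x0, y0, x1, y1 = (int(m.group(i)) for i in range(1, 5))
        chars.append({
            "char": html.unescape(m.group(6)),  # hOCR may escape '<' as '&lt;'
            "bbox": (x0, y0, x1, y1),
            "conf": float(m.group(5)),
        })
    return chars


def flag_narrow(chars, min_width=5):
    """Return the characters whose box is narrower than min_width pixels."""
    return [c for c in chars if c["bbox"][2] - c["bbox"][0] < min_width]


if __name__ == "__main__":
    sample = (
        "<span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; "
        "x_conf 98.864532'><</span>\n"
        "<span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; "
        "x_conf 99.018097'>;</span>\n"
    )
    for c in flag_narrow(parse_cinfo(sample)):
        print(c["char"], c["bbox"])
```

On the two spans quoted in the thread this flags only the 3-pixel-wide `<`, which is the phantom in that case. Note that its confidence is as high as the real character's, so confidence alone would not catch it.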
>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago < >>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Thanks a lot, shree. It seems you know everything. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> I tried the MICR0.traineddata and the first two >>>>>>>>>>>>>>>>>>>> mcr.traineddata. The last one was blocked by the browser. >>>>>>>>>>>>>>>>>>>> Each of the >>>>>>>>>>>>>>>>>>>> traineddata had mixed results. All of them are getting >>>>>>>>>>>>>>>>>>>> symbols fairly good >>>>>>>>>>>>>>>>>>>> but getting spaces randomly and reading some numbers wrong. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> MICR0 seems the best among them. Did you suggest that >>>>>>>>>>>>>>>>>>>> you'd be able to update it? It gets tripple D very often >>>>>>>>>>>>>>>>>>>> where there's >>>>>>>>>>>>>>>>>>>> only one, and so on. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Also, I tried to fine tune from MICR0 but I found that >>>>>>>>>>>>>>>>>>>> I need to change the language-specific.sh. It specifies >>>>>>>>>>>>>>>>>>>> some parameters >>>>>>>>>>>>>>>>>>>> for each language. Do you have any guidance for it? >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> 2019年6月14日金曜日 1時48分40秒 UTC+9 shree: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> see >>>>>>>>>>>>>>>>>>>>> http://www.devscope.net/Content/ocrchecks.aspx >>>>>>>>>>>>>>>>>>>>> https://github.com/BigPino67/Tesseract-MICR-OCR >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago < >>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> That'll be nice if there's traineddata out there >>>>>>>>>>>>>>>>>>>>>> but I didn't find any. I see free fonts and commercial >>>>>>>>>>>>>>>>>>>>>> OCR software but >>>>>>>>>>>>>>>>>>>>>> not traineddata. 
Tessdata repository obviously doesn't >>>>>>>>>>>>>>>>>>>>>> have one, either. >>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>> 2019年6月8日土曜日 1時52分10秒 UTC+9 shree: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> Please also search for existing MICR traineddata >>>>>>>>>>>>>>>>>>>>>>> files. >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago < >>>>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> So I did several tests from scratch. In the last >>>>>>>>>>>>>>>>>>>>>>>> attempt, I made a training text with 4,000 lines in >>>>>>>>>>>>>>>>>>>>>>>> the following format, >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> 110004310510< <02 :4002=0181:801= 0008752 <00039 >>>>>>>>>>>>>>>>>>>>>>>> ;0000001000; >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> and combined it with eng.digits.training_text in >>>>>>>>>>>>>>>>>>>>>>>> which symbols are converted to E13B symbols. This >>>>>>>>>>>>>>>>>>>>>>>> makes about 12,000 lines >>>>>>>>>>>>>>>>>>>>>>>> of training text. It's amazing that this thing >>>>>>>>>>>>>>>>>>>>>>>> generates a good reader out >>>>>>>>>>>>>>>>>>>>>>>> of nowhere. But then it is not very good. For >>>>>>>>>>>>>>>>>>>>>>>> example: >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> <01 :1901=1386:021= 1111001<10001< ;0000090134; >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> is a result on the image attached. It's close but >>>>>>>>>>>>>>>>>>>>>>>> the last '<' in the result text doesn't exist on the >>>>>>>>>>>>>>>>>>>>>>>> image. It's a small >>>>>>>>>>>>>>>>>>>>>>>> failure but it causes a greater trouble in parsing. >>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>> What would you suggest from here to increase >>>>>>>>>>>>>>>>>>>>>>>> accuracy? 
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> - Increase the number of lines in the training text
>>>>>>>>>>>>>>>>>>>>>>>> - Mix up more variations in the training text
>>>>>>>>>>>>>>>>>>>>>>>> - Increase the number of iterations
>>>>>>>>>>>>>>>>>>>>>>>> - Investigate wrong reads one by one
>>>>>>>>>>>>>>>>>>>>>>>> - Or something else?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Also, I referred to engrestrict*.* and could generate a similar result with the fine-tuning-from-full method. It seems a bit faster to get to the same level, but it also stops at a 'good' level. I can go either way if it takes me to a bright future.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 3:56:02 PM UTC+9, ElGato ElMago wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks a lot, Shree. I'll look into it.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 2:39:52 PM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> See https://github.com/Shreeshrii/tessdata_shreetest
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Look at the files engrestrict*.* and also https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Create training text of about 100 lines and finetune for 400 lines
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I had about 14 lines as attached.
How many >>>>>>>>>>>>>>>>>>>>>>>>>>> lines would you recommend? >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> Fine tuning gives much better result but it >>>>>>>>>>>>>>>>>>>>>>>>>>> tends to pick other character than in E13B that >>>>>>>>>>>>>>>>>>>>>>>>>>> only has 14 characters, 0 >>>>>>>>>>>>>>>>>>>>>>>>>>> through 9 and 4 symbols. I thought training from >>>>>>>>>>>>>>>>>>>>>>>>>>> scratch would eliminate >>>>>>>>>>>>>>>>>>>>>>>>>>> such confusion. >>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>> 2019年5月30日木曜日 10時43分08秒 UTC+9 shree: >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> For training from scratch a large training text >>>>>>>>>>>>>>>>>>>>>>>>>>>> and hundreds of thousands of iterations are >>>>>>>>>>>>>>>>>>>>>>>>>>>> recommended. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> If you are just fine tuning for a font try to >>>>>>>>>>>>>>>>>>>>>>>>>>>> follow instructions for training for impact, with >>>>>>>>>>>>>>>>>>>>>>>>>>>> your font. >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago, < >>>>>>>>>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote: >>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks, Shree. >>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I saw the instruction. 
The steps I made are as follows:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Using tesstrain.sh:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --noextract_font_properties --langdata_dir ../langdata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --tessdata_dir ./tessdata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Training from scratch:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Test with base_checkpoint:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Combining output files:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --stop_training \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Test with eng.traineddata:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The training from scratch ended as:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, word train=0%, skip ratio=0%, New best char error = 0 wrote best model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint wrote checkpoint.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The test with base_checkpoint returns nothing, as follows:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The test with eng.traineddata and e13b.png returns out.txt. Both files are attached.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Training seems to have worked fine. I don't know how to interpret the test result from base_checkpoint. The generated eng.traineddata obviously doesn't work well. I suspect the choice of --traineddata when combining the output files is bad, but I have no clue.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> BTW, I referred to your tess4training in the process. It helped a lot.
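To follow how the error rates move over a run like the one in this message, the progress lines in basetrain.log can be pulled apart with a regular expression. This is a minimal sketch that assumes every progress line has the same shape as the single line quoted in the message; the function name is hypothetical, and the three slash-separated iteration counters are kept as-is without interpreting them.

```python
import re

# Pattern modelled on the one progress line quoted in this thread.
PROGRESS = re.compile(
    r"At iteration (\d+)/(\d+)/(\d+), Mean rms=([\d.]+)%, delta=([\d.]+)%, "
    r"char train=([\d.]+)%, word train=([\d.]+)%, skip ratio=([\d.]+)%"
)


def parse_progress(log_text):
    """Return one dict per progress line found in the log text."""
    rows = []
    for m in PROGRESS.finditer(log_text):
        rows.append({
            "iterations": tuple(int(m.group(i)) for i in (1, 2, 3)),
            "mean_rms": float(m.group(4)),
            "delta": float(m.group(5)),
            "char_train": float(m.group(6)),
            "word_train": float(m.group(7)),
            "skip_ratio": float(m.group(8)),
        })
    return rows


if __name__ == "__main__":
    line = (
        "At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, "
        "char train=0%, word train=0%, skip ratio=0%, New best char error = 0"
    )
    for row in parse_progress(line):
        print(row["iterations"], row["char_train"], row["word_train"])
```

A char train value of 0% on a training text of only about 14 lines, evaluated on the same files it was trained on, says little about how the model reads a real image.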
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday, May 29, 2019 at 7:14:08 PM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/01d5a358-e151-40dc-9662-f6d604c334a2%40googlegroups.com.