OK, I'll do so. I need to reorganize the naming and so on a little bit. It will be out there soon.
On Wednesday, August 7, 2019 at 21:11:01 UTC+9, Mamadou wrote:

On Wednesday, August 7, 2019 at 2:36:52 AM UTC+2, ElGato ElMago wrote:

Hi,

I'm thinking of sharing it, of course. What is the best way to do it? After all this, my own contribution is only how I prepared the training text, and even that consists of Shree's text and mine. The instructions and tools I used already exist.

Mamadou: If you have a GitHub account, just create a repo and publish the data and instructions.

ElMagoElGato

On Wednesday, August 7, 2019 at 8:20:02 UTC+9, Mamadou wrote:

Hello,

Are you planning to release the dataset or models? I'm working on the same subject and planning to share both under BSD terms.

On Tuesday, August 6, 2019 at 10:11:40 AM UTC+2, ElGato ElMago wrote:

Hi,

FWIW, I got to the point where I can feel happy with the accuracy. As the images in the previous post show, the symbols, especially the on-us symbol and the amount symbol, were being confused with each other or with other characters.

I added many more symbols to the training text and formed words that start with a symbol. One example is as follows:

9;:;=;<;< <0<1<3<4;6;8;9;:;=;

I randomly made 8,000 lines like this. In fine-tuning from eng, 5,000 iterations were almost enough. The amount symbol is still confused a little when it's followed by a 0. Fine-tuning tends to be dragged around by small particles. I'll have to think of something to make further improvements.

Training from scratch produced somewhat more stable traineddata. It doesn't get confused by symbols so often, but it tends to generate extra spaces. By 10,000 iterations, those spaces are gone and recognition became very solid.

I thought I might have to do image and box file training, but I guess it's not needed this time.
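The line-generation step described above can be sketched in Python. This is a guess at the procedure, not ElMago's actual script; the placeholder characters : ; < = for the four E13B symbols are taken from the sample line.

```python
import random

DIGITS = "0123456789"
SYMBOLS = ":;<="  # placeholder chars for the four E13B symbols, per the sample line

def make_word(rng, length=8):
    # Start every word with a symbol, as described above,
    # then mix digits and symbols randomly.
    chars = [rng.choice(SYMBOLS)]
    chars += [rng.choice(SYMBOLS + DIGITS) for _ in range(length - 1)]
    return "".join(chars)

def make_training_text(n_lines=8000, words_per_line=3, seed=0):
    rng = random.Random(seed)
    return [" ".join(make_word(rng) for _ in range(words_per_line))
            for _ in range(n_lines)]

for line in make_training_text(n_lines=3):
    print(line)
```

Word length and words per line are arbitrary knobs here; the resulting file is used as a normal --training_text input.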
ElMagoElGato

On Friday, July 26, 2019 at 14:08:06 UTC+9, ElGato ElMago wrote:

Hi,

Well, I read the description of ScrollView (https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging) and it says:

To show the characters, deselect DISPLAY/Bounding Boxes, select DISPLAY/Polygonal Approx and then select OTHER/Uniform display.

It basically works, but for some reason it doesn't work on my e13b image and ends up with a blue screen. Anyway, it shows each box separately when a character consists of multiple boxes. I'd like to show the box for the whole character. ScrollView doesn't do that, at least yet. I'll do it on my own.

ElMagoElGato

On Wednesday, July 24, 2019 at 14:10:46 UTC+9, ElGato ElMago wrote:

Hi,

I got this result from hocr. This is where one of the phantom characters comes from.

<span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 98.864532'><</span>
<span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; x_conf 99.018097'>;</span>

The first character is the phantom. It starts at the same x position as the second character, which does exist. The first character is only 3 points wide. I attach ScrollView screenshots that visualize this.

[image: 2019-07-24-132643_854x707_scrot.png][image: 2019-07-24-132800_854x707_scrot.png]

There seem to be some more cases that cause phantom characters. I'll look into them. But I have a trivial question now. I made ScrollView show these displays by accidentally clicking the Display->Blamer menu. There is a Bounding Boxes menu below it, but it ends up showing a blue screen, though it briefly shows boxes on the way. Can I use this menu at all? It would be very useful.
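The two ocrx_cinfo spans above already contain enough information to spot this phantom mechanically: the box is only a few pixels wide and starts at the same x position as the next, real character. A hedged sketch of such a filter; the regex and thresholds are mine, not anything built into Tesseract.

```python
import html
import re

# Pull character, bbox and confidence out of hocr ocrx_cinfo spans.
SPAN_RE = re.compile(
    r"x_bboxes (\d+) (\d+) (\d+) (\d+); x_conf ([\d.]+)'>(.*?)</span>")

def parse_cinfo(hocr):
    chars = []
    for m in SPAN_RE.finditer(hocr):
        x0, y0, x1, y1 = map(int, m.group(1, 2, 3, 4))
        chars.append({"char": html.unescape(m.group(6)),
                      "box": (x0, y0, x1, y1),
                      "conf": float(m.group(5))})
    return chars

def drop_phantoms(chars, min_width=5):
    kept = []
    for i, c in enumerate(chars):
        x0, _, x1, _ = c["box"]
        nxt = chars[i + 1] if i + 1 < len(chars) else None
        too_thin = (x1 - x0) < min_width
        # Phantom signature: a sliver starting exactly where the next char starts.
        if too_thin and nxt is not None and x0 == nxt["box"][0]:
            continue
        kept.append(c)
    return kept

hocr = """
<span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 98.864532'>&lt;</span>
<span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; x_conf 99.018097'>;</span>
"""
print([c["char"] for c in drop_phantoms(parse_cinfo(hocr))])  # the 3-px '<' is dropped
```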
[image: 2019-07-24-140739_854x707_scrot.png]

On Tuesday, July 23, 2019 at 17:10:36 UTC+9, ElGato ElMago wrote:

It's great! Perfect! Thanks a lot!

On Tuesday, July 23, 2019 at 10:56:58 UTC+9, shree wrote:

See https://github.com/tesseract-ocr/tesseract/issues/2580

On Tue, 23 Jul 2019, 06:23 ElGato ElMago, <elmago...@gmail.com> wrote:

Hi,

I read the output of hocr with lstm_choice_mode = 4, as in pull request 2554. It shows the candidates for each character but doesn't show the bounding box of each character. It only shows the box for a whole word.

I see bounding boxes of individual characters in the comments of pull request 2576. How can I do that? Do I have to look in the source code and produce such output on my own?

On Friday, July 19, 2019 at 18:40:49 UTC+9, ElGato ElMago wrote:

Lorenzo,

I haven't been checking psm too much. I will turn to those options after I see how it goes with bounding boxes.

Shree,

I see the merges in the git log and also see that the new option lstm_choice_amount works now. I guess my executable is the latest, though I still see the phantom character. Hocr produces huge and complex output. I'll take some time to read it.

On Friday, July 19, 2019 at 18:20:55 UTC+9, Claudiu wrote:

Is there any way to pass bounding boxes to the LSTM? We have an algorithm that cleanly gets bounding boxes of MRZ characters. However, the results using psm 10 are worse than passing the whole line in. Yet when we pass the whole line in, we get these phantom characters.

Should PSM 10 mode work? It often returns "no character" where there clearly is one. I can supply a test case if it is expected to work well.

On Fri, Jul 19, 2019 at 11:06 AM ElGato ElMago <elmago...@gmail.com> wrote:

Lorenzo,

We both have the same case. It seems a solution to this problem would help a lot of people.

Shree,

I pulled the current head of the master branch, but it doesn't seem to contain the merges you pointed to, which were merged 3 to 4 days ago. How can I get them?

ElMagoElGato

On Friday, July 19, 2019 at 17:02:53 UTC+9, Lorenzo Blz wrote:

PSM 7 was a partial solution for my specific case; it improved the situation but did not solve it. Also, I could not use it in some other cases.

The proper solution is very likely doing more training with more data; some data augmentation might help if data is scarce. Also, doing less training might help if the training is not done correctly.

There are also similar issues on GitHub:

https://github.com/tesseract-ocr/tesseract/issues/1465
...
The LSTM engine works like this: it scans the image, and for each "pixel column" it emits something like:

M M M M N M M M [BLANK] F F F F

(here I report only the highest-probability characters)

In the example above, an M is partially seen as an N. This is normal, and another step of the algorithm (beam search, I think) tries to aggregate the correct characters back together.

I think cases like this:

M M M N N N M M

are what give the phantom characters. More training should reduce the source of the problem, or a painful analysis of the bounding boxes might fix some cases.

I used the attached script for the boxes.

Lorenzo

On Fri, Jul 19, 2019 at 07:25, ElGato ElMago <elmago...@gmail.com> wrote:

Hi,

Let's call them phantom characters then.

Was psm 7 the solution for issue 1778? None of the psm options solved my problem, though I see different output.

I use tesseract 5.0-alpha mostly, but 4.1 showed the same results anyway. How did you get a bounding box for each character? Alto and lstmbox only show a bbox for a group of characters.

ElMagoElGato

On Wednesday, July 17, 2019 at 18:58:31 UTC+9, Lorenzo Blz wrote:

Phantom characters here for me too:

https://github.com/tesseract-ocr/tesseract/issues/1778

Are you using 4.1?
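Lorenzo's per-column picture can be reproduced with a few lines of Python. Note that a plain best-path collapse (merge adjacent repeats, drop [BLANK]) already turns a sustained flicker into a phantom; the real decoder's beam search usually rescues the brief single-column case, but the `M M M N N N M M` case defeats it too.

```python
BLANK = "[BLANK]"

def collapse(columns):
    # Naive best-path (greedy CTC-style) decoding:
    # merge adjacent repeats, then drop blanks.
    out, prev = [], None
    for c in columns:
        if c != prev and c != BLANK:
            out.append(c)
        prev = c
    return "".join(out)

print(collapse("M M M M N M M M [BLANK] F F F F".split()))  # "MNMF"
print(collapse("M M M N N N M M".split()))                  # "MNM" - phantom N
```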
Bounding boxes were fixed in 4.1; maybe this was also improved.

I wrote some code that uses the symbols iterator to discard symbols that are clearly duplicated: too small, overlapping, etc. But it was not easy to make it work decently, and it is not 100% reliable, with false negatives and positives. I cannot share the code, and it is quite ugly anyway.

Here is another MRZ model with training data:

https://github.com/DoubangoTelecom/tesseractMRZ

Lorenzo

On Wed, Jul 17, 2019 at 11:26, Claudiu <csaf...@gmail.com> wrote:

I'm getting the "phantom character" issue as well, using the OCRB that Shree trained on MRZ lines. For example, for a 0 it will sometimes add both a 0 and an O to the output, thus outputting 45 characters total instead of 44. I haven't looked at the bounding box output yet, but I suspect a phantom thin character is added somewhere that I can discard, or maybe two chars will have the same bounding box. If anyone else has fixed this issue further up (e.g. so the output doesn't contain the phantom characters in the first place), I'd be interested.

On Wed, Jul 17, 2019 at 10:01 AM ElGato ElMago <elmago...@gmail.com> wrote:

Hi,

I'll go back to more training later.
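Lorenzo's discard-duplicates idea and Claudiu's 0/O case can be sketched without the Tesseract API: given (char, bbox, confidence) tuples such as the symbol iterator yields, drop slivers and keep the more confident of two heavily overlapping symbols. The thresholds here are illustrative assumptions, not Lorenzo's values.

```python
def overlap_frac(a, b):
    """Horizontal overlap of box b with box a, as a fraction of b's width."""
    inter = min(a[2], b[2]) - max(a[0], b[0])
    width = b[2] - b[0]
    return max(0, inter) / width if width else 1.0

def filter_symbols(symbols, min_width=4, max_overlap=0.6):
    # symbols: list of (char, (x0, y0, x1, y1), confidence)
    kept = []
    for char, box, conf in symbols:
        if box[2] - box[0] < min_width:
            continue  # too thin to be a real character
        if kept and overlap_frac(kept[-1][1], box) > max_overlap:
            # Mostly on top of the previous symbol: keep the more confident one.
            if conf > kept[-1][2]:
                kept[-1] = (char, box, conf)
            continue
        kept.append((char, box, conf))
    return kept

symbols = [
    ("0", (10, 0, 30, 40), 96.0),
    ("O", (12, 0, 28, 40), 55.0),  # duplicate reading of the same glyph
    ("4", (35, 0, 55, 40), 97.0),
]
print("".join(c for c, _, _ in filter_symbols(symbols)))  # "04"
```

As Lorenzo says, this kind of post-filter is fiddly and produces both false positives and negatives; more training attacks the cause rather than the symptom.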
Before doing so, I'd like to investigate the results a little bit. The hocr and lstmbox options give some details of the positions of characters. The results show positions that perfectly correspond to letters in the image. But the text output contains a character that obviously does not exist.

Then I found a config file, 'lstmdebug', that generates far more information. I hope it explains what happened with each character. I'm yet to read the debug output, but I'd appreciate it if someone could tell me how to read it, because it's really complex.

Regards,
ElMagoElGato

On Friday, June 14, 2019 at 19:58:49 UTC+9, shree wrote:

See https://github.com/Shreeshrii/tessdata_MICR

I have uploaded my files there.

https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh is the bash script that runs the training.

You can modify it as needed. Please note this is for legacy/base tesseract --oem 0.

On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago <elmago...@gmail.com> wrote:

Thanks a lot, Shree. It seems you know everything.

I tried the MICR0.traineddata and the first two mcr.traineddata. The last one was blocked by the browser. Each of the traineddata files had mixed results. All of them get the symbols fairly well but insert spaces randomly and read some numbers wrong.

MICR0 seems the best among them. Did you suggest that you'd be able to update it? It gets triple D very often where there's only one, and so on.

Also, I tried to fine-tune from MICR0, but I found that I need to change language-specific.sh. It specifies some parameters for each language. Do you have any guidance for it?

On Friday, June 14, 2019 at 1:48:40 UTC+9, shree wrote:

See
http://www.devscope.net/Content/ocrchecks.aspx
https://github.com/BigPino67/Tesseract-MICR-OCR
https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ

On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <elmago...@gmail.com> wrote:

It would be nice if there were traineddata out there, but I didn't find any. I see free fonts and commercial OCR software but no traineddata. The tessdata repository obviously doesn't have one, either.

On Saturday, June 8, 2019 at 1:52:10 UTC+9, shree wrote:

Please also search for existing MICR traineddata files.

On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <elmago...@gmail.com> wrote:

So I did several tests from scratch.
In the last attempt, I made a training text with 4,000 lines in the following format:

110004310510< <02 :4002=0181:801= 0008752 <00039 ;0000001000;

and combined it with eng.digits.training_text, in which the symbols are converted to E13B symbols. This makes about 12,000 lines of training text. It's amazing that this thing generates a good reader out of nowhere. But then, it is not very good. For example:

<01 :1901=1386:021= 1111001<10001< ;0000090134;

is the result on the attached image. It's close, but the last '<' in the result text doesn't exist in the image. It's a small failure, but it causes greater trouble in parsing.

What would you suggest from here to increase accuracy?

- Increase the number of lines in the training text
- Mix more variations into the training text
- Increase the number of iterations
- Investigate wrong reads one by one
- Or something else?

Also, I referred to engrestrict*.* and could generate a similar result with the fine-tuning-from-full method. It seems a bit faster to get to the same level, but it also stops at a 'good' level. I can go either way if it takes me to the bright future.

Regards,
ElMagoElGato

On Thursday, May 30, 2019 at 15:56:02 UTC+9, ElGato ElMago wrote:

Thanks a lot, Shree. I'll look into it.

On Thursday, May 30, 2019 at 14:39:52 UTC+9, shree wrote:

See https://github.com/Shreeshrii/tessdata_shreetest

Look at the files engrestrict*.* and also
https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text

Create a training text of about 100 lines and finetune for 400 lines.

On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <elmago...@gmail.com> wrote:

I had about 14 lines, as attached. How many lines would you recommend?

Fine-tuning gives a much better result, but it tends to pick characters other than those in E13B, which only has 14 characters: 0 through 9 and 4 symbols. I thought training from scratch would eliminate such confusion.
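The "symbols converted to E13B symbols" step on eng.digits.training_text can be done with a simple character translation. The punctuation-to-MICR mapping below is a guess for illustration, not the mapping ElMago actually used.

```python
# Hypothetical mapping from digits-text punctuation to the four
# E13B placeholder characters used in this thread.
E13B_MAP = str.maketrans({",": ":", ".": ";", "/": "<", "-": "="})
E13B_SET = set("0123456789:;<= ")

def to_e13b(line):
    converted = line.translate(E13B_MAP)
    # Keep only the 14-character E13B set (plus spaces).
    return "".join(c for c in converted if c in E13B_SET)

print(to_e13b("402.00 4,018/19-81"))  # "402;00 4:018<19=81"
```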
On Thursday, May 30, 2019 at 10:43:08 UTC+9, shree wrote:

For training from scratch, a large training text and hundreds of thousands of iterations are recommended.

If you are just fine-tuning for a font, try to follow the instructions for training for impact, with your font.

On Thu, 30 May 2019, 06:05 ElGato ElMago, <elmago...@gmail.com> wrote:

Thanks, Shree.

Yes, I saw the instructions. The steps I took are as follows:

Using tesstrain.sh:

    src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
      --noextract_font_properties --langdata_dir ../langdata \
      --tessdata_dir ./tessdata \
      --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
      --training_text ../langdata/eng/eng.training_e13b_text

Training from scratch:

    mkdir -p ~/tesstutorial/e13boutput
    src/training/lstmtraining --debug_interval 100 \
      --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
      --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
      --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
      --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
      --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
      --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log

Test with base_checkpoint:

    src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
      --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
      --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt

Combining output files:

    src/training/lstmtraining --stop_training \
      --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
      --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
      --model_output ~/tesstutorial/e13boutput/eng.traineddata

Test with eng.traineddata:

    tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput

The training from scratch ended with:

    At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, word train=0%, skip ratio=0%, New best char error = 0 wrote best model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint wrote checkpoint.

The test with base_checkpoint returns nothing but:

    At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0

The test with eng.traineddata and e13b.png returns out.txt. Both files are attached.

Training seems to have worked fine. I don't know how to interpret the test result from base_checkpoint. The generated eng.traineddata obviously doesn't work well. I suspect the choice of --traineddata when combining the output files is bad, but I have no clue.

Regards,
ElMagoElGato

BTW, I referred to your tess4training in the process. It helped a lot.
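The lstmtraining progress lines quoted above have a fixed shape, so a tiny parser makes it easy to watch whether the char error keeps falling across a long log. The regex and field names here are mine, informal conveniences rather than anything from the Tesseract tools.

```python
import re

LOG_RE = re.compile(
    r"At iteration (\d+)/(\d+)/(\d+), Mean rms=([\d.]+)%.*?char train=([\d.]+)%")

def parse_log(text):
    return [{"learning_iter": int(m.group(1)),
             "training_iter": int(m.group(3)),
             "mean_rms": float(m.group(4)),
             "char_train_err": float(m.group(5))}
            for m in LOG_RE.finditer(text)]

sample = ("At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, "
          "char train=0%, word train=0%, skip ratio=0%,")
print(parse_log(sample))
```

Usage: parse_log(open("basetrain.log").read()) over the log written by the training run above.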
On Wednesday, May 29, 2019 at 19:14:08 UTC+9, shree wrote:

See https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files

On Wed, May 29, 2019 at 3:18 PM ElGato ElMago <elmago...@gmail.com> wrote:

Hi,

I wish to make traineddata for the E13B font.

I read the training tutorial and made a base_checkpoint file according to the method in Training From Scratch. Now, how can I make traineddata from the base_checkpoint file?