I added eng.traineddata and LICENSE. I used my account name in the license file; I'm not sure whether that's appropriate, so please tell me if it isn't.
On Friday, August 9, 2019 at 16:17:41 UTC+9, Mamadou wrote:
> On Friday, August 9, 2019 at 7:31:03 AM UTC+2, ElGato ElMago wrote:
>> Here's what I'm sharing on GitHub. Hope it's of use to somebody.
>> https://github.com/ElMagoElGato/tess_e13b_training
> Thanks for sharing your experience with us.
> Is it possible to share your Tesseract model (xxx.traineddata)?
> We're building a dataset using real-life images, like what we have already done for MRZ (https://github.com/DoubangoTelecom/tesseractMRZ/tree/master/dataset). Your model would help us automate the annotation and will speed up our development. Of course we'll have to manually correct the annotations, but it will be faster for us.
> Also, please add a license to your repo so that we know if we have the right to use it.
>> On Thursday, August 8, 2019 at 9:35:17 UTC+9, ElGato ElMago wrote:
>>> OK, I'll do so. I need to reorganize naming and so on a little bit. It will be out there soon.
>>> On Wednesday, August 7, 2019 at 21:11:01 UTC+9, Mamadou wrote:
>>>> On Wednesday, August 7, 2019 at 2:36:52 AM UTC+2, ElGato ElMago wrote:
>>>>> Hi,
>>>>> I'm thinking of sharing it, of course. What is the best way to do it? After all this, my contribution is only how I prepared the training text, and even that consists of Shree's text and mine. The instructions and tools I used already exist.
>>>> If you have a GitHub account, just create a repo and publish the data and instructions.
>>>>> ElMagoElGato
>>>>> On Wednesday, August 7, 2019 at 8:20:02 UTC+9, Mamadou wrote:
>>>>>> Hello,
>>>>>> Are you planning to release the dataset or models?
>>>>>> I'm working on the same subject and planning to share both under BSD terms.
>>>>>> On Tuesday, August 6, 2019 at 10:11:40 AM UTC+2, ElGato ElMago wrote:
>>>>>>> Hi,
>>>>>>> FWIW, I got to the point where I can feel happy with the accuracy.
>>>>>>> As the images in the previous post show, the symbols, especially the on-us symbol and the amount symbol, were getting mixed up with each other or with other characters. I added many more symbols to the training text and formed words that start with a symbol. One example is as follows:
>>>>>>> 9;:;=;<;< <0<1<3<4;6;8;9;:;=;
>>>>>>> I randomly made 8,000 lines like this. In fine-tuning from eng, 5,000 iterations were almost enough. The amount symbol is still confused a little when it's followed by a 0. Fine-tuning tends to get dragged around by small details. I'll have to think of something to make further improvements.
>>>>>>> Training from scratch produced slightly more stable traineddata. It doesn't get confused by symbols so often, but it tends to generate extra spaces. By 10,000 iterations, those spaces were gone and recognition became very solid.
>>>>>>> I thought I might have to do image and box file training, but I guess it's not needed this time.
>>>>>>> ElMagoElGato
>>>>>>> On Friday, July 26, 2019 at 14:08:06 UTC+9, ElGato ElMago wrote:
>>>>>>>> Hi,
>>>>>>>> Well, I read the description of ScrollView (https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging) and it says:
>>>>>>>> To show the characters, deselect DISPLAY/Bounding Boxes, select DISPLAY/Polygonal Approx and then select OTHER/Uniform display.
>>>>>>>> It basically works. But for some reason, it doesn't work on my e13b image and ends up with a blue screen. Anyway, it shows each box separately when a character consists of multiple boxes. I'd like to show the box for the whole character. ScrollView doesn't do that, at least not yet. I'll do it on my own.
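For what it's worth, the randomized symbol-seeded training text described above could be generated with something like this (my own sketch, not the script actually used; it assumes the four E13B symbols are transcribed as the ASCII characters ';', ':', '<' and '=', as in the sample line):

```python
import random

DIGITS = "0123456789"
SYMBOLS = ";:<="  # assumed ASCII stand-ins for the four E13B symbols

def make_word(max_extra=8):
    # Start each word with a symbol so the model also sees symbols in the
    # leading position, then append a random mix of digits and symbols.
    chars = [random.choice(SYMBOLS)]
    for _ in range(random.randint(1, max_extra)):
        chars.append(random.choice(DIGITS + SYMBOLS))
    return "".join(chars)

def make_training_text(n_lines=8000, words_per_line=4):
    # One training-text line per output line, e.g. "9;:;=;<;< <0<1<3<4;6;8;..."
    return "\n".join(
        " ".join(make_word() for _ in range(words_per_line))
        for _ in range(n_lines))
```

Varying the word count and length per line would give the text2image step more layout variety.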
>>>>>>>> ElMagoElGato
>>>>>>>> On Wednesday, July 24, 2019 at 14:10:46 UTC+9, ElGato ElMago wrote:
>>>>>>>>> Hi,
>>>>>>>>> I got this result from hocr. This is where one of the phantom characters comes from.
>>>>>>>>> <span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 98.864532'><</span>
>>>>>>>>> <span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; x_conf 99.018097'>;</span>
>>>>>>>>> The first character is the phantom. It starts at the same x position as the second character, which does exist; the phantom is only 3 points wide. I attach ScrollView screenshots that visualize this.
>>>>>>>>> [image: 2019-07-24-132643_854x707_scrot.png][image: 2019-07-24-132800_854x707_scrot.png]
>>>>>>>>> There seem to be some more cases that cause phantom characters. I'll look into them. But I have a trivial question now. I made ScrollView show these displays by accidentally clicking the Display->Blamer menu. There is a Bounding Boxes menu below, but it ends up showing a blue screen, though it briefly shows boxes on the way. Can I use this menu at all? It would be very useful.
>>>>>>>>> [image: 2019-07-24-140739_854x707_scrot.png]
>>>>>>>>> On Tuesday, July 23, 2019 at 17:10:36 UTC+9, ElGato ElMago wrote:
>>>>>>>>>> It's great! Perfect! Thanks a lot!
>>>>>>>>>> On Tuesday, July 23, 2019 at 10:56:58 UTC+9, shree wrote:
>>>>>>>>>>> See https://github.com/tesseract-ocr/tesseract/issues/2580
>>>>>>>>>>> On Tue, 23 Jul 2019, 06:23 ElGato ElMago, <elmago...@gmail.com> wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> I read the output of hocr with lstm_choice_mode = 4 per pull request 2554.
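Those ocrx_cinfo spans are easy to mine for suspiciously thin boxes like the 3-point phantom '<'. A sketch (it assumes exactly the title format quoted above; real hocr output HTML-escapes '<' as &lt;, which this toy regex does not handle):

```python
import re

# Matches spans of the form:
# <span class='ocrx_cinfo' title='x_bboxes x0 y0 x1 y1; x_conf c'>ch</span>
SPAN_RE = re.compile(
    r"<span class='ocrx_cinfo' title='x_bboxes (\d+) (\d+) (\d+) (\d+); "
    r"x_conf ([\d.]+)'>(.*?)</span>")

def parse_cinfo(hocr):
    # Extract per-character boxes, confidences, and widths.
    out = []
    for m in SPAN_RE.finditer(hocr):
        x0, y0, x1, y1 = map(int, m.group(1, 2, 3, 4))
        out.append({"char": m.group(6), "conf": float(m.group(5)),
                    "box": (x0, y0, x1, y1), "width": x1 - x0})
    return out

def flag_phantoms(chars, min_width=5):
    # A box only a few pixels wide (like the 3-point '<' above) is
    # likely a phantom duplicate of its neighbour; min_width is a
    # threshold you would tune for your image resolution.
    return [c for c in chars if c["width"] < min_width]
```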
It shows the candidates for each character but doesn't show the bounding box of each character. It only shows the box for a whole word.
>>>>>>>>>>>> I see bounding boxes for each character in the comments of pull request 2576. How can I do that? Do I have to look in the source code and produce such output on my own?
>>>>>>>>>>>> On Friday, July 19, 2019 at 18:40:49 UTC+9, ElGato ElMago wrote:
>>>>>>>>>>>>> Lorenzo,
>>>>>>>>>>>>> I haven't been checking psm too much. I will turn to those options after I see how it goes with bounding boxes.
>>>>>>>>>>>>> Shree,
>>>>>>>>>>>>> I see the merges in the git log and also see that the new option lstm_choice_amount works now. I guess my executable is the latest, though I still see the phantom character. hocr makes huge and complex output; I'll take some time to read it.
>>>>>>>>>>>>> On Friday, July 19, 2019 at 18:20:55 UTC+9, Claudiu wrote:
>>>>>>>>>>>>>> Is there any way to pass bounding boxes for the LSTM to use? We have an algorithm that cleanly gets bounding boxes of MRZ characters. However, the results using psm 10 are worse than passing the whole line in. Yet when we pass the whole line in, we get these phantom characters.
>>>>>>>>>>>>>> Should PSM 10 mode work? It often returns “no character” where there clearly is one. I can supply a test case if it is expected to work well.
>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 11:06 AM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>> Lorenzo,
>>>>>>>>>>>>>>> We both have got the same case.
It seems a solution to this problem would help a lot of people.
>>>>>>>>>>>>>>> Shree,
>>>>>>>>>>>>>>> I pulled the current head of the master branch, but it doesn't seem to contain the merges you pointed to that were merged 3 to 4 days ago. How can I get them?
>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>> On Friday, July 19, 2019 at 17:02:53 UTC+9, Lorenzo Blz wrote:
>>>>>>>>>>>>>>>> PSM 7 was a partial solution for my specific case; it improved the situation but did not solve it. Also, I could not use it in some other cases.
>>>>>>>>>>>>>>>> The proper solution is very likely doing more training with more data; some data augmentation might help if data is scarce. Also, doing less training might help if the training is not done correctly.
>>>>>>>>>>>>>>>> There are also similar issues on GitHub:
>>>>>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1465
>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>> The LSTM engine works like this: it scans the image and for each "pixel column" outputs this:
>>>>>>>>>>>>>>>> M M M M N M M M [BLANK] F F F F
>>>>>>>>>>>>>>>> (here I report only the highest-probability characters)
>>>>>>>>>>>>>>>> In the example above, an M is partially seen as an N. This is normal, and another step of the algorithm (beam search, I think) tries to aggregate the correct characters back.
>>>>>>>>>>>>>>>> I think cases like this:
>>>>>>>>>>>>>>>> M M M N N N M M
>>>>>>>>>>>>>>>> are what give the phantom characters.
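The collapse step described above can be illustrated with a CTC-style greedy decoder (a toy illustration, not Tesseract's actual beam search): runs of the same label merge, and blanks separate genuine repeats, so a run like M M M N N N M M yields an extra N.

```python
def ctc_greedy_collapse(columns, blank="[BLANK]"):
    # Collapse per-column best guesses CTC-style: merge runs of the
    # same label and drop blanks. A repeated character survives only
    # when a blank separates the two runs.
    out = []
    prev = None
    for label in columns:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return "".join(out)
```

This is why a brief mid-glyph misreading (N columns inside an M) decodes to a phantom character, while the blank before F keeps it as a single F.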
More training should reduce the source of the problem, or a painful analysis of the bounding boxes might fix some cases.
>>>>>>>>>>>>>>>> I used the attached script for the boxes.
>>>>>>>>>>>>>>>> Lorenzo
>>>>>>>>>>>>>>>> On Fri, 19 Jul 2019 at 07:25, ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>> Let's call them phantom characters then.
>>>>>>>>>>>>>>>>> Was psm 7 the solution for issue 1778? None of the psm options solved my problem, though I see different output.
>>>>>>>>>>>>>>>>> I mostly use tesseract 5.0-alpha, but 4.1 showed the same results anyway. How did you get the bounding box for each character? Alto and lstmbox only show the bbox for a group of characters.
>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>> On Wednesday, July 17, 2019 at 18:58:31 UTC+9, Lorenzo Blz wrote:
>>>>>>>>>>>>>>>>>> Phantom characters here for me too:
>>>>>>>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1778
>>>>>>>>>>>>>>>>>> Are you using 4.1? Bounding boxes were fixed in 4.1; maybe this was also improved.
>>>>>>>>>>>>>>>>>> I wrote some code that uses the symbol iterator to discard symbols that are clearly duplicated: too small, overlapping, etc. But it was not easy to make it work decently, and it is not 100% reliable, with false negatives and positives. I cannot share the code, and it is quite ugly anyway.
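A filter along the lines described above might look roughly like this (my own sketch with made-up thresholds, not Lorenzo's code): drop a symbol whose box is only a few pixels wide, and when two adjacent symbols' boxes mostly overlap, keep the more confident one.

```python
def overlap_ratio(a, b):
    # Fraction of box a's area covered by box b; boxes are (x0, y0, x1, y1).
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    area = (a[2] - a[0]) * (a[3] - a[1])
    return ix * iy / area if area else 1.0

def discard_duplicates(symbols, min_width=4, max_overlap=0.8):
    # symbols: list of (char, conf, box) tuples in reading order,
    # e.g. collected from Tesseract's symbol iterator.
    kept = []
    for char, conf, box in symbols:
        if box[2] - box[0] < min_width:
            continue  # too thin to be a real glyph
        if kept and overlap_ratio(box, kept[-1][2]) > max_overlap:
            # Mostly inside the previous symbol: keep the more confident one.
            if conf > kept[-1][1]:
                kept[-1] = (char, conf, box)
            continue
        kept.append((char, conf, box))
    return kept
```

As the message says, heuristics like these are not 100% reliable; the thresholds trade false positives against false negatives.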
>>>>>>>>>>>>>>>>>> Here is another MRZ model with training data:
>>>>>>>>>>>>>>>>>> https://github.com/DoubangoTelecom/tesseractMRZ
>>>>>>>>>>>>>>>>>> Lorenzo
>>>>>>>>>>>>>>>>>> On Wed, 17 Jul 2019 at 11:26, Claudiu <csaf...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>> I’m getting the “phantom character” issue as well, using the OCRB that Shree trained on MRZ lines. For example, for a 0 it will sometimes add both a 0 and an O to the output, thus outputting 45 characters total instead of 44. I haven’t looked at the bounding box output yet, but I suspect a phantom thin character is added somewhere that I can discard, or maybe two chars will have the same bounding box. If anyone else has fixed this issue further up (e.g. so the output doesn’t contain the phantom characters in the first place), I’d be interested.
>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:01 AM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>> I'll go back to more training later. Before doing so, I'd like to investigate the results a little bit. The hocr and lstmbox options give some details of the positions of characters. The results show positions that perfectly correspond to letters in the image. But the text output contains a character that obviously does not exist.
>>>>>>>>>>>>>>>>>>>> Then I found a config file 'lstmdebug' that generates far more information. I hope it explains what happened with each character. I've yet to read the debug output, but I'd appreciate it if someone could tell me how to read it, because it's really complex.
>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>> On Friday, June 14, 2019 at 19:58:49 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>> See https://github.com/Shreeshrii/tessdata_MICR
>>>>>>>>>>>>>>>>>>>>> I have uploaded my files there.
>>>>>>>>>>>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh is the bash script that runs the training.
>>>>>>>>>>>>>>>>>>>>> You can modify it as needed. Please note this is for legacy/base tesseract --oem 0.
>>>>>>>>>>>>>>>>>>>>> On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>> Thanks a lot, shree. It seems you know everything.
>>>>>>>>>>>>>>>>>>>>>> I tried the MICR0.traineddata and the first two mcr.traineddata. The last one was blocked by the browser. Each of the traineddata had mixed results. All of them get the symbols fairly well but insert spaces randomly and read some numbers wrong.
>>>>>>>>>>>>>>>>>>>>>> MICR0 seems the best among them. Did you suggest that you'd be able to update it?
It gets triple D very often where there's only one, and so on.
>>>>>>>>>>>>>>>>>>>>>> Also, I tried to fine-tune from MICR0, but I found that I need to change language-specific.sh. It specifies some parameters for each language. Do you have any guidance for it?
>>>>>>>>>>>>>>>>>>>>>> On Friday, June 14, 2019 at 1:48:40 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>> see
>>>>>>>>>>>>>>>>>>>>>>> http://www.devscope.net/Content/ocrchecks.aspx
>>>>>>>>>>>>>>>>>>>>>>> https://github.com/BigPino67/Tesseract-MICR-OCR
>>>>>>>>>>>>>>>>>>>>>>> https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ
>>>>>>>>>>>>>>>>>>>>>>> On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> It would be nice if there were traineddata out there, but I didn't find any. I see free fonts and commercial OCR software but no traineddata. The tessdata repository obviously doesn't have one, either.
>>>>>>>>>>>>>>>>>>>>>>>> On Saturday, June 8, 2019 at 1:52:10 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> Please also search for existing MICR traineddata files.
>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> So I did several tests from scratch.
In the last attempt, I made a training text with 4,000 lines in the following format,
>>>>>>>>>>>>>>>>>>>>>>>>>> 110004310510< <02 :4002=0181:801= 0008752 <00039 ;0000001000;
>>>>>>>>>>>>>>>>>>>>>>>>>> and combined it with eng.digits.training_text, in which the symbols are converted to E13B symbols. This makes about 12,000 lines of training text. It's amazing that this generates a good reader out of nowhere. But it is still not very good. For example:
>>>>>>>>>>>>>>>>>>>>>>>>>> <01 :1901=1386:021= 1111001<10001< ;0000090134;
>>>>>>>>>>>>>>>>>>>>>>>>>> is the result on the attached image. It's close, but the last '<' in the result text doesn't exist in the image. It's a small failure, but it causes greater trouble in parsing.
>>>>>>>>>>>>>>>>>>>>>>>>>> What would you suggest from here to increase accuracy?
>>>>>>>>>>>>>>>>>>>>>>>>>> - Increase the number of lines in the training text
>>>>>>>>>>>>>>>>>>>>>>>>>> - Mix up more variations in the training text
>>>>>>>>>>>>>>>>>>>>>>>>>> - Increase the number of iterations
>>>>>>>>>>>>>>>>>>>>>>>>>> - Investigate wrong reads one by one
>>>>>>>>>>>>>>>>>>>>>>>>>> - Or something else?
>>>>>>>>>>>>>>>>>>>>>>>>>> Also, I referred to engrestrict*.* and could generate a similar result with the fine-tuning-from-full method.
It seems a bit faster to get to the same level, but it also stops at a 'good' level. I can go either way if it takes me to the bright future.
>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 15:56:02 UTC+9, ElGato ElMago wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks a lot, Shree. I'll look into it.
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 14:39:52 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> See https://github.com/Shreeshrii/tessdata_shreetest
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Look at the files engrestrict*.* and also https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Create a training text of about 100 lines and fine-tune for 400 lines
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I had about 14 lines, as attached. How many lines would you recommend?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Fine-tuning gives a much better result, but it tends to pick characters other than those in E13B, which only has 14 characters: 0 through 9 and 4 symbols. I thought training from scratch would eliminate such confusion.
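Since E13B has only those 14 characters, one cheap post-processing guard is to flag any recognized character outside the set (a sketch; it assumes the four symbols are transcribed as ';', ':', '<', '=' and that spaces separate fields):

```python
# The 14-character E13B set: ten digits plus the four symbols,
# here assumed to be mapped to the ASCII characters ';', ':', '<', '='.
E13B_CHARS = set("0123456789;:<=")

def invalid_chars(line):
    # Characters in the OCR output that cannot occur in E13B;
    # spaces are tolerated as field separators.
    return [c for c in line if c not in E13B_CHARS and c != " "]
```

A line with a non-empty result (for example an 'O' confused with '0') can then be rejected or sent for manual review instead of silently breaking the parser.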
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 10:43:08 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> For training from scratch, a large training text and hundreds of thousands of iterations are recommended.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If you are just fine-tuning for a font, try to follow the instructions for training for impact, with your font.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago, <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks, Shree.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I saw the instructions. The steps I took are as follows:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Using tesstrain.sh:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --noextract_font_properties --langdata_dir ../langdata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --tessdata_dir ./tessdata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Training from scratch:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata
~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Test with base_checkpoint:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Combining output files:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --stop_training \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Test with eng.traineddata:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The training from scratch ended as:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, word train=0%, skip ratio=0%, New best char error = 0 wrote best model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint wrote checkpoint.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The test with base_checkpoint returns nothing:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The test with eng.traineddata and e13b.png returns out.txt. Both files are attached.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Training seems to have worked fine. I don't know how to interpret the test result from base_checkpoint. The generated eng.traineddata obviously doesn't work well. I suspect the choice of --traineddata when combining the output files is bad, but I have no clue.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> BTW, I referred to your tess4training in the process. It helped a lot.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday, May 29, 2019 at 19