On Friday, August 9, 2019 at 10:40:15 AM UTC+2, ElGato ElMago wrote:
>
> I added eng.traineddata and LICENSE. I used my account name in the
> license file. I don't know if it's appropriate or not. Please tell me if
> it's not.
>
It's OK. Thanks. I'll share our dataset (real-life samples) in the coming days.
> On Friday, August 9, 2019 at 16:17:41 UTC+9, Mamadou wrote:
>>
>> On Friday, August 9, 2019 at 7:31:03 AM UTC+2, ElGato ElMago wrote:
>>>
>>> Here's my sharing on GitHub. Hope it's of any use for somebody.
>>> https://github.com/ElMagoElGato/tess_e13b_training
>>>
>> Thanks for sharing your experience with us.
>> Is it possible to share your Tesseract model (xxx.traineddata)?
>> We're building a dataset using real-life images, like what we have already done for MRZ (https://github.com/DoubangoTelecom/tesseractMRZ/tree/master/dataset). Your model would help us automate the annotation and will speed up our development. Of course we'll have to manually correct the annotations, but it will be faster for us.
>> Also, please add a license to your repo so that we know whether we have the right to use it.
>>
>>> On Thursday, August 8, 2019 at 9:35:17 AM UTC+9, ElGato ElMago wrote:
>>>>
>>>> OK, I'll do so. I need to reorganize naming and so on a little bit. Will be out there soon.
>>>>
>>>> On Wednesday, August 7, 2019 at 21:11:01 UTC+9, Mamadou wrote:
>>>>>
>>>>> On Wednesday, August 7, 2019 at 2:36:52 AM UTC+2, ElGato ElMago wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm thinking of sharing it, of course. What is the best way to do it? After all this, my contribution is only how I prepared the training text, and even that consists of Shree's text and mine. The instructions and tools I used already exist.
>>>>>>
>>>>> If you have a GitHub account, just create a repo and publish the data and instructions.
>>>>>
>>>>>> ElMagoElGato
>>>>>>
>>>>>> On Wednesday, August 7, 2019 at 8:20:02 AM UTC+9, Mamadou wrote:
>>>>>>>
>>>>>>> Hello,
>>>>>>> Are you planning to release the dataset or models? I'm working on the same subject and planning to share both under BSD terms.
>>>>>>>
>>>>>>> On Tuesday, August 6, 2019 at 10:11:40 AM UTC+2, ElGato ElMago wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> FWIW, I got to the point where I can feel happy with the accuracy.
>>>>>>>> As the images of the previous post show, the symbols, especially the on-us symbol and the amount symbol, were being confused with each other or with other characters. I added many more symbols to the training text and formed words that start with a symbol. One example is as follows:
>>>>>>>>
>>>>>>>> 9;:;=;<;< <0<1<3<4;6;8;9;:;=;
>>>>>>>>
>>>>>>>> I randomly made 8,000 lines like this. In fine-tuning from eng, 5,000 iterations were almost good. The amount symbol is still confused a little when it's followed by 0. Fine-tuning tends to be dragged by small details. I'll have to think of something to make further improvement.
>>>>>>>>
>>>>>>>> Training from scratch produced a bit more stable traineddata. It doesn't get confused with symbols so often but tends to generate extra spaces. By 10,000 iterations, those spaces were gone and recognition became very solid.
>>>>>>>>
>>>>>>>> I thought I might have to do image and box file training, but I guess it's not needed this time.
>>>>>>>>
>>>>>>>> ElMagoElGato
>>>>>>>>
>>>>>>>> On Friday, July 26, 2019 at 14:08:06 UTC+9, ElGato ElMago wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Well, I read the description of ScrollView (https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging) and it says:
>>>>>>>>>
>>>>>>>>> To show the characters, deselect DISPLAY/Bounding Boxes, select DISPLAY/Polygonal Approx and then select OTHER/Uniform display.
>>>>>>>>>
>>>>>>>>> It basically works, but for some reason it doesn't work on my e13b image and ends up with a blue screen. Anyway, it shows each box separately when a character consists of multiple boxes. I'd like to show the box for the whole character. ScrollView doesn't do that, at least not yet. I'll do it on my own.
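The line-generation step described above (random words that each start with a symbol) can be sketched in a few lines of Python. This is a minimal illustration, not the author's actual script: the ';' ':' '<' '=' encoding of the four E13B symbols is taken from the samples in this thread, and the word lengths and digit/symbol mix are arbitrary choices.

```python
import random

# ASCII stand-ins for the four E13B symbols, as used in this thread's
# training-text samples (assumption, not an official mapping).
SYMBOLS = ";:<="
DIGITS = "0123456789"

def make_word(max_len=12):
    """Build one word that starts with a symbol, then mixes digits and symbols."""
    chars = [random.choice(SYMBOLS)]
    for _ in range(random.randint(1, max_len)):
        # Bias toward digits so symbols stay roughly as frequent as in real MICR lines.
        pool = DIGITS if random.random() < 0.6 else SYMBOLS
        chars.append(random.choice(pool))
    return "".join(chars)

def make_line(words=4):
    return " ".join(make_word() for _ in range(words))

def write_training_text(path, n_lines=8000):
    with open(path, "w") as f:
        for _ in range(n_lines):
            f.write(make_line() + "\n")

if __name__ == "__main__":
    print(make_line())
```

A file produced by `write_training_text` could then be passed to tesstrain.sh via `--training_text`, as in the commands quoted later in the thread.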
>>>>>>>>> ElMagoElGato
>>>>>>>>>
>>>>>>>>> On Wednesday, July 24, 2019 at 14:10:46 UTC+9, ElGato ElMago wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I got this result from hocr. This is where one of the phantom characters comes from.
>>>>>>>>>>
>>>>>>>>>> <span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 98.864532'><</span>
>>>>>>>>>> <span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; x_conf 99.018097'>;</span>
>>>>>>>>>>
>>>>>>>>>> The first character is the phantom. It starts at the same x position as the second character, which does exist, and is only 3 pixels wide. I attach ScrollView screenshots that visualize this.
>>>>>>>>>>
>>>>>>>>>> [image: 2019-07-24-132643_854x707_scrot.png][image: 2019-07-24-132800_854x707_scrot.png]
>>>>>>>>>>
>>>>>>>>>> There seem to be some more cases that cause phantom characters. I'll look into them. But I have a trivial question now. I made ScrollView show these displays by accidentally clicking the Display->Blamer menu. There is a Bounding Boxes menu below it, but it ends up showing a blue screen, though it briefly shows boxes on the way. Can I use this menu at all? It'll be very useful.
>>>>>>>>>>
>>>>>>>>>> [image: 2019-07-24-140739_854x707_scrot.png]
>>>>>>>>>>
>>>>>>>>>> On Tuesday, July 23, 2019 at 17:10:36 UTC+9, ElGato ElMago wrote:
>>>>>>>>>>>
>>>>>>>>>>> It's great! Perfect! Thanks a lot!
>>>>>>>>>>>
>>>>>>>>>>> On Tuesday, July 23, 2019 at 10:56:58 UTC+9, shree wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> See https://github.com/tesseract-ocr/tesseract/issues/2580
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, 23 Jul 2019, 06:23 ElGato ElMago, <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I read the output of hocr with lstm_choice_mode = 4, as per pull request 2554.
It shows the candidates for each character but doesn't show the bounding box of each character. It only shows the box for a whole word.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I see bounding boxes of each character in comments on pull request 2576. How can I do that? Do I have to look in the source code and produce such output on my own?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Friday, July 19, 2019 at 18:40:49 UTC+9, ElGato ElMago wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Lorenzo,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I haven't been checking psm too much. I will turn to those options after I see how it goes with bounding boxes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Shree,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I see the merges in the git log and also see that the new option lstm_choice_amount works now. I guess my executable is the latest, though I still see the phantom character. Hocr makes huge and complex output. I'll take some time to read it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Friday, July 19, 2019 at 18:20:55 UTC+9, Claudiu wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is there any way to pass bounding boxes to the LSTM? We have an algorithm that cleanly gets bounding boxes of MRZ characters. However, the results using psm 10 are worse than passing the whole line in. Yet when we pass the whole line in, we get these phantom characters.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Should PSM 10 mode work? It often returns "no character" where there clearly is one. I can supply a test case if it is expected to work well.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 11:06 AM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Lorenzo,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We both have got the same case.
It seems a solution to this problem would help a lot of people.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Shree,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I pulled the current head of the master branch, but it doesn't seem to contain the merges you pointed to, which were merged 3 to 4 days ago. How can I get them?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Friday, July 19, 2019 at 17:02:53 UTC+9, Lorenzo Blz wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> PSM 7 was a partial solution for my specific case; it improved the situation but did not solve it. Also, I could not use it in some other cases.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The proper solution is very likely doing more training with more data; some data augmentation might help if data is scarce. Also, doing less training might help if the training is not done correctly.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> There are also similar issues on GitHub:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1465
>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The LSTM engine works like this: it scans the image and for each "pixel column" emits a prediction:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> M M M M N M M M [BLANK] F F F F
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (here I report only the highest-probability characters)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In the example above an M is partially seen as an N. This is normal, and another step of the algorithm (beam search, I think) tries to aggregate the correct characters back together.
>>>>>>>>>>>>>>>>> I think cases like this:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> M M M N N N M M
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> are what give rise to the phantom characters. More training should reduce the source of the problem, or a painful analysis of the bounding boxes might fix some cases.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I used the attached script for the boxes.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Lorenzo
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 07:25 ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Let's call them phantom characters then.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Was psm 7 the solution for issue 1778? None of the psm options solved my problem, though I see different output.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I use tesseract 5.0-alpha mostly, but 4.1 showed the same results anyway. How did you get the bounding box for each character? Alto and lstmbox only show a bbox for a group of characters.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wednesday, July 17, 2019 at 18:58:31 UTC+9, Lorenzo Blz wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Phantom characters here for me too:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1778
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Are you using 4.1? Bounding boxes were fixed in 4.1; maybe this was also improved.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I wrote some code that uses the symbol iterator to discard symbols that are clearly duplicated: too small, overlapping, etc.
But it was not easy to make it work decently, and it is not 100% reliable, with false negatives and positives. I cannot share the code, and it is quite ugly anyway.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Here is another MRZ model with training data:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> https://github.com/DoubangoTelecom/tesseractMRZ
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Lorenzo
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Claudiu <csaf...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I'm getting the "phantom character" issue as well, using the OCRB model that Shree trained on MRZ lines. For example, for a 0 it will sometimes add both a 0 and an O to the output, thus outputting 45 characters total instead of 44. I haven't looked at the bounding box output yet, but I suspect a phantom thin character is added somewhere that I can discard, or maybe two chars will have the same bounding box. If anyone else has fixed this issue further up (e.g. so the output doesn't contain the phantom characters in the first place), I'd be interested.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:01 AM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I'll go back to more training later. Before doing so, I'd like to investigate the results a little bit.
The hocr and lstmbox options give some details of the positions of characters. The results show positions that perfectly correspond to letters in the image, but the text output contains a character that obviously does not exist.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Then I found a config file 'lstmdebug' that generates far more information. I hope it explains what happened with each character. I'm yet to read the debug output, but I'd appreciate it if someone could tell me how to read it, because it's really complex.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Friday, June 14, 2019 at 19:58:49 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> See https://github.com/Shreeshrii/tessdata_MICR
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I have uploaded my files there.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh is the bash script that runs the training. You can modify it as needed. Please note this is for legacy/base tesseract --oem 0.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks a lot, shree. It seems you know everything.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I tried the MICR0.traineddata and the first two mcr.traineddata. The last one was blocked by the browser. Each of the traineddata files had mixed results.
All of them are getting the symbols fairly well but inserting spaces randomly and reading some numbers wrong.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> MICR0 seems the best among them. Did you suggest that you'd be able to update it? It gets triple D very often where there's only one, and so on.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Also, I tried to fine-tune from MICR0 but found that I need to change language-specific.sh. It specifies some parameters for each language. Do you have any guidance for it?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Friday, June 14, 2019 at 1:48:40 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> see
>>>>>>>>>>>>>>>>>>>>>>>> http://www.devscope.net/Content/ocrchecks.aspx
>>>>>>>>>>>>>>>>>>>>>>>> https://github.com/BigPino67/Tesseract-MICR-OCR
>>>>>>>>>>>>>>>>>>>>>>>> https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> That'll be nice if there's traineddata out there, but I didn't find any. I see free fonts and commercial OCR software but no traineddata. The tessdata repository obviously doesn't have one, either.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Saturday, June 8, 2019 at 1:52:10 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Please also search for existing MICR traineddata files.
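The phantom-character symptom shown earlier in the thread (a box only ~3 pixels wide sharing its start x with the real character that follows it) suggests a mechanical screen over the hOCR output. The following is a rough sketch, not code from the thread: it assumes `ocrx_cinfo` spans carrying `x_bboxes` in the single-quoted form shown above, and the `min_width` threshold is an illustrative guess, not a validated value.

```python
import re

# Matches ocrx_cinfo spans like:
# <span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 98.86'><</span>
SPAN_RE = re.compile(
    r"<span class='ocrx_cinfo' title='x_bboxes (\d+) (\d+) (\d+) (\d+);"
    r" x_conf ([\d.]+)'>(.*?)</span>"
)

def filter_phantoms(hocr, min_width=6):
    """Drop characters whose box is suspiciously thin and overlapped by the next one.

    min_width is an illustrative threshold; tune it against real images.
    """
    chars = []
    for m in SPAN_RE.finditer(hocr):
        x0, _, x1, _ = map(int, m.group(1, 2, 3, 4))
        chars.append({"text": m.group(6), "x0": x0, "x1": x1,
                      "conf": float(m.group(5))})
    kept = []
    for i, c in enumerate(chars):
        too_thin = (c["x1"] - c["x0"]) < min_width
        # A phantom often shares its start x with the real character after it.
        overlaps_next = i + 1 < len(chars) and chars[i + 1]["x0"] <= c["x0"]
        if too_thin and overlaps_next:
            continue
        kept.append(c)
    return "".join(c["text"] for c in kept)
```

Run on the two spans quoted earlier, this drops the 3-pixel-wide phantom '<' and keeps the real ';'. Like the symbol-iterator approach Lorenzo describes, it will produce some false negatives and positives.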
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> So I did several tests from scratch. In the last attempt, I made a training text with 4,000 lines in the following format,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> 110004310510< <02 :4002=0181:801= 0008752 <00039 ;0000001000;
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> and combined it with eng.digits.training_text, in which the symbols are converted to E13B symbols. This makes about 12,000 lines of training text. It's amazing that this thing generates a good reader out of nowhere. But then, it is not very good. For example:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> <01 :1901=1386:021= 1111001<10001< ;0000090134;
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> is the result on the attached image. It's close, but the last '<' in the result text doesn't exist in the image. It's a small failure, but it causes greater trouble in parsing.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> What would you suggest from here to increase accuracy?
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> - Increase the number of lines in the training text
>>>>>>>>>>>>>>>>>>>>>>>>>>> - Mix up more variations in the training text
>>>>>>>>>>>>>>>>>>>>>>>>>>> - Increase the number of iterations
>>>>>>>>>>>>>>>>>>>>>>>>>>> - Investigate wrong reads one by one
>>>>>>>>>>>>>>>>>>>>>>>>>>> - Or something else?
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Also, I referred to engrestrict*.* and could generate a similar result with the fine-tuning-from-full method. It seems a bit faster to get to the same level, but it also stops at a 'good' level. I can go with either way if it takes me to the bright future.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 15:56:02 UTC+9, ElGato ElMago wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks a lot, Shree. I'll look into it.
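Since a single phantom '<' is enough to break downstream parsing, one mitigation is to validate the extracted fields strictly and reject any line that doesn't match the expected shape, rather than mis-parse it. The sketch below is hypothetical: the symbol mapping (';' delimiting the amount field, ':' and '=' inside the transit field) and the field patterns are inferred only from the two sample lines quoted above, not from a MICR specification.

```python
import re

# ASCII stand-ins for E13B symbols, inferred from this thread's samples
# (assumption): ';' amount delimiter, ':' transit delimiter, '=' dash.
AMOUNT_RE = re.compile(r";(\d{10});")
TRANSIT_RE = re.compile(r":(\d{4}=\d{4}:\d{3}=?)")

def parse_micr(line):
    """Extract amount and transit fields; return None when the line does not
    match the expected shape (e.g. a phantom symbol corrupted a field)."""
    amount = AMOUNT_RE.search(line)
    transit = TRANSIT_RE.search(line)
    if not amount or not transit:
        return None
    return {"amount": int(amount.group(1)), "transit": transit.group(1)}
```

Returning None instead of a best-effort guess lets the caller route suspicious reads to manual review, which is usually preferable for cheque amounts.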
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 14:39:52 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> See https://github.com/Shreeshrii/tessdata_shreetest
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Look at the files engrestrict*.* and also https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Create a training text of about 100 lines and finetune for 400 lines.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I had about 14 lines, as attached. How many lines would you recommend?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Fine-tuning gives a much better result, but it tends to pick characters outside E13B, which only has 14 characters: 0 through 9 and 4 symbols. I thought training from scratch would eliminate such confusion.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 10:43:08 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> For training from scratch, a large training text and hundreds of thousands of iterations are recommended.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If you are just fine-tuning for a font, try to follow the instructions for training for impact, with your font.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago, <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks, Shree.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I saw the instructions. The steps I took are as follows:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Using tesstrain.sh:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --noextract_font_properties --langdata_dir ../langdata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --tessdata_dir ./tessdata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Training from scratch:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Test with base_checkpoint:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Combining output files:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --stop_training \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Test with eng.traineddata:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The training from scratch ended with:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, word train=0%, skip ratio=0%, New best char error = 0 wrote best
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> model:/home/koichi/tesstutorial/e13