I suggest renaming the traineddata file from eng to e13b or another similarly descriptive name, and also adding a link to it on the data file contributions wiki page.
On Fri, 9 Aug 2019, 20:08, 'Mamadou' via tesseract-ocr <tesseract-ocr@googlegroups.com> wrote:

> On Friday, August 9, 2019 at 10:40:15 AM UTC+2, ElGato ElMago wrote:
>> I added eng.traineddata and LICENSE. I used my account name in the license file. I don't know if it's appropriate or not. Please tell me if it's not.

It's ok.
Thanks. I'll share our dataset (real-life samples) in the coming days.

On Friday, August 9, 2019 at 16:17:41 UTC+9, Mamadou wrote:

> On Friday, August 9, 2019 at 7:31:03 AM UTC+2, ElGato ElMago wrote:
>> Here's my sharing on GitHub. Hope it's of some use to somebody.
>> https://github.com/ElMagoElGato/tess_e13b_training

Thanks for sharing your experience with us.
Is it possible to share your Tesseract model (xxx.traineddata)? We're building a dataset using real-life images, like what we have already done for MRZ (https://github.com/DoubangoTelecom/tesseractMRZ/tree/master/dataset). Your model would help us automate the annotation and speed up our development. Of course we'll have to correct the annotations manually, but it will be faster for us.
Also, please add a license to your repo so that we know whether we have the right to use it.

On Thursday, August 8, 2019 at 9:35:17 UTC+9, ElGato ElMago wrote:

OK, I'll do so. I need to reorganize the naming and so on a little bit. It will be out there soon.

On Wednesday, August 7, 2019 at 21:11:01 UTC+9, Mamadou wrote:

> On Wednesday, August 7, 2019 at 2:36:52 AM UTC+2, ElGato ElMago wrote:
>> Hi,
>>
>> I'm thinking of sharing it, of course. What is the best way to do it? After all this, my contribution is only how I prepared the training text, and even that consists of Shree's text and mine. The instructions and tools I used already exist.

If you have a GitHub account, just create a repo and publish the data and instructions.
ElMagoElGato

On Wednesday, August 7, 2019 at 8:20:02 UTC+9, Mamadou wrote:

Hello,
Are you planning to release the dataset or the models? I'm working on the same subject and planning to share both under BSD terms.

On Tuesday, August 6, 2019 at 10:11:40 AM UTC+2, ElGato ElMago wrote:

Hi,

FWIW, I got to the point where I can feel happy with the accuracy. As the images in the previous post show, the symbols, especially the on-us symbol and the amount symbol, were being confused with each other or with other characters. I added many more symbols to the training text and formed words that start with a symbol. One example is as follows:

9;:;=;<;< <0<1<3<4;6;8;9;:;=;

I randomly made 8,000 lines like this. When fine-tuning from eng, 5,000 iterations were almost enough. The amount symbol is still confused a little when it's followed by 0. Fine-tuning tends to be dragged around by small details. I'll have to think of something to make further improvements.

Training from scratch produced a somewhat more stable traineddata. It doesn't confuse the symbols as often, but it tends to generate extra spaces. By 10,000 iterations those spaces were gone and recognition became very solid.

I thought I might have to do image and box file training, but I guess it's not needed this time.
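[Editor's note] The 8,000 random symbol-led lines described above can be produced with a short script. This is an illustrative sketch, not the script actually used in the thread: the 14-glyph E13B set transliterated as digits plus ';', ':', '<', '=' follows the examples quoted here, and the word-length range is an assumption.

```python
import random

# E13B has 14 glyphs: digits 0-9 plus four symbols, transliterated in this
# thread as ';' ':' '<' '='.
E13B_CHARS = "0123456789;:<="
E13B_SYMBOLS = ";:<="

def make_line(n_words=6, rng=random):
    """Build one training-text line of symbol-led 'words'."""
    words = []
    for _ in range(n_words):
        length = rng.randint(2, 12)  # word-length range is a guess
        # Start each word with a symbol so the model sees symbols word-initially.
        word = rng.choice(E13B_SYMBOLS) + "".join(
            rng.choice(E13B_CHARS) for _ in range(length - 1))
        words.append(word)
    return " ".join(words)

rng = random.Random(42)  # fixed seed so the output is reproducible
lines = [make_line(rng=rng) for _ in range(8000)]
print(lines[0])
```

The point of starting every word with a symbol is the same as in the message above: the trainer otherwise rarely sees symbols in word-initial position.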
ElMagoElGato

On Friday, July 26, 2019 at 14:08:06 UTC+9, ElGato ElMago wrote:

Hi,

Well, I read the description of ScrollView (https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging) and it says:

To show the characters, deselect DISPLAY/Bounding Boxes, select DISPLAY/Polygonal Approx and then select OTHER/Uniform display.

It basically works, but for some reason it doesn't work on my e13b image and ends up with a blue screen. Anyway, it shows each box separately when a character consists of multiple boxes. I'd like to show the box for the whole character. ScrollView doesn't do that, at least not yet. I'll do it on my own.

ElMagoElGato

On Wednesday, July 24, 2019 at 14:10:46 UTC+9, ElGato ElMago wrote:

Hi,

I got this result from hocr. This is where one of the phantom characters comes from.

<span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 98.864532'><</span>
<span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; x_conf 99.018097'>;</span>

The first character is the phantom. It starts at the same x position as the second character, which really exists, and it is only 3 pixels wide. I attach ScrollView screenshots that visualize this.

[image: 2019-07-24-132643_854x707_scrot.png][image: 2019-07-24-132800_854x707_scrot.png]

There seem to be some more cases that cause phantom characters. I'll look into them. But I have a trivial question now.
I made ScrollView show these displays by accidentally clicking the Display->Blamer menu. There is a Bounding Boxes menu below it, but that ends up showing a blue screen, though it briefly shows boxes on the way. Can I use this menu at all? It would be very useful.

[image: 2019-07-24-140739_854x707_scrot.png]

On Tuesday, July 23, 2019 at 17:10:36 UTC+9, ElGato ElMago wrote:

It's great! Perfect! Thanks a lot!

On Tuesday, July 23, 2019 at 10:56:58 UTC+9, shree wrote:

See https://github.com/tesseract-ocr/tesseract/issues/2580

On Tue, 23 Jul 2019, 06:23, ElGato ElMago <elmago...@gmail.com> wrote:

Hi,

I read the output of hocr with lstm_choice_mode = 4, per pull request 2554. It shows the candidates for each character, but it doesn't show the bounding box of each character. It only shows the box for a whole word.

I see bounding boxes for each character in the comments of pull request 2576. How can I do that? Do I have to look into the source code and produce such an output on my own?

On Friday, July 19, 2019 at 18:40:49 UTC+9, ElGato ElMago wrote:

Lorenzo,

I haven't been checking psm too much. I will turn to those options after I see how it goes with bounding boxes.

Shree,

I see the merges in the git log and also see that the new option lstm_choice_amount works now. I guess my executable is the latest, though I still see the phantom character.
Hocr produces huge and complex output. It will take me some time to read it.

On Friday, July 19, 2019 at 18:20:55 UTC+9, Claudiu wrote:

Is there any way to pass bounding boxes to the LSTM? We have an algorithm that cleanly gets bounding boxes of MRZ characters. However, the results using psm 10 are worse than passing the whole line in. Yet when we pass the whole line in, we get these phantom characters.

Should PSM 10 mode work? It often returns "no character" where there clearly is one. I can supply a test case if it is expected to work well.

On Fri, Jul 19, 2019 at 11:06 AM, ElGato ElMago <elmago...@gmail.com> wrote:

Lorenzo,

We both have the same case. It seems a solution to this problem would help a lot of people.

Shree,

I pulled the current head of the master branch, but it doesn't seem to contain the merges you pointed to that were merged 3 to 4 days ago. How can I get them?

ElMagoElGato

On Friday, July 19, 2019 at 17:02:53 UTC+9, Lorenzo Blz wrote:

PSM 7 was a partial solution for my specific case; it improved the situation but did not solve it. Also, I could not use it in some other cases.
The proper solution is very likely doing more training with more data; some data augmentation would probably help if data is scarce. Also, doing less training might help if the training is not being done correctly.

There are also similar issues on GitHub:

https://github.com/tesseract-ocr/tesseract/issues/1465
...

The LSTM engine works like this: it scans the image and for each "pixel column" does this:

M M M M N M M M [BLANK] F F F F

(here I report only the highest-probability characters)

In the example above an M is partially seen as an N. This is normal, and another step of the algorithm (beam search, I think) tries to aggregate the correct characters back together.

I think cases like this:

M M M N N N M M

are what give the phantom characters. More training should reduce the source of the problem, or a painful analysis of the bounding boxes might fix some cases.

I used the attached script for the boxes.

Lorenzo

On Friday, July 19, 2019 at 07:25, ElGato ElMago <elmago...@gmail.com> wrote:

Hi,

Let's call them phantom characters then.
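[Editor's note] Lorenzo's per-column picture corresponds to CTC-style decoding: consecutive identical labels are merged and blanks are dropped. A toy version of the greedy collapse rule (not Tesseract's actual beam search) makes the phantom mechanism concrete:

```python
BLANK = None  # stands for the [BLANK] label in the example above

def ctc_collapse(columns):
    """Greedy CTC decoding: merge consecutive identical labels, drop blanks."""
    out = []
    prev = object()  # sentinel that compares unequal to every label
    for label in columns:
        if label != prev and label is not BLANK:
            out.append(label)
        prev = label
    return "".join(out)

print(ctc_collapse(list("MMMMNMMM") + [BLANK] + list("FFFF")))  # MNMF
print(ctc_collapse(list("MMMNNNMM")))                           # MNM
```

Under this naive rule even the brief N in the first pattern survives as a phantom; the real beam search sees the full per-column probability distributions and can usually heal that case, but a sustained run like M M M N N N M M decodes to MNM either way, which matches Lorenzo's diagnosis.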
Was psm 7 the solution for issue 1778? None of the psm options solved my problem, though I do see different output.

I use tesseract 5.0-alpha mostly, but 4.1 showed the same results anyway. How did you get a bounding box for each character? Alto and lstmbox only show a bbox for a group of characters.

ElMagoElGato

On Wednesday, July 17, 2019 at 18:58:31 UTC+9, Lorenzo Blz wrote:

Phantom characters here for me too:

https://github.com/tesseract-ocr/tesseract/issues/1778

Are you using 4.1? Bounding boxes were fixed in 4.1; maybe this was also improved.

I wrote some code that uses the symbol iterator to discard symbols that are clearly duplicated: too small, overlapping, etc. But it was not easy to make it work decently, and it is not 100% reliable, with both false negatives and false positives. I cannot share the code, and it is quite ugly anyway.
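[Editor's note] Lorenzo cannot share his filter, but the idea he describes (discard symbols whose boxes are too small or mostly overlapped by a neighbor's) can be sketched. The thresholds and function names here are invented for illustration and would need tuning against real boxes:

```python
def overlap_ratio(a, b):
    """Fraction of box a's width covered by box b; boxes are (x0, y0, x1, y1)."""
    left, right = max(a[0], b[0]), min(a[2], b[2])
    width = a[2] - a[0]
    return max(0, right - left) / width if width else 1.0

def filter_phantoms(symbols, min_width=5, max_overlap=0.8):
    """Discard symbols whose box is implausibly thin or almost entirely
    covered by another symbol's box. symbols: list of (char, box) pairs."""
    kept = []
    for i, (ch, box) in enumerate(symbols):
        if box[2] - box[0] < min_width:
            continue  # too thin: likely a phantom
        others = [b for j, (_, b) in enumerate(symbols) if j != i]
        if any(overlap_ratio(box, b) > max_overlap for b in others):
            continue  # nearly contained in a neighbor's box
        kept.append((ch, box))
    return kept

# A duplicated-glyph case: the 'O' sits almost entirely inside the '0' box.
sample = [("0", (10, 0, 20, 30)), ("O", (11, 0, 19, 30)), ("4", (25, 0, 35, 30))]
print(filter_phantoms(sample))  # the overlapping 'O' is dropped
```

The same width test also catches the 3-pixel-wide phantom '<' from the hocr sample earlier in the thread; as Lorenzo warns, any fixed thresholds will produce some false positives and negatives.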
Here is another MRZ model with training data:

https://github.com/DoubangoTelecom/tesseractMRZ

Lorenzo

On Wednesday, July 17, 2019 at 11:26, Claudiu <csaf...@gmail.com> wrote:

I'm getting the "phantom character" issue as well, using the OCRB model that Shree trained on MRZ lines. For example, for a 0 it will sometimes add both a 0 and an O to the output, thus outputting 45 characters total instead of 44. I haven't looked at the bounding box output yet, but I suspect a phantom thin character is added somewhere that I can discard, or maybe two chars will have the same bounding box. If anyone else has fixed this issue further up (e.g. so the output doesn't contain the phantom characters in the first place), I'd be interested.

On Wed, Jul 17, 2019 at 10:01 AM, ElGato ElMago <elmago...@gmail.com> wrote:

Hi,

I'll go back to more training later. Before doing so, I'd like to investigate the results a little bit. The hocr and lstmbox options give some details of the positions of characters. The results show positions that perfectly correspond to the letters in the image.
But the text output contains a character that obviously does not exist.

Then I found a config file, 'lstmdebug', that generates far more information. I hope it explains what happened with each character. I have yet to read the debug output, but I'd appreciate it if someone could tell me how to read it, because it's really complex.

Regards,
ElMagoElGato

On Friday, June 14, 2019 at 19:58:49 UTC+9, shree wrote:

See https://github.com/Shreeshrii/tessdata_MICR

I have uploaded my files there.

https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh is the bash script that runs the training.

You can modify it as needed. Please note this is for legacy/base tesseract --oem 0.

On Fri, Jun 14, 2019 at 1:26 PM, ElGato ElMago <elmago...@gmail.com> wrote:

Thanks a lot, shree. It seems you know everything.

I tried the MICR0.traineddata and the first two mcr.traineddata. The last one was blocked by the browser. Each of the traineddata files had mixed results. All of them get the symbols fairly well, but they insert spaces randomly and read some numbers wrong.
MICR0 seems the best among them. Did you suggest that you'd be able to update it? It reads triple D very often where there's only one, and so on.

Also, I tried to fine-tune from MICR0, but I found that I need to change language-specific.sh. It specifies some parameters for each language. Do you have any guidance for it?

On Friday, June 14, 2019 at 1:48:40 UTC+9, shree wrote:

see
http://www.devscope.net/Content/ocrchecks.aspx
https://github.com/BigPino67/Tesseract-MICR-OCR
https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ

On Mon, Jun 10, 2019 at 11:21 AM, ElGato ElMago <elmago...@gmail.com> wrote:

It would be nice if there were traineddata out there, but I didn't find any. I see free fonts and commercial OCR software, but no traineddata. The tessdata repository obviously doesn't have one, either.

On Saturday, June 8, 2019 at 1:52:10 UTC+9, shree wrote:

Please also search for existing MICR traineddata files.
On Thu, Jun 6, 2019 at 1:09 PM, ElGato ElMago <elmago...@gmail.com> wrote:

So I did several tests from scratch. In the last attempt, I made a training text with 4,000 lines in the following format:

110004310510< <02 :4002=0181:801= 0008752 <00039 ;0000001000;

and combined it with eng.digits.training_text, in which the symbols are converted to E13B symbols. This makes about 12,000 lines of training text. It's amazing that this thing generates a good reader out of nowhere. But then, it is not very good. For example:

<01 :1901=1386:021= 1111001<10001< ;0000090134;

is the result on the attached image. It's close, but the last '<' in the result text doesn't exist in the image. It's a small failure, but it causes greater trouble in parsing.

What would you suggest from here to increase accuracy?
- Increase the number of lines in the training text
- Mix more variations into the training text
- Increase the number of iterations
- Investigate wrong reads one by one
- Or something else?

Also, I referred to engrestrict*.* and could generate a similar result with the fine-tuning-from-full method. It seems a bit faster to get to the same level, but it also stops at a merely 'good' level. I can go either way if it takes me to the bright future.

Regards,
ElMagoElGato

On Thursday, May 30, 2019 at 15:56:02 UTC+9, ElGato ElMago wrote:

Thanks a lot, Shree. I'll look into it.
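[Editor's note] The "mix more variations" option can be mechanized. A sketch that emits synthetic lines shaped like the 4,000-line sample quoted earlier; the field widths and ordering are copied from that one example, not from any MICR specification, and micr_line is a name invented here:

```python
import random

def micr_line(rng):
    """Compose one synthetic line shaped like the sample above:
    '110004310510< <02 :4002=0181:801= 0008752 <00039 ;0000001000;'."""
    d = lambda n: "".join(rng.choice("0123456789") for _ in range(n))
    return (f"{d(12)}< <{d(2)} :{d(4)}={d(4)}:{d(3)}= "
            f"{d(7)} <{d(5)} ;{d(10)};")

rng = random.Random(0)  # fixed seed so the corpus is reproducible
training_lines = [micr_line(rng) for _ in range(4000)]
print(training_lines[0])
```

Randomizing the field widths and field order as well, rather than only the digits, would be one concrete way to add the variation discussed above.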
On Thursday, May 30, 2019 at 14:39:52 UTC+9, shree wrote:

See https://github.com/Shreeshrii/tessdata_shreetest

Look at the files engrestrict*.* and also
https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text

Create a training text of about 100 lines and fine-tune for 400 lines.

On Thu, May 30, 2019 at 9:38 AM, ElGato ElMago <elmago...@gmail.com> wrote:

I had about 14 lines, as attached. How many lines would you recommend?

Fine-tuning gives a much better result, but it tends to pick characters outside E13B, which has only 14 characters: 0 through 9 and 4 symbols. I thought training from scratch would eliminate such confusion.

On Thursday, May 30, 2019 at 10:43:08 UTC+9, shree wrote:

For training from scratch, a large training text and hundreds of thousands of iterations are recommended.
If you are just fine-tuning for a font, try to follow the instructions for training for impact, with your font.

On Thu, 30 May 2019, 06:05, ElGato ElMago <elmago...@gmail.com> wrote:

Thanks, Shree.

Yes, I saw the instructions. The steps I took are as follows.

Using tesstrain.sh:

src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
  --noextract_font_properties --langdata_dir ../langdata \
  --tessdata_dir ./tessdata \
  --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
  --training_text ../langdata/eng/eng.training_e13b_text

Training from scratch:

mkdir -p ~/tesstutorial/e13boutput
src/training/lstmtraining --debug_interval 100 \
  --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
  --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
  --model_output \
~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
  --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
  --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
  --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log

Test with base_checkpoint:

src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
  --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
  --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt

Combining output files:

src/training/lstmtraining --stop_training \
  --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
  --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
  --model_output ~/tesstutorial/e13boutput/eng.traineddata

Test with eng.traineddata:

tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput

The training from scratch ended as:
At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, word train=0%, skip ratio=0%, New best char error = 0 wrote best model:/home/koichi/tesstutorial/e13

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWo3%3DyZ4LOy9cRiDk-VWVWWaDA35-t6T94GdHEgY3RAHw%40mail.gmail.com.