Hello,
Thanks again for sharing your E-13B traineddata; it was helpful. We've managed to get good accuracy for E-13B with Tesseract but failed with CMC-7, so we ended up using TensorFlow for both fonts. I'm curious to know what level of accuracy you've reached. You can check our Tesseract accuracy using the app at https://github.com/DoubangoTelecom/tesseractMICR#the-recognizer-app, and our TensorFlow accuracy at https://www.doubango.org/webapps/micr/. Also, have you tried with real-life samples (e.g. random images from a Google search)? Why are you including the SPACE in your charset and training data? It makes convergence harder.

As promised, the dataset is hosted at https://github.com/DoubangoTelecom/tesseractMICR

On Friday, August 9, 2019 at 10:40:15 AM UTC+2, ElGato ElMago wrote:
>
> I added eng.traineddata and LICENSE. I used my account name in the license file. I don't know whether that's appropriate or not; please tell me if it's not.
>
> On Friday, August 9, 2019 at 16:17:41 UTC+9, Mamadou wrote:
>>
>> On Friday, August 9, 2019 at 7:31:03 AM UTC+2, ElGato ElMago wrote:
>>>
>>> Here's my sharing on GitHub. Hope it's of use to somebody.
>>>
>>> https://github.com/ElMagoElGato/tess_e13b_training
>>>
>> Thanks for sharing your experience with us.
>> Is it possible to share your Tesseract model (xxx.traineddata)?
>> We're building a dataset using real-life images, like what we have already done for MRZ (https://github.com/DoubangoTelecom/tesseractMRZ/tree/master/dataset).
>> Your model would help us automate the annotation and will speed up our development. Of course we'll have to manually correct the annotations, but it will be faster for us.
>> Also, please add a license to your repo so that we know whether we have the right to use it.
>>>
>>> On Thursday, August 8, 2019 at 9:35:17 UTC+9, ElGato ElMago wrote:
>>>>
>>>> OK, I'll do so. I need to reorganize the naming and so on a little bit. It will be out there soon.
>>>>
>>>> On Wednesday, August 7, 2019 at 21:11:01 UTC+9, Mamadou wrote:
>>>>>
>>>>> On Wednesday, August 7, 2019 at 2:36:52 AM UTC+2, ElGato ElMago wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm thinking of sharing it, of course. What is the best way to do it? After all this, my contribution is only how I prepared the training text, and even that consists of Shree's text and mine. The instructions and tools I used already exist.
>>>>>>
>>>>> If you have a GitHub account, just create a repo and publish the data and instructions.
>>>>>>
>>>>>> ElMagoElGato
>>>>>>
>>>>>> On Wednesday, August 7, 2019 at 8:20:02 UTC+9, Mamadou wrote:
>>>>>>>
>>>>>>> Hello,
>>>>>>> Are you planning to release the dataset or models?
>>>>>>> I'm working on the same subject and planning to share both under BSD terms.
>>>>>>>
>>>>>>> On Tuesday, August 6, 2019 at 10:11:40 AM UTC+2, ElGato ElMago wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> FWIW, I got to the point where I can feel happy with the accuracy. As the images in the previous post show, the symbols, especially the on-us symbol and the amount symbol, were being confused with each other or with other characters. I added many more symbols to the training text and formed words that start with a symbol. One example is as follows:
>>>>>>>>
>>>>>>>> 9;:;=;<;< <0<1<3<4;6;8;9;:;=;
>>>>>>>>
>>>>>>>> I randomly made 8,000 lines like this. In fine-tuning from eng, 5,000 iterations were almost enough. The amount symbol is still confused a little when it's followed by a 0. Fine-tuning tends to be dragged around by small details; I'll have to think of something to make further improvements.
>>>>>>>>
>>>>>>>> Training from scratch produced slightly more stable traineddata. It doesn't confuse the symbols so often but tends to generate extra spaces.
>>>>>>>> By 10,000 iterations those spaces were gone and recognition became very solid.
>>>>>>>>
>>>>>>>> I thought I might have to do image and box file training, but I guess it's not needed this time.
>>>>>>>>
>>>>>>>> ElMagoElGato
>>>>>>>>
>>>>>>>> On Friday, July 26, 2019 at 14:08:06 UTC+9, ElGato ElMago wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Well, I read the description of ScrollView (https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging) and it says:
>>>>>>>>>
>>>>>>>>> To show the characters, deselect DISPLAY/Bounding Boxes, select DISPLAY/Polygonal Approx and then select OTHER/Uniform display.
>>>>>>>>>
>>>>>>>>> It basically works, but for some reason it doesn't work on my E13B image and ends up with a blue screen. Anyway, it shows each box separately when a character consists of multiple boxes. I'd like to show the box for the whole character. ScrollView doesn't do it, at least yet. I'll do it on my own.
>>>>>>>>>
>>>>>>>>> ElMagoElGato
>>>>>>>>>
>>>>>>>>> On Wednesday, July 24, 2019 at 14:10:46 UTC+9, ElGato ElMago wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I got this result from hOCR. This is where one of the phantom characters comes from.
>>>>>>>>>>
>>>>>>>>>> <span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 98.864532'><</span>
>>>>>>>>>> <span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; x_conf 99.018097'>;</span>
>>>>>>>>>>
>>>>>>>>>> The first character is the phantom. Its box starts at the same x position as the second character, which really exists, and it is only 3 pixels wide. I attach ScrollView screenshots that visualize this.
>>>>>>>>>>
>>>>>>>>>> [image: 2019-07-24-132643_854x707_scrot.png][image: 2019-07-24-132800_854x707_scrot.png]
>>>>>>>>>>
>>>>>>>>>> There seem to be some more cases that cause phantom characters. I'll look into them. But I have a trivial question now. I made ScrollView show these displays by accidentally clicking the Display->Blamer menu. There is a Bounding Boxes menu below it, but it ends up showing a blue screen, though it briefly shows boxes on the way. Can I use this menu at all? It would be very useful.
>>>>>>>>>>
>>>>>>>>>> [image: 2019-07-24-140739_854x707_scrot.png]
>>>>>>>>>>
>>>>>>>>>> On Tuesday, July 23, 2019 at 17:10:36 UTC+9, ElGato ElMago wrote:
>>>>>>>>>>>
>>>>>>>>>>> It's great! Perfect! Thanks a lot!
>>>>>>>>>>>
>>>>>>>>>>> On Tuesday, July 23, 2019 at 10:56:58 UTC+9, shree wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> See https://github.com/tesseract-ocr/tesseract/issues/2580
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, 23 Jul 2019, 06:23 ElGato ElMago, <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I read the output of hOCR with lstm_choice_mode = 4, per pull request 2554. It shows the candidates for each character but doesn't show the bounding box of each character; it only shows the box for a whole word.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I see bounding boxes for each character in the comments of pull request 2576. How can I do that? Do I have to look into the source code and produce such output on my own?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Friday, July 19, 2019 at 18:40:49 UTC+9, ElGato ElMago wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Lorenzo,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I haven't been checking psm too much. I will turn to those options after I see how it goes with bounding boxes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Shree,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I see the merges in the git log and also see that the new option lstm_choice_amount works now. I guess my executable is up to date, though I still see the phantom character. The hOCR output is huge and complex; I'll take some time to read it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Friday, July 19, 2019 at 18:20:55 UTC+9, Claudiu wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is there any way to pass bounding boxes to the LSTM? We have an algorithm that cleanly gets bounding boxes of MRZ characters. However, the results using psm 10 are worse than passing the whole line in. Yet when we pass the whole line in, we get these phantom characters.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Should PSM 10 mode work? It often returns "no character" where there clearly is one. I can supply a test case if it is expected to work well.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 11:06 AM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Lorenzo,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We both have the same case. It seems a solution to this problem would save a lot of people.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Shree,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I pulled the current head of the master branch, but it doesn't seem to contain the merges you pointed to that were merged 3 to 4 days ago. How can I get them?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Friday, July 19, 2019 at 17:02:53 UTC+9, Lorenzo Blz wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> PSM 7 was a partial solution for my specific case; it improved the situation but did not solve it.
>>>>>>>>>>>>>>>>> Also, I could not use it in some other cases.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The proper solution is very likely doing more training with more data; some data augmentation might help if data is scarce. Doing less training might also help if the training is not done correctly.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> There are also similar issues on GitHub:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1465
>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The LSTM engine works like this: it scans the image and for each "pixel column" outputs something like:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> M M M M N M M M [BLANK] F F F F
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (here I report only the highest-probability characters)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In the example above an M is partially seen as an N. This is normal, and another step of the algorithm (beam search, I think) tries to aggregate the correct characters back together.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think cases like this:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> M M M N N N M M
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> are what give the phantom characters. More training should reduce the source of the problem, or a painful analysis of the bounding boxes might fix some cases.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I used the attached script for the boxes.
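[Editor's note] The per-column decoding described above can be illustrated with a tiny greedy CTC-style collapse. This is a simplified sketch, not Tesseract's actual beam-search decoder: it takes the best label per pixel column, merges repeats, and drops blanks, which shows how a run misread in the middle survives the merge as a phantom character.

```python
# Greedy CTC-style collapse: keep a label only when it differs from the
# previous column's label, and never keep blanks.
BLANK = "[BLANK]"

def collapse(columns):
    out, prev = [], None
    for label in columns:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)

# A clean run collapses to one character per glyph:
print(collapse(["M", "M", "M", "M", BLANK, "F", "F"]))     # -> MF

# A glyph whose middle columns are misread as N yields a phantom:
print(collapse(["M", "M", "M", "N", "N", "N", "M", "M"]))  # -> MNM
```

Tesseract's real decoder weighs the per-column probabilities, so a brief misread (like the single N in the first example in the quoted message) is usually repaired, while a longer run of misread columns can survive as a phantom character.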
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Lorenzo
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, 19 Jul 2019 at 07:25, ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Let's call them phantom characters then.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Was psm 7 the solution for issue 1778? None of the psm options solved my problem, though I see different output.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I use tesseract 5.0-alpha mostly, but 4.1 showed the same results anyway. How did you get the bounding box for each character? Alto and lstmbox only show the bbox for a group of characters.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wednesday, July 17, 2019 at 18:58:31 UTC+9, Lorenzo Blz wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Phantom characters here for me too:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1778
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Are you using 4.1? Bounding boxes were fixed in 4.1; maybe this was also improved.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I wrote some code that uses the symbol iterator to discard symbols that are clearly duplicated: too small, overlapping, etc. But it was not easy to make it work decently, and it is not 100% reliable, with false negatives and positives. I cannot share the code, and it is quite ugly anyway.
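[Editor's note] A post-filter in the spirit of what Lorenzo describes can be sketched by parsing the per-character `x_bboxes` entries from the hOCR output shown earlier in the thread, then discarding symbols whose boxes are implausibly narrow or fully contained in a neighbour's box. This is a hypothetical sketch, not his code; the 5-pixel minimum width is a made-up threshold to tune per font and DPI.

```python
import re

# Each hOCR ocrx_cinfo span carries one character plus its box, e.g.
#   <span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 98.86'><</span>
PAT = re.compile(r"x_bboxes (\d+) (\d+) (\d+) (\d+); *x_conf [\d.]+'>(.)</span>")

def parse_symbols(hocr):
    """Return a list of (char, (x0, y0, x1, y1)) from hOCR cinfo spans."""
    return [(m.group(5), tuple(map(int, m.group(1, 2, 3, 4))))
            for m in PAT.finditer(hocr)]

def filter_phantoms(symbols, min_width=5):
    """Drop likely phantoms: boxes that are very narrow (the phantom '<'
    in the thread was only 3 px wide) or fully inside the next box."""
    kept = []
    for i, (ch, box) in enumerate(symbols):
        x0, y0, x1, y1 = box
        if x1 - x0 < min_width:
            continue
        if i + 1 < len(symbols):
            nx0, _, nx1, _ = symbols[i + 1][1]
            if x0 >= nx0 and x1 <= nx1:  # horizontally swallowed by neighbour
                continue
        kept.append((ch, box))
    return kept
```

Run against the two spans quoted earlier in the thread, this keeps the real `;` and drops the 3-pixel phantom `<`; as Lorenzo warns, threshold-based filtering like this is never 100% reliable.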
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Here is another MRZ model with training data:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> https://github.com/DoubangoTelecom/tesseractMRZ
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Lorenzo
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Wed, 17 Jul 2019 at 11:26, Claudiu <csaf...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I'm getting the "phantom character" issue as well, using the OCRB that Shree trained on MRZ lines. For example, for a 0 it will sometimes add both a 0 and an O to the output, thus outputting 45 characters total instead of 44. I haven't looked at the bounding box output yet, but I suspect a phantom thin character is added somewhere that I can discard, or maybe two chars will have the same bounding box. If anyone else has fixed this issue further up (e.g. so the output doesn't contain the phantom characters in the first place), I'd be interested.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:01 AM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I'll go back to more training later. Before doing so, I'd like to investigate the results a little bit. The hocr and lstmbox options give some details of the positions of characters. The results show positions that perfectly correspond to the letters in the image.
>>>>>>>>>>>>>>>>>>>>> But the text output contains a character that obviously does not exist.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Then I found a config file 'lstmdebug' that generates far more information. I hope it explains what happened with each character. I'm yet to read the debug output, but I'd appreciate it if someone could tell me how to read it, because it's really complex.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Friday, June 14, 2019 at 19:58:49 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> See https://github.com/Shreeshrii/tessdata_MICR
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I have uploaded my files there.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh is the bash script that runs the training.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> You can modify it as needed. Please note this is for legacy/base tesseract --oem 0.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks a lot, shree. It seems you know everything.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I tried the MICR0.traineddata and the first two mcr.traineddata. The last one was blocked by the browser. Each of the traineddata files had mixed results. All of them get the symbols fairly well but insert spaces randomly and read some numbers wrong.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> MICR0 seems the best among them.
>>>>>>>>>>>>>>>>>>>>>>> Did you suggest that you'd be able to update it? It gets triple D very often where there's only one, and so on.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Also, I tried to fine-tune from MICR0, but I found that I need to change language-specific.sh. It specifies some parameters for each language. Do you have any guidance for it?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Friday, June 14, 2019 at 1:48:40 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> See
>>>>>>>>>>>>>>>>>>>>>>>> http://www.devscope.net/Content/ocrchecks.aspx
>>>>>>>>>>>>>>>>>>>>>>>> https://github.com/BigPino67/Tesseract-MICR-OCR
>>>>>>>>>>>>>>>>>>>>>>>> https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> It would be nice if there were traineddata out there, but I didn't find any. I see free fonts and commercial OCR software but no traineddata. The tessdata repository obviously doesn't have one, either.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Saturday, June 8, 2019 at 1:52:10 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Please also search for existing MICR traineddata files.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> So I did several tests from scratch.
>>>>>>>>>>>>>>>>>>>>>>>>>>> In the last attempt, I made a training text with 4,000 lines in the following format:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> 110004310510< <02 :4002=0181:801= 0008752 <00039 ;0000001000;
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> and combined it with eng.digits.training_text, in which the symbols are converted to E13B symbols. This makes about 12,000 lines of training text. It's amazing that this thing generates a good reader out of nowhere, but it is still not very good. For example:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> <01 :1901=1386:021= 1111001<10001< ;0000090134;
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> is the result on the attached image. It's close, but the last '<' in the result text doesn't exist in the image. It's a small failure, but it causes greater trouble in parsing.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> What would you suggest from here to increase accuracy?
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> - Increase the number of lines in the training text
>>>>>>>>>>>>>>>>>>>>>>>>>>> - Mix up more variations in the training text
>>>>>>>>>>>>>>>>>>>>>>>>>>> - Increase the number of iterations
>>>>>>>>>>>>>>>>>>>>>>>>>>> - Investigate wrong reads one by one
>>>>>>>>>>>>>>>>>>>>>>>>>>> - Or something else?
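[Editor's note] For the first two suggestions, a training text like the 4,000-line sample above can be generated with a short script. A minimal sketch, not the script actually used in the thread; it assumes the thread's convention of writing the four E13B symbols as ';', ':', '<', '=', and the field lengths and counts are made-up parameters to vary:

```python
import random

# E13B has only 14 glyphs: digits 0-9 plus four symbols, encoded here
# as ';', ':', '<', '=' (the convention used in this thread's examples).
DIGITS = "0123456789"
SYMBOLS = ";:<="

def make_field(rng):
    """One field starting with a symbol, then 2-10 digits, e.g. '<00039'."""
    return rng.choice(SYMBOLS) + "".join(
        rng.choice(DIGITS) for _ in range(rng.randint(2, 10)))

def make_line(rng, n_fields=4):
    """One training line of several symbol-prefixed fields."""
    return " ".join(make_field(rng) for _ in range(n_fields))

rng = random.Random(42)          # fixed seed for reproducible output
lines = [make_line(rng) for _ in range(8000)]
print(len(lines), "lines, e.g.", lines[0])
```

The generated lines would then go into the training text file passed to tesstrain.sh via --training_text, as in the commands later in this thread.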
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Also, I referred to engrestrict*.* and could generate a similar result with the fine-tuning-from-full method. It seems a bit faster to get to the same level, but it also stops at a 'good' level. I can go either way if it takes me to the bright future.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 15:56:02 UTC+9, ElGato ElMago wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks a lot, Shree. I'll look into it.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 14:39:52 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> See https://github.com/Shreeshrii/tessdata_shreetest
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Look at the files engrestrict*.* and also https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Create a training text of about 100 lines and finetune for 400 lines.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I had about 14 lines, as attached. How many lines would you recommend?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Fine-tuning gives a much better result, but it tends to pick characters outside E13B, which has only 14 characters: 0 through 9 and 4 symbols. I thought training from scratch would eliminate such confusion.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 10:43:08 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> For training from scratch, a large training text and hundreds of thousands of iterations are recommended.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If you are just fine-tuning for a font, try to follow the instructions for training for Impact, with your font.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago, <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks, Shree.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I saw the instructions.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The steps I made are as follows:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Using tesstrain.sh:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --noextract_font_properties --langdata_dir ../langdata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --tessdata_dir ./tessdata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Training from scratch:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Test with base_checkpoint:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Combining output files:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --stop_training \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Test with eng.traineddata:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The training from scratch ended as:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, word train=0%, skip ratio=0%, New best char error = 0 wrote best model:/home/koichi/tesstutorial/e13