On Friday, August 9, 2019 at 10:40:15 AM UTC+2, ElGato ElMago wrote:
>
> I added eng.traineddata and LICENSE.  I used my account name in the 
> license file.  I don't know if it's appropriate or not.  Please tell me if 
> it's not.
>
It's ok.
Thanks. I'll share our dataset (real-life samples) in the coming days. 

>
> On Friday, August 9, 2019 at 4:17:41 PM UTC+9, Mamadou wrote:
>>
>>
>>
>> On Friday, August 9, 2019 at 7:31:03 AM UTC+2, ElGato ElMago wrote:
>>>
>>> Here's what I'm sharing on GitHub.  I hope it's of some use to somebody.
>>>
>>> https://github.com/ElMagoElGato/tess_e13b_training
>>>
>> Thanks for sharing your experience with us.
>> Is it possible to share your Tesseract model (xxx.traineddata)?
>> We're building a dataset using real-life images, like what we have already 
>> done for MRZ (
>> https://github.com/DoubangoTelecom/tesseractMRZ/tree/master/dataset).
>> Your model would help us automate the annotation and will speed up our 
>> development. Of course we'll have to manually correct the annotations, but it 
>> will be faster for us. 
>> Also, please add a license to your repo so that we know whether we have the 
>> right to use it.
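>>
>> To be concrete about the pre-annotation we have in mind: we'd run your 
>> traineddata over each sample image and save the raw transcription next to it 
>> for later manual correction. A rough sketch (the dataset path and the 
>> language name "e13b" are just placeholders):
>>
>>   import pathlib, subprocess
>>
>>   TESSDATA_DIR = "tessdata"   # folder holding the shared traineddata (placeholder)
>>   LANG = "e13b"               # placeholder language name
>>
>>   for img in sorted(pathlib.Path("dataset").glob("*.png")):
>>       out_base = str(img)[:-len(".png")]     # tesseract writes <out_base>.txt
>>       subprocess.run(["tesseract", str(img), out_base,
>>                       "--tessdata-dir", TESSDATA_DIR, "-l", LANG,
>>                       "--psm", "7", "txt"], check=True)
>>       # keep the raw transcription as a .gt.txt file to be corrected by hand
>>       pathlib.Path(out_base + ".txt").rename(out_base + ".gt.txt")
>>
>> The .gt.txt files would then be corrected manually before being used.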
>>
>>>
>>>
>>> On Thursday, August 8, 2019 at 9:35:17 AM UTC+9, ElGato ElMago wrote:
>>>>
>>>> OK, I'll do so.  I need to reorganize naming and so on a little bit.  
>>>> Will be out there soon.
>>>>
>>>> On Wednesday, August 7, 2019 at 9:11:01 PM UTC+9, Mamadou wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Wednesday, August 7, 2019 at 2:36:52 AM UTC+2, ElGato ElMago wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm thinking of sharing it, of course.  What is the best way to do 
>>>>>> it?  After all this, my own contribution is only how I prepared the 
>>>>>> training text, and even that consists of Shree's text and mine.  The 
>>>>>> instructions and tools I used already exist.
>>>>>>
>>>>> If you have a GitHub account, just create a repo and publish the data 
>>>>> and instructions. 
>>>>>
>>>>>>
>>>>>> ElMagoElGato
>>>>>>
>>>>>> On Wednesday, August 7, 2019 at 8:20:02 AM UTC+9, Mamadou wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>> Are you planning to release the dataset or models?
>>>>>>> I'm working on the same subject and planning to share both under BSD 
>>>>>>> terms
>>>>>>>
>>>>>>> On Tuesday, August 6, 2019 at 10:11:40 AM UTC+2, ElGato ElMago wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> FWIW, I got to the point where I can feel happy with the accuracy. 
>>>>>>>> As the images in the previous post show, the symbols, especially the 
>>>>>>>> on-us symbol and the amount symbol, were getting mixed up with each 
>>>>>>>> other or with other characters.  I added many more symbols to the 
>>>>>>>> training text and formed words that start with a symbol.  One example 
>>>>>>>> is as follows:
>>>>>>>>
>>>>>>>> 9;:;=;<;< <0<1<3<4;6;8;9;:;=;
>>>>>>>>
>>>>>>>>
>>>>>>>> I randomly made 8,000 lines like this.  When fine-tuning from eng, 
>>>>>>>> 5,000 iterations was almost good.  The amount symbol is still confused 
>>>>>>>> a little when it's followed by a 0.  Fine-tuning tends to get dragged 
>>>>>>>> around by small artifacts.  I'll have to think of something to improve 
>>>>>>>> it further.
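>>>>>>>>
>>>>>>>> A minimal sketch of the kind of generator I mean (the word lengths and 
>>>>>>>> mixing below are only illustrative, not the exact values I used):
>>>>>>>>
>>>>>>>>   import random
>>>>>>>>
>>>>>>>>   DIGITS = "0123456789"
>>>>>>>>   SYMBOLS = ":;<="   # stand-ins for the four E13B symbols in this training text
>>>>>>>>
>>>>>>>>   def word():
>>>>>>>>       # start every word with a symbol so symbol/digit transitions are well covered
>>>>>>>>       length = random.randint(3, 12)
>>>>>>>>       return random.choice(SYMBOLS) + "".join(
>>>>>>>>           random.choice(DIGITS + SYMBOLS) for _ in range(length))
>>>>>>>>
>>>>>>>>   with open("e13b.symbols.training_text", "w") as out:
>>>>>>>>       for _ in range(8000):
>>>>>>>>           out.write(" ".join(word() for _ in range(random.randint(2, 5))) + "\n")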
>>>>>>>>
>>>>>>>> Training from scratch produced slightly more stable traineddata.  It 
>>>>>>>> doesn't confuse the symbols as often, but it tends to generate extra 
>>>>>>>> spaces.  By 10,000 iterations, those spaces were gone and recognition 
>>>>>>>> became very solid.
>>>>>>>>
>>>>>>>> I thought I might have to do image and box file training but I 
>>>>>>>> guess it's not needed this time.
>>>>>>>>
>>>>>>>> ElMagoElGato
>>>>>>>>
>>>>>>>> On Friday, July 26, 2019 at 2:08:06 PM UTC+9, ElGato ElMago wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Well, I read the description of ScrollView (
>>>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging) 
>>>>>>>>> and it says:
>>>>>>>>>
>>>>>>>>> To show the characters, deselect DISPLAY/Bounding Boxes, select 
>>>>>>>>> DISPLAY/Polygonal Approx and then select OTHER/Uniform display.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> It basically works.  But for some reason it doesn't work on my 
>>>>>>>>> e13b image and ends up with a blue screen.  Anyway, it shows each box 
>>>>>>>>> separately when a character consists of multiple boxes.  I'd like to 
>>>>>>>>> show the box for the whole character.  ScrollView doesn't do that, at 
>>>>>>>>> least not yet.  I'll do it on my own.
>>>>>>>>>
>>>>>>>>> ElMagoElGato
>>>>>>>>>
>>>>>>>>> On Wednesday, July 24, 2019 at 2:10:46 PM UTC+9, ElGato ElMago wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I got this result from hocr.  This is where one of the phantom 
>>>>>>>>>> characters comes from.
>>>>>>>>>>
>>>>>>>>>> <span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; 
>>>>>>>>>> x_conf 98.864532'>&lt;</span>
>>>>>>>>>> <span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; 
>>>>>>>>>> x_conf 99.018097'>;</span>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The first character is the phantom.  It starts at the same x 
>>>>>>>>>> position as the second character, which does exist.  The first 
>>>>>>>>>> character's box is only 3 pixels wide.  I attach ScrollView 
>>>>>>>>>> screenshots that visualize this.
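>>>>>>>>>>
>>>>>>>>>> For what it's worth, the per-character boxes can be pulled straight 
>>>>>>>>>> out of that hocr output and screened for suspiciously thin boxes.  A 
>>>>>>>>>> quick sketch, assuming the ocrx_cinfo span format shown above:
>>>>>>>>>>
>>>>>>>>>>   import re, sys
>>>>>>>>>>
>>>>>>>>>>   # extract (char, x1, y1, x2, y2, conf) from ocrx_cinfo spans in the hocr file
>>>>>>>>>>   pat = re.compile(r"<span class='ocrx_cinfo' title='x_bboxes "
>>>>>>>>>>                    r"(\d+) (\d+) (\d+) (\d+);\s*x_conf ([\d.]+)'>(.*?)</span>", re.S)
>>>>>>>>>>
>>>>>>>>>>   for m in pat.finditer(open(sys.argv[1]).read()):
>>>>>>>>>>       x1, y1, x2, y2 = (int(v) for v in m.group(1, 2, 3, 4))
>>>>>>>>>>       conf, ch = float(m.group(5)), m.group(6)
>>>>>>>>>>       if x2 - x1 < 5:
>>>>>>>>>>           # a box only a few pixels wide is a good phantom candidate
>>>>>>>>>>           print("phantom?", repr(ch), (x1, y1, x2, y2), conf)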
>>>>>>>>>>
>>>>>>>>>> [image: 2019-07-24-132643_854x707_scrot.png][image: 
>>>>>>>>>> 2019-07-24-132800_854x707_scrot.png]
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> There seem to be some more cases that cause phantom characters.  
>>>>>>>>>> I'll look into them.  But I have a trivial question now.  I got 
>>>>>>>>>> ScrollView to show these displays by accidentally clicking the 
>>>>>>>>>> Display->Blamer menu.  There is a Bounding Boxes menu below it, but 
>>>>>>>>>> that ends up showing a blue screen, though it briefly shows the 
>>>>>>>>>> boxes along the way.  Can I use this menu at all?  It would be very 
>>>>>>>>>> useful.
>>>>>>>>>>
>>>>>>>>>> [image: 2019-07-24-140739_854x707_scrot.png]
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tuesday, July 23, 2019 at 5:10:36 PM UTC+9, ElGato ElMago wrote:
>>>>>>>>>>>
>>>>>>>>>>> It's great! Perfect!  Thanks a lot!
>>>>>>>>>>>
>>>>>>>>>>> On Tuesday, July 23, 2019 at 10:56:58 AM UTC+9, shree wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> See https://github.com/tesseract-ocr/tesseract/issues/2580
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, 23 Jul 2019, 06:23 ElGato ElMago, <elmago...@gmail.com> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I read the hocr output with lstm_choice_mode = 4, as described 
>>>>>>>>>>>>> in pull request 2554.  It shows the candidates for each character 
>>>>>>>>>>>>> but doesn't show the bounding box of each character.  It only 
>>>>>>>>>>>>> shows the box for the whole word.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I see bounding boxes for each character in the comments of pull 
>>>>>>>>>>>>> request 2576.  How can I do that?  Do I have to look into the 
>>>>>>>>>>>>> source code and produce such output on my own?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Friday, July 19, 2019 at 6:40:49 PM UTC+9, ElGato ElMago wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Lorenzo,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I haven't been checking psm too much.  Will turn to those 
>>>>>>>>>>>>>> options after I see how it goes with bounding boxes.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Shree,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I see the merges in the git log and also see that the new 
>>>>>>>>>>>>>> option lstm_choice_amount works now.  I guess my executable is 
>>>>>>>>>>>>>> the latest, though I still see the phantom character.  Hocr 
>>>>>>>>>>>>>> produces huge and complex output; I'll take some time to read it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Friday, July 19, 2019 at 6:20:55 PM UTC+9, Claudiu wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is there any way to pass bounding boxes to the LSTM? 
>>>>>>>>>>>>>>> We have an algorithm that cleanly gets bounding boxes of MRZ 
>>>>>>>>>>>>>>> characters.  However, the results using psm 10 are worse than 
>>>>>>>>>>>>>>> passing the whole line in.  Yet when we pass the whole line in, 
>>>>>>>>>>>>>>> we get these phantom characters. 
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Should PSM 10 mode work? It often returns “no character” 
>>>>>>>>>>>>>>> where there clearly is one. I can supply a test case if it is 
>>>>>>>>>>>>>>> expected to 
>>>>>>>>>>>>>>> work well. 
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 11:06 AM ElGato ElMago <
>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Lorenzo,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We both have the same case.  It seems a solution to 
>>>>>>>>>>>>>>>> this problem would help a lot of people.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Shree,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I pulled the current head of the master branch, but it doesn't 
>>>>>>>>>>>>>>>> seem to contain the merges you pointed to that were merged 3 
>>>>>>>>>>>>>>>> to 4 days ago.  How can I get them?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Friday, July 19, 2019 at 5:02:53 PM UTC+9, Lorenzo Blz wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> PSM 7 was a partial solution for my specific case: it 
>>>>>>>>>>>>>>>>> improved the situation but did not solve it.  Also, I could 
>>>>>>>>>>>>>>>>> not use it in some other cases.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The proper solution is very likely more training with more 
>>>>>>>>>>>>>>>>> data; some data augmentation will probably help if data is 
>>>>>>>>>>>>>>>>> scarce.  Doing less training might also help if the training 
>>>>>>>>>>>>>>>>> is not done correctly.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> There are also similar issues on github:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1465
>>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The LSTM engine works like this: it scans the image and 
>>>>>>>>>>>>>>>>> for each "pixel column" does this:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> M M M M N M M M [BLANK] F F F F
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> (here I report only the highest-probability characters)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In the example above an M is partially seen as an N; this is 
>>>>>>>>>>>>>>>>> normal, and another step of the algorithm (beam search, I 
>>>>>>>>>>>>>>>>> think) aggregates the columns back into the correct characters.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I think cases like this:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> M M M N N N M M
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> are what give the phantom characters.  More training should 
>>>>>>>>>>>>>>>>> reduce the source of the problem, or a painful analysis of 
>>>>>>>>>>>>>>>>> the bounding boxes might fix some cases.
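>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Just to make the mechanism concrete, here is a toy best-path 
>>>>>>>>>>>>>>>>> collapse (merge repeats, drop blanks).  It is not Tesseract's 
>>>>>>>>>>>>>>>>> actual decoder, which also weighs the full per-column 
>>>>>>>>>>>>>>>>> probability distributions:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>   BLANK = "[BLANK]"
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>   def collapse(frames):
>>>>>>>>>>>>>>>>>       out, prev = [], None
>>>>>>>>>>>>>>>>>       for f in frames:
>>>>>>>>>>>>>>>>>           if f != prev and f != BLANK:
>>>>>>>>>>>>>>>>>               out.append(f)
>>>>>>>>>>>>>>>>>           prev = f
>>>>>>>>>>>>>>>>>       return "".join(out)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>   # "MNMF": the stray N survives as a phantom character
>>>>>>>>>>>>>>>>>   print(collapse("M M M M N M M M [BLANK] F F F F".split()))
>>>>>>>>>>>>>>>>>   # "MF": a clean run collapses correctly
>>>>>>>>>>>>>>>>>   print(collapse("M M M M M M M M [BLANK] F F F F".split()))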
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I used the attached script for the boxes.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Lorenzo
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Il giorno ven 19 lug 2019 alle ore 07:25 ElGato ElMago <
>>>>>>>>>>>>>>>>> elmago...@gmail.com> ha scritto:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Let's call them phantom characters then.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Was psm 7 the solution for issue 1778?  None of the psm 
>>>>>>>>>>>>>>>>>> options solved my problem, though I do see different output.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I mostly use tesseract 5.0-alpha, but 4.1 showed the same 
>>>>>>>>>>>>>>>>>> results anyway.  How did you get the bounding box for each 
>>>>>>>>>>>>>>>>>> character?  Alto and lstmbox only show the bbox for a group 
>>>>>>>>>>>>>>>>>> of characters.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wednesday, July 17, 2019 at 6:58:31 PM UTC+9, Lorenzo Blz wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Phantom characters here for me too:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1778
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Are you using 4.1?  Bounding boxes were fixed in 4.1, so 
>>>>>>>>>>>>>>>>>>> maybe this was also improved.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I wrote some code that uses the symbol iterator to discard 
>>>>>>>>>>>>>>>>>>> symbols that are clearly duplicated: too small, overlapping, 
>>>>>>>>>>>>>>>>>>> etc.  But it was not easy to make it work decently, and it 
>>>>>>>>>>>>>>>>>>> is not 100% reliable, with both false negatives and false 
>>>>>>>>>>>>>>>>>>> positives.  I cannot share the code, and it is quite ugly 
>>>>>>>>>>>>>>>>>>> anyway.
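>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> A rough illustration of the kind of filter I mean (not the 
>>>>>>>>>>>>>>>>>>> actual code, just a sketch over a list of (char, box) tuples 
>>>>>>>>>>>>>>>>>>> such as the per-character boxes discussed earlier):
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>   MIN_WIDTH = 5       # boxes thinner than this are treated as phantoms
>>>>>>>>>>>>>>>>>>>   MAX_OVERLAP = 0.8   # fraction of a box covered by its left neighbour
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>   def x_overlap(a, b):
>>>>>>>>>>>>>>>>>>>       # horizontal overlap of box a with box b, as a fraction of a's width
>>>>>>>>>>>>>>>>>>>       inter = max(0, min(a[2], b[2]) - max(a[0], b[0]))
>>>>>>>>>>>>>>>>>>>       return inter / max(1, a[2] - a[0])
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>   def filter_symbols(symbols):   # symbols: [(text, (x1, y1, x2, y2)), ...]
>>>>>>>>>>>>>>>>>>>       kept = []
>>>>>>>>>>>>>>>>>>>       for text, box in symbols:
>>>>>>>>>>>>>>>>>>>           if box[2] - box[0] < MIN_WIDTH:
>>>>>>>>>>>>>>>>>>>               continue        # too thin: likely a phantom
>>>>>>>>>>>>>>>>>>>           if kept and x_overlap(box, kept[-1][1]) > MAX_OVERLAP:
>>>>>>>>>>>>>>>>>>>               continue        # mostly inside the previous symbol: likely a duplicate
>>>>>>>>>>>>>>>>>>>           kept.append((text, box))
>>>>>>>>>>>>>>>>>>>       return kept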
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Here is another MRZ model with training data:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> https://github.com/DoubangoTelecom/tesseractMRZ
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Lorenzo
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Il giorno mer 17 lug 2019 alle ore 11:26 Claudiu <
>>>>>>>>>>>>>>>>>>> csaf...@gmail.com> ha scritto:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I’m getting the “phantom character” issue as well, using 
>>>>>>>>>>>>>>>>>>>> the OCRB model that Shree trained on MRZ lines.  For 
>>>>>>>>>>>>>>>>>>>> example, for a 0 it will sometimes add both a 0 and an O 
>>>>>>>>>>>>>>>>>>>> to the output, thus outputting 45 characters total instead 
>>>>>>>>>>>>>>>>>>>> of 44.  I haven’t looked at the bounding-box output yet, 
>>>>>>>>>>>>>>>>>>>> but I suspect a phantom thin character is added somewhere 
>>>>>>>>>>>>>>>>>>>> that I can discard, or maybe two chars will have the same 
>>>>>>>>>>>>>>>>>>>> bounding box.  If anyone else has fixed this issue further 
>>>>>>>>>>>>>>>>>>>> up (e.g. so the output doesn’t contain the phantom 
>>>>>>>>>>>>>>>>>>>> characters in the first place), I’d be interested. 
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:01 AM ElGato ElMago <
>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I'll go back to more training later.  Before doing so, 
>>>>>>>>>>>>>>>>>>>>> I'd like to investigate the results a little bit.  The 
>>>>>>>>>>>>>>>>>>>>> hocr and lstmbox options give some details about the 
>>>>>>>>>>>>>>>>>>>>> positions of characters.  The results show positions that 
>>>>>>>>>>>>>>>>>>>>> perfectly correspond to the letters in the image, but the 
>>>>>>>>>>>>>>>>>>>>> text output contains a character that obviously does not 
>>>>>>>>>>>>>>>>>>>>> exist.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Then I found a config file, 'lstmdebug', that generates 
>>>>>>>>>>>>>>>>>>>>> far more information.  I hope it explains what happened 
>>>>>>>>>>>>>>>>>>>>> with each character.  I have yet to read the debug 
>>>>>>>>>>>>>>>>>>>>> output, but I'd appreciate it if someone could tell me 
>>>>>>>>>>>>>>>>>>>>> how to read it, because it's really complex.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Friday, June 14, 2019 at 7:58:49 PM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> See https://github.com/Shreeshrii/tessdata_MICR
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I have uploaded my files there. 
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh
>>>>>>>>>>>>>>>>>>>>>> is the bash script that runs the training.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> You can modify as needed. Please note this is for 
>>>>>>>>>>>>>>>>>>>>>> legacy/base tesseract --oem 0.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago <
>>>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks a lot, shree.  It seems you know everything.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I tried the MICR0.traineddata and the first two 
>>>>>>>>>>>>>>>>>>>>>>> mcr.traineddata files.  The last one was blocked by the 
>>>>>>>>>>>>>>>>>>>>>>> browser.  Each of the traineddata files gave mixed 
>>>>>>>>>>>>>>>>>>>>>>> results.  All of them get the symbols fairly well, but 
>>>>>>>>>>>>>>>>>>>>>>> they insert spaces randomly and read some numbers wrong.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> MICR0 seems the best among them.  Did you suggest 
>>>>>>>>>>>>>>>>>>>>>>> that you'd be able to update it?  It gets triple D very 
>>>>>>>>>>>>>>>>>>>>>>> often where there's only one, and so on.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Also, I tried to fine-tune from MICR0, but I found 
>>>>>>>>>>>>>>>>>>>>>>> that I need to change language-specific.sh.  It 
>>>>>>>>>>>>>>>>>>>>>>> specifies some parameters for each language.  Do you 
>>>>>>>>>>>>>>>>>>>>>>> have any guidance for it?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Friday, June 14, 2019 at 1:48:40 AM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> see 
>>>>>>>>>>>>>>>>>>>>>>>> http://www.devscope.net/Content/ocrchecks.aspx 
>>>>>>>>>>>>>>>>>>>>>>>> https://github.com/BigPino67/Tesseract-MICR-OCR
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ
>>>>>>>>>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <
>>>>>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> It would be nice if there were traineddata out there, 
>>>>>>>>>>>>>>>>>>>>>>>>> but I didn't find any.  I see free fonts and 
>>>>>>>>>>>>>>>>>>>>>>>>> commercial OCR software but no traineddata.  The 
>>>>>>>>>>>>>>>>>>>>>>>>> tessdata repository obviously doesn't have one, either.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Saturday, June 8, 2019 at 1:52:10 AM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Please also search for existing MICR traineddata 
>>>>>>>>>>>>>>>>>>>>>>>>>> files.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <
>>>>>>>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> So I did several tests from scratch.  In the 
>>>>>>>>>>>>>>>>>>>>>>>>>>> last attempt, I made a training text with 4,000 
>>>>>>>>>>>>>>>>>>>>>>>>>>> lines in the following 
>>>>>>>>>>>>>>>>>>>>>>>>>>> format,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> 110004310510<   <02 :4002=0181:801= 0008752 
>>>>>>>>>>>>>>>>>>>>>>>>>>> <00039 ;0000001000;
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> and combined it with eng.digits.training_text, in 
>>>>>>>>>>>>>>>>>>>>>>>>>>> which the symbols are converted to E13B symbols.  
>>>>>>>>>>>>>>>>>>>>>>>>>>> This makes about 12,000 lines of training text.  
>>>>>>>>>>>>>>>>>>>>>>>>>>> It's amazing that this generates a decent reader 
>>>>>>>>>>>>>>>>>>>>>>>>>>> out of nowhere, but it is still not very good.  For 
>>>>>>>>>>>>>>>>>>>>>>>>>>> example:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> <01 :1901=1386:021= 1111001<10001< ;0000090134;
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> is the result on the attached image.  It's close, 
>>>>>>>>>>>>>>>>>>>>>>>>>>> but the last '<' in the result text doesn't exist 
>>>>>>>>>>>>>>>>>>>>>>>>>>> in the image.  It's a small failure, but it causes 
>>>>>>>>>>>>>>>>>>>>>>>>>>> bigger trouble in parsing.
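>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> To illustrate the kind of symbol conversion I mean, 
>>>>>>>>>>>>>>>>>>>>>>>>>>> here is a rough sketch (not the exact script I 
>>>>>>>>>>>>>>>>>>>>>>>>>>> used): every non-digit, non-space character in 
>>>>>>>>>>>>>>>>>>>>>>>>>>> eng.digits.training_text is simply replaced by a 
>>>>>>>>>>>>>>>>>>>>>>>>>>> random E13B symbol stand-in.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>   import random
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>   E13B_SYMBOLS = ":;<="   # the four symbol stand-ins
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>   with open("eng.digits.training_text") as src, \
>>>>>>>>>>>>>>>>>>>>>>>>>>>        open("eng.e13b_digits.training_text", "w") as dst:
>>>>>>>>>>>>>>>>>>>>>>>>>>>       for line in src:
>>>>>>>>>>>>>>>>>>>>>>>>>>>           dst.write("".join(c if c.isdigit() or c.isspace()
>>>>>>>>>>>>>>>>>>>>>>>>>>>                             else random.choice(E13B_SYMBOLS)
>>>>>>>>>>>>>>>>>>>>>>>>>>>                             for c in line))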
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> What would you suggest from here to increase 
>>>>>>>>>>>>>>>>>>>>>>>>>>> accuracy?  
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>    - Increase the number of lines in the training text
>>>>>>>>>>>>>>>>>>>>>>>>>>>    - Mix more variations into the training text
>>>>>>>>>>>>>>>>>>>>>>>>>>>    - Increase the number of iterations
>>>>>>>>>>>>>>>>>>>>>>>>>>>    - Investigate wrong reads one by one
>>>>>>>>>>>>>>>>>>>>>>>>>>>    - Or something else?
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Also, I referred to engrestrict*.* and could 
>>>>>>>>>>>>>>>>>>>>>>>>>>> generate a similar result with the 
>>>>>>>>>>>>>>>>>>>>>>>>>>> fine-tuning-from-full method.  It seems a bit 
>>>>>>>>>>>>>>>>>>>>>>>>>>> faster to get to the same level, but it also stops 
>>>>>>>>>>>>>>>>>>>>>>>>>>> at a 'good' level.  I can go either way if it takes 
>>>>>>>>>>>>>>>>>>>>>>>>>>> me to a bright future.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 3:56:02 PM UTC+9, ElGato ElMago wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks a lot, Shree.  I'll look into it.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 2:39:52 PM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> See 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_shreetest
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Look at the files engrestrict*.* and also 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Create training text of about 100 lines and 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> finetune for 400 lines 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I had about 14 lines, as attached.  How many 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> lines would you recommend?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Fine-tuning gives a much better result, but it 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> tends to pick characters outside of E13B, which 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> only has 14 characters: 0 through 9 and 4 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> symbols.  I thought training from scratch would 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> eliminate such confusion.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 10:43:08 AM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> For training from scratch, a large training 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> text and hundreds of thousands of iterations 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> are recommended. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If you are just fine-tuning for a font, try to 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> follow the instructions for fine-tuning for 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Impact, using your font.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago, <
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks, Shree.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I saw the instructions.  The steps I 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> took are as follows:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Using tesstrain.sh:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --noextract_font_properties --langdata_dir ../langdata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --tessdata_dir ./tessdata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Training from scratch:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Test with base_checkpoint:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Combining output files:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --stop_training \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Test with eng.traineddata:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The training from scratch ended as:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, word train=0%, skip ratio=0%,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> New best char error = 0 wrote best model:/home/koichi/tesstutorial/e13
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
