I added eng.traineddata and LICENSE. I used my account name in the license file; I'm not sure whether that's appropriate, so please tell me if it isn't.
On Friday, August 9, 2019 at 16:17:41 UTC+9, Mamadou wrote:
> On Friday, August 9, 2019 at 7:31:03 AM UTC+2, ElGato ElMago wrote:
>> Here's what I'm sharing on GitHub. Hope it's of use to somebody.
>> https://github.com/ElMagoElGato/tess_e13b_training
> Thanks for sharing your experience with us.
> Is it possible to share your Tesseract model (xxx.traineddata)?
> We're building a dataset using real-life images, like what we have already done for MRZ (https://github.com/DoubangoTelecom/tesseractMRZ/tree/master/dataset). Your model would help us automate the annotation and will speed up our development. Of course we'll have to manually correct the annotations, but it will be faster for us.
> Also, please add a license to your repo so that we know if we have the right to use it.
>> On Thursday, August 8, 2019 at 9:35:17 UTC+9, ElGato ElMago wrote:
>>> OK, I'll do so. I need to reorganize naming and so on a little bit. It will be out there soon.
>>> On Wednesday, August 7, 2019 at 21:11:01 UTC+9, Mamadou wrote:
>>>> On Wednesday, August 7, 2019 at 2:36:52 AM UTC+2, ElGato ElMago wrote:
>>>>> Hi,
>>>>> I'm thinking of sharing it, of course. What is the best way to do it? After all this, my contribution is only how I prepared the training text, and even that consists of Shree's text and mine. The instructions and tools I used already exist.
>>>> If you have a GitHub account, just create a repo and publish the data and instructions.
>>>>> ElMagoElGato
>>>>> On Wednesday, August 7, 2019 at 8:20:02 UTC+9, Mamadou wrote:
>>>>>> Hello,
>>>>>> Are you planning to release the dataset or models?
>>>>>> I'm working on the same subject and planning to share both under BSD terms.
>>>>>> On Tuesday, August 6, 2019 at 10:11:40 AM UTC+2, ElGato ElMago wrote:
>>>>>>> Hi,
>>>>>>> FWIW, I got to the point where I can feel happy with the accuracy.
>>>>>>> As the images in the previous post show, the symbols, especially the on-us symbol and the amount symbol, were getting mixed up with each other or with other characters. I added many more symbols to the training text and formed words that start with a symbol. One example is as follows:
>>>>>>> 9;:;=;<;< <0<1<3<4;6;8;9;:;=;
>>>>>>> I randomly made 8,000 lines like this. In fine-tuning from eng, 5,000 iterations were almost enough. The amount symbol is still confused a little when it's followed by a 0. Fine-tuning tends to get dragged around by small details. I'll have to think of something to make further improvements.
>>>>>>> Training from scratch produced slightly more stable traineddata. It doesn't get confused by symbols so often, but it tends to generate extra spaces. By 10,000 iterations, those spaces were gone and recognition became very solid.
>>>>>>> I thought I might have to do image and box file training, but I guess it's not needed this time.
>>>>>>> ElMagoElGato
>>>>>>> On Friday, July 26, 2019 at 14:08:06 UTC+9, ElGato ElMago wrote:
>>>>>>>> Hi,
>>>>>>>> Well, I read the description of ScrollView (https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging) and it says:
>>>>>>>> To show the characters, deselect DISPLAY/Bounding Boxes, select DISPLAY/Polygonal Approx and then select OTHER/Uniform display.
>>>>>>>> It basically works. But for some reason, it doesn't work on my e13b image and ends up with a blue screen. Anyway, it shows each box separately when a character consists of multiple boxes. I'd like to show the box for the whole character. ScrollView doesn't do that, at least not yet. I'll do it on my own.
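For what it's worth, the randomized symbol-seeded training text described above could be generated with something like this (my own sketch, not the script actually used; it assumes the four E13B symbols are transcribed as the ASCII characters ';', ':', '<' and '=', as in the sample line):

```python
import random

DIGITS = "0123456789"
SYMBOLS = ";:<="  # assumed ASCII stand-ins for the four E13B symbols

def make_word(max_extra=8):
    # Start each word with a symbol so the model also sees symbols in the
    # leading position, then append a random mix of digits and symbols.
    chars = [random.choice(SYMBOLS)]
    for _ in range(random.randint(1, max_extra)):
        chars.append(random.choice(DIGITS + SYMBOLS))
    return "".join(chars)

def make_training_text(n_lines=8000, words_per_line=4):
    # One training-text line per output line, e.g. "9;:;=;<;< <0<1<3<4;6;8;..."
    return "\n".join(
        " ".join(make_word() for _ in range(words_per_line))
        for _ in range(n_lines))
```

Varying the word count and length per line would give the text2image step more layout variety.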
>>>>>>>> ElMagoElGato
>>>>>>>> On Wednesday, July 24, 2019 at 14:10:46 UTC+9, ElGato ElMago wrote:
>>>>>>>>> Hi,
>>>>>>>>> I got this result from hocr. This is where one of the phantom characters comes from.
>>>>>>>>> <span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 98.864532'><</span>
>>>>>>>>> <span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; x_conf 99.018097'>;</span>
>>>>>>>>> The first character is the phantom. It starts at the same x position as the second character, which does exist; the phantom is only 3 points wide. I attach ScrollView screenshots that visualize this.
>>>>>>>>> [image: 2019-07-24-132643_854x707_scrot.png][image: 2019-07-24-132800_854x707_scrot.png]
>>>>>>>>> There seem to be some more cases that cause phantom characters. I'll look into them. But I have a trivial question now. I made ScrollView show these displays by accidentally clicking the Display->Blamer menu. There is a Bounding Boxes menu below, but it ends up showing a blue screen, though it briefly shows boxes on the way. Can I use this menu at all? It would be very useful.
>>>>>>>>> [image: 2019-07-24-140739_854x707_scrot.png]
>>>>>>>>> On Tuesday, July 23, 2019 at 17:10:36 UTC+9, ElGato ElMago wrote:
>>>>>>>>>> It's great! Perfect! Thanks a lot!
>>>>>>>>>> On Tuesday, July 23, 2019 at 10:56:58 UTC+9, shree wrote:
>>>>>>>>>>> See https://github.com/tesseract-ocr/tesseract/issues/2580
>>>>>>>>>>> On Tue, 23 Jul 2019, 06:23 ElGato ElMago, <elmago...@gmail.com> wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> I read the output of hocr with lstm_choice_mode = 4 per pull request 2554.
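Those ocrx_cinfo spans are easy to mine for suspiciously thin boxes like the 3-point phantom '<'. A sketch (it assumes exactly the title format quoted above; real hocr output HTML-escapes '<' as &lt;, which this toy regex does not handle):

```python
import re

# Matches spans of the form:
# <span class='ocrx_cinfo' title='x_bboxes x0 y0 x1 y1; x_conf c'>ch</span>
SPAN_RE = re.compile(
    r"<span class='ocrx_cinfo' title='x_bboxes (\d+) (\d+) (\d+) (\d+); "
    r"x_conf ([\d.]+)'>(.*?)</span>")

def parse_cinfo(hocr):
    # Extract per-character boxes, confidences, and widths.
    out = []
    for m in SPAN_RE.finditer(hocr):
        x0, y0, x1, y1 = map(int, m.group(1, 2, 3, 4))
        out.append({"char": m.group(6), "conf": float(m.group(5)),
                    "box": (x0, y0, x1, y1), "width": x1 - x0})
    return out

def flag_phantoms(chars, min_width=5):
    # A box only a few pixels wide (like the 3-point '<' above) is
    # likely a phantom duplicate of its neighbour; min_width is a
    # threshold you would tune for your image resolution.
    return [c for c in chars if c["width"] < min_width]
```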
It shows the candidates for each character but doesn't show the bounding box of each character. It only shows the box for a whole word.
>>>>>>>>>>>> I see bounding boxes for each character in the comments of pull request 2576. How can I do that? Do I have to look in the source code and produce such output on my own?
>>>>>>>>>>>> On Friday, July 19, 2019 at 18:40:49 UTC+9, ElGato ElMago wrote:
>>>>>>>>>>>>> Lorenzo,
>>>>>>>>>>>>> I haven't been checking psm too much. I will turn to those options after I see how it goes with bounding boxes.
>>>>>>>>>>>>> Shree,
>>>>>>>>>>>>> I see the merges in the git log and also see that the new option lstm_choice_amount works now. I guess my executable is the latest, though I still see the phantom character. hocr makes huge and complex output; I'll take some time to read it.
>>>>>>>>>>>>> On Friday, July 19, 2019 at 18:20:55 UTC+9, Claudiu wrote:
>>>>>>>>>>>>>> Is there any way to pass bounding boxes for the LSTM to use? We have an algorithm that cleanly gets bounding boxes of MRZ characters. However, the results using psm 10 are worse than passing the whole line in. Yet when we pass the whole line in, we get these phantom characters.
>>>>>>>>>>>>>> Should PSM 10 mode work? It often returns “no character” where there clearly is one. I can supply a test case if it is expected to work well.
>>>>>>>>>>>>>> On Fri, Jul 19, 2019 at 11:06 AM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>> Lorenzo,
>>>>>>>>>>>>>>> We both have got the same case.
It seems a solution to this problem would help a lot of people.
>>>>>>>>>>>>>>> Shree,
>>>>>>>>>>>>>>> I pulled the current head of the master branch, but it doesn't seem to contain the merges you pointed to that were merged 3 to 4 days ago. How can I get them?
>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>> On Friday, July 19, 2019 at 17:02:53 UTC+9, Lorenzo Blz wrote:
>>>>>>>>>>>>>>>> PSM 7 was a partial solution for my specific case; it improved the situation but did not solve it. Also, I could not use it in some other cases.
>>>>>>>>>>>>>>>> The proper solution is very likely doing more training with more data; some data augmentation might help if data is scarce. Also, doing less training might help if the training is not done correctly.
>>>>>>>>>>>>>>>> There are also similar issues on GitHub:
>>>>>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1465
>>>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>>>> The LSTM engine works like this: it scans the image and for each "pixel column" outputs this:
>>>>>>>>>>>>>>>> M M M M N M M M [BLANK] F F F F
>>>>>>>>>>>>>>>> (here I report only the highest-probability characters)
>>>>>>>>>>>>>>>> In the example above, an M is partially seen as an N. This is normal, and another step of the algorithm (beam search, I think) tries to aggregate the correct characters back.
>>>>>>>>>>>>>>>> I think cases like this:
>>>>>>>>>>>>>>>> M M M N N N M M
>>>>>>>>>>>>>>>> are what give the phantom characters.
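The collapse step described above can be illustrated with a CTC-style greedy decoder (a toy illustration, not Tesseract's actual beam search): runs of the same label merge, and blanks separate genuine repeats, so a run like M M M N N N M M yields an extra N.

```python
def ctc_greedy_collapse(columns, blank="[BLANK]"):
    # Collapse per-column best guesses CTC-style: merge runs of the
    # same label and drop blanks. A repeated character survives only
    # when a blank separates the two runs.
    out = []
    prev = None
    for label in columns:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return "".join(out)
```

This is why a brief mid-glyph misreading (N columns inside an M) decodes to a phantom character, while the blank before F keeps it as a single F.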
More training should reduce the source of the problem, or a painful analysis of the bounding boxes might fix some cases.
>>>>>>>>>>>>>>>> I used the attached script for the boxes.
>>>>>>>>>>>>>>>> Lorenzo
>>>>>>>>>>>>>>>> On Fri, 19 Jul 2019 at 07:25, ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>> Let's call them phantom characters then.
>>>>>>>>>>>>>>>>> Was psm 7 the solution for issue 1778? None of the psm options solved my problem, though I see different output.
>>>>>>>>>>>>>>>>> I mostly use tesseract 5.0-alpha, but 4.1 showed the same results anyway. How did you get the bounding box for each character? Alto and lstmbox only show the bbox for a group of characters.
>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>> On Wednesday, July 17, 2019 at 18:58:31 UTC+9, Lorenzo Blz wrote:
>>>>>>>>>>>>>>>>>> Phantom characters here for me too:
>>>>>>>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1778
>>>>>>>>>>>>>>>>>> Are you using 4.1? Bounding boxes were fixed in 4.1; maybe this was also improved.
>>>>>>>>>>>>>>>>>> I wrote some code that uses the symbol iterator to discard symbols that are clearly duplicated: too small, overlapping, etc. But it was not easy to make it work decently, and it is not 100% reliable, with false negatives and positives. I cannot share the code, and it is quite ugly anyway.
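A filter along the lines described above might look roughly like this (my own sketch with made-up thresholds, not Lorenzo's code): drop a symbol whose box is only a few pixels wide, and when two adjacent symbols' boxes mostly overlap, keep the more confident one.

```python
def overlap_ratio(a, b):
    # Fraction of box a's area covered by box b; boxes are (x0, y0, x1, y1).
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    area = (a[2] - a[0]) * (a[3] - a[1])
    return ix * iy / area if area else 1.0

def discard_duplicates(symbols, min_width=4, max_overlap=0.8):
    # symbols: list of (char, conf, box) tuples in reading order,
    # e.g. collected from Tesseract's symbol iterator.
    kept = []
    for char, conf, box in symbols:
        if box[2] - box[0] < min_width:
            continue  # too thin to be a real glyph
        if kept and overlap_ratio(box, kept[-1][2]) > max_overlap:
            # Mostly inside the previous symbol: keep the more confident one.
            if conf > kept[-1][1]:
                kept[-1] = (char, conf, box)
            continue
        kept.append((char, conf, box))
    return kept
```

As the message says, heuristics like these are not 100% reliable; the thresholds trade false positives against false negatives.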
>>>>>>>>>>>>>>>>>> Here is another MRZ model with training data:
>>>>>>>>>>>>>>>>>> https://github.com/DoubangoTelecom/tesseractMRZ
>>>>>>>>>>>>>>>>>> Lorenzo
>>>>>>>>>>>>>>>>>> On Wed, 17 Jul 2019 at 11:26, Claudiu <csaf...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>> I’m getting the “phantom character” issue as well, using the OCRB that Shree trained on MRZ lines. For example, for a 0 it will sometimes add both a 0 and an O to the output, thus outputting 45 characters total instead of 44. I haven’t looked at the bounding box output yet, but I suspect a phantom thin character is added somewhere that I can discard, or maybe two chars will have the same bounding box. If anyone else has fixed this issue further up (e.g. so the output doesn’t contain the phantom characters in the first place), I’d be interested.
>>>>>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:01 AM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>> I'll go back to more training later. Before doing so, I'd like to investigate the results a little bit. The hocr and lstmbox options give some details of the positions of characters. The results show positions that perfectly correspond to letters in the image. But the text output contains a character that obviously does not exist.
>>>>>>>>>>>>>>>>>>>> Then I found a config file 'lstmdebug' that generates far more information. I hope it explains what happened with each character. I've yet to read the debug output, but I'd appreciate it if someone could tell me how to read it, because it's really complex.
>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>> On Friday, June 14, 2019 at 19:58:49 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>> See https://github.com/Shreeshrii/tessdata_MICR
>>>>>>>>>>>>>>>>>>>>> I have uploaded my files there.
>>>>>>>>>>>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh is the bash script that runs the training.
>>>>>>>>>>>>>>>>>>>>> You can modify it as needed. Please note this is for legacy/base tesseract --oem 0.
>>>>>>>>>>>>>>>>>>>>> On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>> Thanks a lot, shree. It seems you know everything.
>>>>>>>>>>>>>>>>>>>>>> I tried the MICR0.traineddata and the first two mcr.traineddata. The last one was blocked by the browser. Each of the traineddata had mixed results. All of them get the symbols fairly well but insert spaces randomly and read some numbers wrong.
>>>>>>>>>>>>>>>>>>>>>> MICR0 seems the best among them. Did you suggest that you'd be able to update it?
It gets triple D very often where there's only one, and so on.
>>>>>>>>>>>>>>>>>>>>>> Also, I tried to fine-tune from MICR0, but I found that I need to change language-specific.sh. It specifies some parameters for each language. Do you have any guidance for it?
>>>>>>>>>>>>>>>>>>>>>> On Friday, June 14, 2019 at 1:48:40 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>> see
>>>>>>>>>>>>>>>>>>>>>>> http://www.devscope.net/Content/ocrchecks.aspx
>>>>>>>>>>>>>>>>>>>>>>> https://github.com/BigPino67/Tesseract-MICR-OCR
>>>>>>>>>>>>>>>>>>>>>>> https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ
>>>>>>>>>>>>>>>>>>>>>>> On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> It would be nice if there were traineddata out there, but I didn't find any. I see free fonts and commercial OCR software but no traineddata. The tessdata repository obviously doesn't have one, either.
>>>>>>>>>>>>>>>>>>>>>>>> On Saturday, June 8, 2019 at 1:52:10 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> Please also search for existing MICR traineddata files.
>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>> So I did several tests from scratch.
In the last attempt, I made a training text with 4,000 lines in the following format,
>>>>>>>>>>>>>>>>>>>>>>>>>> 110004310510< <02 :4002=0181:801= 0008752 <00039 ;0000001000;
>>>>>>>>>>>>>>>>>>>>>>>>>> and combined it with eng.digits.training_text, in which the symbols are converted to E13B symbols. This makes about 12,000 lines of training text. It's amazing that this generates a good reader out of nowhere. But it is still not very good. For example:
>>>>>>>>>>>>>>>>>>>>>>>>>> <01 :1901=1386:021= 1111001<10001< ;0000090134;
>>>>>>>>>>>>>>>>>>>>>>>>>> is the result on the attached image. It's close, but the last '<' in the result text doesn't exist in the image. It's a small failure, but it causes greater trouble in parsing.
>>>>>>>>>>>>>>>>>>>>>>>>>> What would you suggest from here to increase accuracy?
>>>>>>>>>>>>>>>>>>>>>>>>>> - Increase the number of lines in the training text
>>>>>>>>>>>>>>>>>>>>>>>>>> - Mix up more variations in the training text
>>>>>>>>>>>>>>>>>>>>>>>>>> - Increase the number of iterations
>>>>>>>>>>>>>>>>>>>>>>>>>> - Investigate wrong reads one by one
>>>>>>>>>>>>>>>>>>>>>>>>>> - Or something else?
>>>>>>>>>>>>>>>>>>>>>>>>>> Also, I referred to engrestrict*.* and could generate a similar result with the fine-tuning-from-full method.
It seems a bit faster to get to the same level, but it also stops at a 'good' level. I can go either way if it takes me to the bright future.
>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 15:56:02 UTC+9, ElGato ElMago wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks a lot, Shree. I'll look into it.
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 14:39:52 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>> See https://github.com/Shreeshrii/tessdata_shreetest
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Look at the files engrestrict*.* and also https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Create a training text of about 100 lines and fine-tune for 400 lines
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I had about 14 lines, as attached. How many lines would you recommend?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Fine-tuning gives a much better result, but it tends to pick characters other than those in E13B, which only has 14 characters: 0 through 9 and 4 symbols. I thought training from scratch would eliminate such confusion.
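Since E13B has only those 14 characters, one cheap post-processing guard is to flag any recognized character outside the set (a sketch; it assumes the four symbols are transcribed as ';', ':', '<', '=' and that spaces separate fields):

```python
# The 14-character E13B set: ten digits plus the four symbols,
# here assumed to be mapped to the ASCII characters ';', ':', '<', '='.
E13B_CHARS = set("0123456789;:<=")

def invalid_chars(line):
    # Characters in the OCR output that cannot occur in E13B;
    # spaces are tolerated as field separators.
    return [c for c in line if c not in E13B_CHARS and c != " "]
```

A line with a non-empty result (for example an 'O' confused with '0') can then be rejected or sent for manual review instead of silently breaking the parser.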
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 10:43:08 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> For training from scratch, a large training text and hundreds of thousands of iterations are recommended.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> If you are just fine-tuning for a font, try to follow the instructions for training for impact, with your font.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago, <elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks, Shree.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I saw the instructions. The steps I took are as follows:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Using tesstrain.sh:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --noextract_font_properties --langdata_dir ../langdata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --tessdata_dir ./tessdata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Training from scratch:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata
~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Test with base_checkpoint:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Combining output files:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --stop_training \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Test with eng.traineddata:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The training from scratch ended as:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, word train=0%, skip ratio=0%, New best char error = 0 wrote best model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint wrote checkpoint.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The test with base_checkpoint returns nothing:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> The test with eng.traineddata and e13b.png returns out.txt. Both files are attached.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Training seems to have worked fine. I don't know how to interpret the test result from base_checkpoint. The generated eng.traineddata obviously doesn't work well. I suspect the choice of --traineddata when combining the output files is bad, but I have no clue.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> BTW, I referred to your tess4training in the process. It helped a lot.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday, May 29, 2019 at 19