On Wednesday, August 7, 2019 at 2:36:52 AM UTC+2, ElGato ElMago wrote:
>
> Hi,
>
> I'm thinking of sharing it, of course.  What is the best way to do it?  
> After all this, my only contribution is how I prepared the training 
> text, and even that consists of Shree's text and mine.  The 
> instructions and tools I used already exist.
>
If you have a Github account just create a repo and publish the data and 
instructions. 

>
> ElMagoElGato
>
> On Wednesday, August 7, 2019 at 8:20:02 AM UTC+9, Mamadou wrote:
>
>> Hello,
>> Are you planning to release the dataset or models?
>> I'm working on the same subject and planning to share both under BSD terms
>>
>> On Tuesday, August 6, 2019 at 10:11:40 AM UTC+2, ElGato ElMago wrote:
>>>
>>> Hi,
>>>
>>> FWIW, I got to the point where I can feel happy with the accuracy.  As 
>>> the images in the previous post show, the symbols, especially the on-us 
>>> and amount symbols, were getting mixed up with each other or with other 
>>> characters.  I added many more symbols to the training text and formed 
>>> words that start with a symbol.  One example is as follows:
>>>
>>> 9;:;=;<;< <0<1<3<4;6;8;9;:;=;
>>>
>>>
>>> I randomly made 8,000 lines like this.  In fine-tuning from eng, 5,000 
>>> iterations were almost enough.  The amount symbol is still confused a 
>>> little when it's followed by 0.  Fine-tuning tends to get dragged around 
>>> by small details.  I'll have to think of something to improve it further.
>>>
>>> Training from scratch produced somewhat more stable traineddata.  It 
>>> doesn't confuse the symbols as often but tends to generate extra spaces.  
>>> By 10,000 iterations, those spaces were gone and recognition became 
>>> very solid.
>>>
>>> I thought I might have to do image and box file training but I guess 
>>> it's not needed this time.
>>>
>>> ElMagoElGato
>>>
>>> On Friday, July 26, 2019 at 2:08:06 PM UTC+9, ElGato ElMago wrote:
>>>>
>>>> Hi,
>>>>
>>>> Well, I read the description of ScrollView (
>>>> https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging) and 
>>>> it says:
>>>>
>>>> To show the characters, deselect DISPLAY/Bounding Boxes, select 
>>>> DISPLAY/Polygonal Approx and then select OTHER/Uniform display.
>>>>
>>>>
>>>> It basically works.  But for some reason, it doesn't work on my e13b 
>>>> image and ends up with a blue screen.  Anyway, it shows each box 
>>>> separately when a character consists of multiple boxes.  I'd like to 
>>>> show the box for the whole character.  ScrollView doesn't do that, at 
>>>> least not yet.  I'll do it on my own.
>>>>
>>>> ElMagoElGato
>>>>
>>>> On Wednesday, July 24, 2019 at 2:10:46 PM UTC+9, ElGato ElMago wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>> I got this result from hocr.  This is where one of the phantom 
>>>>> characters comes from.
>>>>>
>>>>> <span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 
>>>>> 98.864532'>&lt;</span>
>>>>> <span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; x_conf 
>>>>> 99.018097'>;</span>
>>>>>
>>>>>
>>>>> The first character is the phantom.  It starts at the same point on 
>>>>> the x axis as the second character, which really exists.  The phantom 
>>>>> is only 3 pixels wide.  I attach ScrollView screenshots that visualize 
>>>>> this.
>>>>>
>>>>> [image: 2019-07-24-132643_854x707_scrot.png][image: 
>>>>> 2019-07-24-132800_854x707_scrot.png]
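A quick way to spot such phantoms programmatically is to parse the x_bboxes out of the hocr and flag implausibly narrow boxes. A sketch (the 5-pixel threshold is an illustrative guess, not a Tesseract constant):

```python
import re

# Matches the ocrx_cinfo spans shown above: four bbox coords, a confidence,
# and the glyph text.
CINFO = re.compile(
    r"title='x_bboxes (\d+) (\d+) (\d+) (\d+); x_conf ([\d.]+)'>(.*?)</span>")

def find_phantoms(hocr, min_width=5):
    """Return (glyph, width) for every box narrower than min_width pixels."""
    phantoms = []
    for m in CINFO.finditer(hocr):
        x0, y0, x1, y1 = map(int, m.group(1, 2, 3, 4))
        if x1 - x0 < min_width:
            phantoms.append((m.group(6), x1 - x0))
    return phantoms

# The two spans from the post: the 3-px-wide '&lt;' is the phantom.
hocr = ("<span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; "
        "x_conf 98.864532'>&lt;</span>"
        "<span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; "
        "x_conf 99.018097'>;</span>")
print(find_phantoms(hocr))  # -> [('&lt;', 3)]
```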
>>>>>
>>>>>
>>>>> There seem to be some more cases that cause phantom characters.  I'll 
>>>>> look into them.  But I have a trivial question now.  I made ScrollView 
>>>>> show these displays by accidentally clicking the Display->Blamer menu.  
>>>>> There is a Bounding Boxes menu below it, but it ends up showing a blue 
>>>>> screen, though it briefly shows boxes along the way.  Can I use this 
>>>>> menu at all?  It would be very useful.
>>>>>
>>>>> [image: 2019-07-24-140739_854x707_scrot.png]
>>>>>
>>>>>
>>>>> On Tuesday, July 23, 2019 at 5:10:36 PM UTC+9, ElGato ElMago wrote:
>>>>>>
>>>>>> It's great! Perfect!  Thanks a lot!
>>>>>>
>>>>>> On Tuesday, July 23, 2019 at 10:56:58 AM UTC+9, shree wrote:
>>>>>>>
>>>>>>> See https://github.com/tesseract-ocr/tesseract/issues/2580
>>>>>>>
>>>>>>> On Tue, 23 Jul 2019, 06:23 ElGato ElMago, <elmago...@gmail.com> 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I read the hocr output with lstm_choice_mode = 4, as described in 
>>>>>>>> pull request 2554.  It shows the candidates for each character but 
>>>>>>>> doesn't show the bounding box of each character.  It only shows the 
>>>>>>>> box for a whole word.
>>>>>>>>
>>>>>>>> I see bounding boxes for each character in the comments of pull 
>>>>>>>> request 2576.  How can I get that?  Do I have to look into the 
>>>>>>>> source code and produce such output on my own?
>>>>>>>>
>>>>>>>> On Friday, July 19, 2019 at 6:40:49 PM UTC+9, ElGato ElMago wrote:
>>>>>>>>
>>>>>>>>> Lorenzo,
>>>>>>>>>
>>>>>>>>> I haven't been checking psm much.  I'll turn to those options 
>>>>>>>>> after I see how it goes with the bounding boxes.
>>>>>>>>>
>>>>>>>>> Shree,
>>>>>>>>>
>>>>>>>>> I see the merges in the git log and also see that the new option 
>>>>>>>>> lstm_choice_amount works now.  I guess my executable is the latest, 
>>>>>>>>> though I still see the phantom character.  Hocr produces huge, 
>>>>>>>>> complex output.  I'll take some time to read it.
>>>>>>>>>
>>>>>>>>> On Friday, July 19, 2019 at 6:20:55 PM UTC+9, Claudiu wrote:
>>>>>>>>>>
>>>>>>>>>> Is there any way to pass bounding boxes to the LSTM?  We have an 
>>>>>>>>>> algorithm that cleanly gets bounding boxes of MRZ characters.  
>>>>>>>>>> However, the results using psm 10 are worse than passing the 
>>>>>>>>>> whole line in.  Yet when we pass the whole line in we get these 
>>>>>>>>>> phantom characters. 
>>>>>>>>>>
>>>>>>>>>> Should PSM 10 mode work? It often returns “no character” where 
>>>>>>>>>> there clearly is one. I can supply a test case if it is expected to 
>>>>>>>>>> work 
>>>>>>>>>> well. 
>>>>>>>>>>
>>>>>>>>>> On Fri, Jul 19, 2019 at 11:06 AM ElGato ElMago <
>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Lorenzo,
>>>>>>>>>>>
>>>>>>>>>>> We both have the same case.  It seems a solution to this 
>>>>>>>>>>> problem would help a lot of people.
>>>>>>>>>>>
>>>>>>>>>>> Shree,
>>>>>>>>>>>
>>>>>>>>>>> I pulled the current head of the master branch, but it doesn't 
>>>>>>>>>>> seem to contain the merges you pointed to, which were merged 3 
>>>>>>>>>>> to 4 days ago.  How can I get them?
>>>>>>>>>>>
>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>
>>>>>>>>>>> On Friday, July 19, 2019 at 5:02:53 PM UTC+9, Lorenzo Blz wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> PSM 7 was a partial solution for my specific case: it improved 
>>>>>>>>>>>> the situation but did not solve it.  Also, I could not use it 
>>>>>>>>>>>> in some other cases.
>>>>>>>>>>>>
>>>>>>>>>>>> The proper solution is very likely doing more training with 
>>>>>>>>>>>> more data; some data augmentation might help if data is scarce.  
>>>>>>>>>>>> Doing less training might also help if the training is not done 
>>>>>>>>>>>> correctly.
>>>>>>>>>>>>
>>>>>>>>>>>> There are also similar issues on github:
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1465
>>>>>>>>>>>> ...
>>>>>>>>>>>>
>>>>>>>>>>>> The LSTM engine works like this: it scans the image and for 
>>>>>>>>>>>> each "pixel column" outputs a best label:
>>>>>>>>>>>>
>>>>>>>>>>>> M M M M N M M M [BLANK] F F F F
>>>>>>>>>>>>
>>>>>>>>>>>> (here I report only the highest-probability characters)
>>>>>>>>>>>>
>>>>>>>>>>>> In the example above an M is partially seen as an N.  This is 
>>>>>>>>>>>> normal, and another step of the algorithm (beam search, I 
>>>>>>>>>>>> think) tries to aggregate the correct characters back.
>>>>>>>>>>>>
>>>>>>>>>>>> I think cases like this:
>>>>>>>>>>>>
>>>>>>>>>>>> M M M N N N M M
>>>>>>>>>>>>
>>>>>>>>>>>> are what give the phantom characters.  More training should 
>>>>>>>>>>>> reduce the source of the problem, or a painful analysis of the 
>>>>>>>>>>>> bounding boxes might fix some cases.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I used the attached script for the boxes.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Lorenzo
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jul 19, 2019 at 7:25 AM ElGato ElMago <
>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let's call them phantom characters then.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Was psm 7 the solution for issue 1778?  None of the psm 
>>>>>>>>>>>>> options solved my problem, though I see different output.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I use tesseract 5.0-alpha mostly, but 4.1 showed the same 
>>>>>>>>>>>>> results anyway.  How did you get the bounding box for each 
>>>>>>>>>>>>> character?  Alto and lstmbox only show the bbox for a group of 
>>>>>>>>>>>>> characters.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wednesday, July 17, 2019 at 6:58:31 PM UTC+9, Lorenzo Blz wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Phantom characters here for me too:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1778
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Are you using 4.1?  Bounding boxes were fixed in 4.1; maybe 
>>>>>>>>>>>>>> this was also improved.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I wrote some code that uses the symbol iterator to discard 
>>>>>>>>>>>>>> symbols that are clearly duplicated: too small, overlapping, 
>>>>>>>>>>>>>> etc.  But it was not easy to make it work decently, and it is 
>>>>>>>>>>>>>> not 100% reliable, with false negatives and positives.  I 
>>>>>>>>>>>>>> cannot share the code, and it is quite ugly anyway.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here there is another MRZ model with training data:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://github.com/DoubangoTelecom/tesseractMRZ
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Lorenzo
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Claudiu <
>>>>>>>>>>>>>> csaf...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I’m getting the “phantom character” issue as well, using the 
>>>>>>>>>>>>>>> OCRB model that Shree trained on MRZ lines.  For example, 
>>>>>>>>>>>>>>> for a 0 it will sometimes add both a 0 and an O to the 
>>>>>>>>>>>>>>> output, thus outputting 45 characters total instead of 44.  
>>>>>>>>>>>>>>> I haven’t looked at the bounding box output yet, but I 
>>>>>>>>>>>>>>> suspect a phantom thin character is added somewhere that I 
>>>>>>>>>>>>>>> can discard, or maybe two chars will have the same bounding 
>>>>>>>>>>>>>>> box.  If anyone else has fixed this issue further up (e.g. 
>>>>>>>>>>>>>>> so the output doesn’t contain the phantom characters in the 
>>>>>>>>>>>>>>> first place) I’d be interested. 
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:01 AM ElGato ElMago <
>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'll go back to more training later.  Before doing so, I'd 
>>>>>>>>>>>>>>>> like to investigate the results a little.  The hocr and 
>>>>>>>>>>>>>>>> lstmbox options give some details of character positions.  
>>>>>>>>>>>>>>>> The results show positions that perfectly correspond to 
>>>>>>>>>>>>>>>> the letters in the image.  But the text output contains a 
>>>>>>>>>>>>>>>> character that obviously does not exist.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Then I found a config file, 'lstmdebug', that generates far 
>>>>>>>>>>>>>>>> more information.  I hope it explains what happened with 
>>>>>>>>>>>>>>>> each character.  I have yet to read the debug output, but 
>>>>>>>>>>>>>>>> I'd appreciate it if someone could tell me how to read it, 
>>>>>>>>>>>>>>>> because it's really complex.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Friday, June 14, 2019 at 7:58:49 PM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> See https://github.com/Shreeshrii/tessdata_MICR
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have uploaded my files there. 
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh
>>>>>>>>>>>>>>>>> is the bash script that runs the training.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> You can modify as needed. Please note this is for 
>>>>>>>>>>>>>>>>> legacy/base tesseract --oem 0.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago <
>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks a lot, shree.  It seems you know everything.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I tried the MICR0.traineddata and the first two 
>>>>>>>>>>>>>>>>>> mcr.traineddata.  The last one was blocked by the 
>>>>>>>>>>>>>>>>>> browser.  Each of the traineddata files had mixed 
>>>>>>>>>>>>>>>>>> results.  All of them get the symbols fairly well but 
>>>>>>>>>>>>>>>>>> insert spaces randomly and read some numbers wrong.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> MICR0 seems the best among them.  Were you suggesting 
>>>>>>>>>>>>>>>>>> that you'd be able to update it?  It reads a triple D 
>>>>>>>>>>>>>>>>>> very often where there's only one, and so on.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Also, I tried to fine-tune from MICR0 but found that I 
>>>>>>>>>>>>>>>>>> need to change language-specific.sh.  It specifies some 
>>>>>>>>>>>>>>>>>> parameters for each language.  Do you have any guidance 
>>>>>>>>>>>>>>>>>> for it?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Friday, June 14, 2019 at 1:48:40 AM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> see 
>>>>>>>>>>>>>>>>>>> http://www.devscope.net/Content/ocrchecks.aspx 
>>>>>>>>>>>>>>>>>>> https://github.com/BigPino67/Tesseract-MICR-OCR
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ
>>>>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <
>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> That would be nice if there were traineddata out there, 
>>>>>>>>>>>>>>>>>>>> but I didn't find any.  I see free fonts and commercial 
>>>>>>>>>>>>>>>>>>>> OCR software but no traineddata.  The tessdata 
>>>>>>>>>>>>>>>>>>>> repository obviously doesn't have one, either.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Saturday, June 8, 2019 at 1:52:10 AM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Please also search for existing MICR traineddata files.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <
>>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> So I did several tests from scratch.  In the last 
>>>>>>>>>>>>>>>>>>>>>> attempt, I made a training text with 4,000 lines in the 
>>>>>>>>>>>>>>>>>>>>>> following format,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> 110004310510<   <02 :4002=0181:801= 0008752 <00039 
>>>>>>>>>>>>>>>>>>>>>> ;0000001000;
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> and combined it with eng.digits.training_text, in 
>>>>>>>>>>>>>>>>>>>>>> which the symbols are converted to E13B symbols.  
>>>>>>>>>>>>>>>>>>>>>> This makes about 12,000 lines of training text.  It's 
>>>>>>>>>>>>>>>>>>>>>> amazing that this generates a good reader out of 
>>>>>>>>>>>>>>>>>>>>>> nowhere.  But it's still not good enough.  For example:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> <01 :1901=1386:021= 1111001<10001< ;0000090134;
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> is the result on the attached image.  It's close, 
>>>>>>>>>>>>>>>>>>>>>> but the last '<' in the result text doesn't exist in 
>>>>>>>>>>>>>>>>>>>>>> the image.  It's a small failure, but it causes 
>>>>>>>>>>>>>>>>>>>>>> greater trouble in parsing.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> What would you suggest from here to increase 
>>>>>>>>>>>>>>>>>>>>>> accuracy?  
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>    - Increase the number of lines in the training 
>>>>>>>>>>>>>>>>>>>>>>    text
>>>>>>>>>>>>>>>>>>>>>>    - Mix up more variations in the training text
>>>>>>>>>>>>>>>>>>>>>>    - Increase the number of iterations
>>>>>>>>>>>>>>>>>>>>>>    - Investigate wrong reads one by one
>>>>>>>>>>>>>>>>>>>>>>    - Or else?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Also, I referred to engrestrict*.* and could generate 
>>>>>>>>>>>>>>>>>>>>>> a similar result with the fine-tuning-from-full 
>>>>>>>>>>>>>>>>>>>>>> method.  It seems a bit faster to get to the same 
>>>>>>>>>>>>>>>>>>>>>> level, but it also stops at a merely 'good' level.  I 
>>>>>>>>>>>>>>>>>>>>>> can go either way if it takes me to the bright future.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 3:56:02 PM UTC+9, ElGato ElMago wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks a lot, Shree.  I'll look into it.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 2:39:52 PM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> See 
>>>>>>>>>>>>>>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_shreetest
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Look at the files engrestrict*.* and also 
>>>>>>>>>>>>>>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Create training text of about 100 lines and 
>>>>>>>>>>>>>>>>>>>>>>>> finetune for 400 lines 
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <
>>>>>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> I had about 14 lines as attached.  How many lines 
>>>>>>>>>>>>>>>>>>>>>>>>> would you recommend?
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Fine-tuning gives a much better result, but it 
>>>>>>>>>>>>>>>>>>>>>>>>> tends to pick characters outside E13B, which has 
>>>>>>>>>>>>>>>>>>>>>>>>> only 14 characters: 0 through 9 and 4 symbols.  I 
>>>>>>>>>>>>>>>>>>>>>>>>> thought training from scratch would eliminate such 
>>>>>>>>>>>>>>>>>>>>>>>>> confusion.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 10:43:08 AM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> For training from scratch a large training text 
>>>>>>>>>>>>>>>>>>>>>>>>>> and hundreds of thousands of iterations are 
>>>>>>>>>>>>>>>>>>>>>>>>>> recommended. 
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> If you are just fine-tuning for a font, try to 
>>>>>>>>>>>>>>>>>>>>>>>>>> follow the instructions for training for Impact, 
>>>>>>>>>>>>>>>>>>>>>>>>>> with your font.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago, <
>>>>>>>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks, Shree.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I saw the instructions.  The steps I took 
>>>>>>>>>>>>>>>>>>>>>>>>>>> are as follows:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Using tesstrain.sh:
>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>>>>>>>>>>>>>>>>>>>>>>>>>>>   --noextract_font_properties --langdata_dir ../langdata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>   --tessdata_dir ./tessdata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>>>>>>>>>>>>>>>>>>>>>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Training from scratch:
>>>>>>>>>>>>>>>>>>>>>>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>>>>>>>>>>>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
>>>>>>>>>>>>>>>>>>>>>>>>>>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>>>>>>>>>>>>>>   --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Test with base_checkpoint:
>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Combining output files:
>>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --stop_training \
>>>>>>>>>>>>>>>>>>>>>>>>>>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Test with eng.traineddata:
>>>>>>>>>>>>>>>>>>>>>>>>>>> tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> The training from scratch ended as:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, word train=0%, skip ratio=0%,  New best char error = 0 wrote best model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint wrote checkpoint.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> The test with base_checkpoint returned nothing but this:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> At iteration 0, stage 0, Eval Char error rate=0, 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Word error rate=0
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> The test with eng.traineddata and e13b.png 
>>>>>>>>>>>>>>>>>>>>>>>>>>> returns out.txt.  Both files are attached.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Training seems to have worked fine.  I don't 
>>>>>>>>>>>>>>>>>>>>>>>>>>> know how to interpret the test result from 
>>>>>>>>>>>>>>>>>>>>>>>>>>> base_checkpoint.  The generated eng.traineddata 
>>>>>>>>>>>>>>>>>>>>>>>>>>> obviously doesn't work well.  I suspect my 
>>>>>>>>>>>>>>>>>>>>>>>>>>> choice of --traineddata when combining the 
>>>>>>>>>>>>>>>>>>>>>>>>>>> output files is bad, but I have no clue.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> BTW, I referred to your tess4training in the 
>>>>>>>>>>>>>>>>>>>>>>>>>>> process.  It helped a lot.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday, May 29, 2019 at 7:14:08 PM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> see 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, May 29, 2019 at 3:18 PM ElGato ElMago <
>>>>>>>>>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I wish to make traineddata for the E13B font.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> I read the training tutorial and made a 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> base_checkpoint file according to the method in 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Training From Scratch.  
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Now, how can I make a traineddata file from 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>> the base_checkpoint file?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/4a85b67c-c9fe-47b9-94e3-576e2ebc89e3%40googlegroups.com.
