Hi,

I'm thinking of sharing it, of course.  What is the best way to do it?  
After all this, my contribution is only how I prepared the training text.  
Even that consists of Shree's text and mine.  The instructions and tools I 
used already exist.

ElMagoElGato

On Wednesday, August 7, 2019 at 8:20:02 AM UTC+9, Mamadou wrote:

> Hello,
> Are you planning to release the dataset or models?
> I'm working on the same subject and planning to share both under BSD terms
>
> On Tuesday, August 6, 2019 at 10:11:40 AM UTC+2, ElGato ElMago wrote:
>>
>> Hi,
>>
>> FWIW, I got to the point where I can feel happy with the accuracy.  As the 
>> images of the previous post show, the symbols, especially the on-us symbol 
>> and the amount symbol, were being confused with each other or with other 
>> characters.  I added many more symbols to the training text and formed 
>> words that start with a symbol.  One example is as follows:
>>
>> 9;:;=;<;< <0<1<3<4;6;8;9;:;=;
>>
>>
>> I randomly made 8,000 lines like this.  In fine-tuning from eng, 5,000 
>> iterations were almost enough.  The amount symbol is still confused a 
>> little when it's followed by a 0.  Fine-tuning tends to be dragged around 
>> by small details.  I'll have to think of something to make further 
>> improvements.
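The random line generation described above can be sketched like this (a hypothetical reconstruction; the actual script wasn't shared, and it assumes the four E13B symbols are transliterated to ; : < = as in the training-text sample):

```python
import random

# E13B has only 14 characters: the digits plus four MICR symbols,
# transliterated here as ; : < = (assumed mapping, as in the sample line).
DIGITS = "0123456789"
SYMBOLS = ";:<="

def make_word(rng, min_len=2, max_len=8):
    # Each word starts with a symbol, so the model sees symbols in
    # word-initial position, followed by a random mix of digits/symbols.
    body = "".join(rng.choice(DIGITS + SYMBOLS)
                   for _ in range(rng.randint(min_len, max_len)))
    return rng.choice(SYMBOLS) + body

def make_line(rng, words=3):
    return " ".join(make_word(rng) for _ in range(words))

def make_training_text(n_lines=8000, seed=0):
    rng = random.Random(seed)
    return [make_line(rng) for _ in range(n_lines)]

if __name__ == "__main__":
    for line in make_training_text(8000):
        print(line)
```

Redirecting the output to a file would give the 8,000-line training text described above.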
>>
>> Training from scratch produced slightly more stable traineddata.  It 
>> doesn't confuse symbols so often but tends to generate extra spaces.  By 
>> 10,000 iterations, those spaces were gone and recognition became very 
>> solid.
>>
>> I thought I might have to do image and box file training but I guess it's 
>> not needed this time.
>>
>> ElMagoElGato
>>
>> On Friday, July 26, 2019 at 2:08:06 PM UTC+9, ElGato ElMago wrote:
>>>
>>> Hi,
>>>
>>> Well, I read the description of ScrollView (
>>> https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging) and it 
>>> says:
>>>
>>> To show the characters, deselect DISPLAY/Bounding Boxes, select 
>>> DISPLAY/Polygonal Approx and then select OTHER/Uniform display.
>>>
>>>
>>> It basically works.  But for some reason, it doesn't work on my e13b 
>>> image and ends up with a blue screen.  Anyway, it shows each box 
>>> separately when a character consists of multiple boxes.  I'd like to show 
>>> the box for the whole character.  ScrollView doesn't do it, at least not 
>>> yet.  I'll do it on my own.
>>>
>>> ElMagoElGato
>>>
>>> On Wednesday, July 24, 2019 at 2:10:46 PM UTC+9, ElGato ElMago wrote:
>>>>
>>>> Hi,
>>>>
>>>>
>>>> I got this result from hocr.  This is where one of the phantom 
>>>> characters comes from.
>>>>
>>>> <span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 
>>>> 98.864532'>&lt;</span>
>>>> <span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; x_conf 
>>>> 99.018097'>;</span>
>>>>
>>>>
>>>> The first character is the phantom.  It starts where the second 
>>>> character exists on the x axis.  The first character is only 3 points 
>>>> wide.  I attach ScrollView screenshots that visualize this.
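A narrow-box filter over the hocr output can flag such candidates automatically (a rough sketch; it assumes ocrx_cinfo spans with x_bboxes titles as in the snippet above, and the width threshold is illustrative):

```python
import re

# Matches per-character spans like the one quoted above:
# <span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 98.864532'>&lt;</span>
CINFO = re.compile(
    r"class='ocrx_cinfo' title='x_bboxes (\d+) (\d+) (\d+) (\d+); "
    r"x_conf ([\d.]+)'>(.*?)</span>")

def narrow_boxes(hocr, min_width=4):
    """Return (text, x1, y1, x2, y2, conf) for suspiciously thin boxes."""
    hits = []
    for m in CINFO.finditer(hocr):
        x1, y1, x2, y2 = (int(m.group(i)) for i in (1, 2, 3, 4))
        if x2 - x1 < min_width:
            hits.append((m.group(6), x1, y1, x2, y2, float(m.group(5))))
    return hits
```

Run on the two spans above, it would flag only the first one (width 3) as a phantom candidate.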
>>>>
>>>> [image: 2019-07-24-132643_854x707_scrot.png][image: 
>>>> 2019-07-24-132800_854x707_scrot.png]
>>>>
>>>>
>>>> There seem to be some more cases that cause phantom characters.  I'll 
>>>> look into them.  But I have a trivial question now.  I made ScrollView 
>>>> show these displays by accidentally clicking the Display->Blamer menu.  
>>>> There is a Bounding Boxes menu below it, but it ends up showing a blue 
>>>> screen, though it briefly shows the boxes along the way.  Can I use this 
>>>> menu at all?  It would be very useful.
>>>>
>>>> [image: 2019-07-24-140739_854x707_scrot.png]
>>>>
>>>>
>>>> On Tuesday, July 23, 2019 at 5:10:36 PM UTC+9, ElGato ElMago wrote:
>>>>>
>>>>> It's great! Perfect!  Thanks a lot!
>>>>>
>>>>> On Tuesday, July 23, 2019 at 10:56:58 AM UTC+9, shree wrote:
>>>>>>
>>>>>> See https://github.com/tesseract-ocr/tesseract/issues/2580
>>>>>>
>>>>>> On Tue, 23 Jul 2019, 06:23 ElGato ElMago, <elmago...@gmail.com> 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I read the hocr output with lstm_choice_mode = 4, as per pull 
>>>>>>> request 2554.  It shows the candidates for each character but doesn't 
>>>>>>> show the bounding box of each character.  It only shows the box for a 
>>>>>>> whole word.
>>>>>>>
>>>>>>> I see bounding boxes of each character in the comments of pull 
>>>>>>> request 2576.  How can I do that?  Do I have to look into the source 
>>>>>>> code and produce such output on my own?
>>>>>>>
>>>>>>> On Friday, July 19, 2019 at 6:40:49 PM UTC+9, ElGato ElMago wrote:
>>>>>>>
>>>>>>>> Lorenzo,
>>>>>>>>
>>>>>>>> I haven't been checking psm too much.  Will turn to those options 
>>>>>>>> after I see how it goes with bounding boxes.
>>>>>>>>
>>>>>>>> Shree,
>>>>>>>>
>>>>>>>> I see the merges in the git log and also see that the new 
>>>>>>>> option lstm_choice_amount works now.  I guess my executable is the 
>>>>>>>> latest, though I still see the phantom character.  Hocr makes huge 
>>>>>>>> and complex output.  I'll take some time to read it.
>>>>>>>>
>>>>>>>> On Friday, July 19, 2019 at 6:20:55 PM UTC+9, Claudiu wrote:
>>>>>>>>>
>>>>>>>>> Is there any way to pass bounding boxes to use to the LSTM? We 
>>>>>>>>> have an algorithm that cleanly gets bounding boxes of MRZ characters. 
>>>>>>>>> However the results using psm 10 are worse than passing the whole 
>>>>>>>>> line in. 
>>>>>>>>> Yet when we pass the whole line in we get these phantom characters. 
>>>>>>>>>
>>>>>>>>> Should PSM 10 mode work? It often returns “no character” where 
>>>>>>>>> there clearly is one. I can supply a test case if it is expected to 
>>>>>>>>> work 
>>>>>>>>> well. 
>>>>>>>>>
>>>>>>>>> On Fri, Jul 19, 2019 at 11:06 AM ElGato ElMago <
>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Lorenzo,
>>>>>>>>>>
>>>>>>>>>> We both have got the same case.  It seems a solution to this 
>>>>>>>>>> problem would save a lot of people.
>>>>>>>>>>
>>>>>>>>>> Shree,
>>>>>>>>>>
>>>>>>>>>> I pulled the current head of the master branch but it doesn't seem 
>>>>>>>>>> to contain the merges you pointed to, which were merged 3 to 4 days 
>>>>>>>>>> ago.  How can I get them?
>>>>>>>>>>
>>>>>>>>>> ElMagoElGato
>>>>>>>>>>
>>>>>>>>>> On Friday, July 19, 2019 at 5:02:53 PM UTC+9, Lorenzo Blz wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> PSM 7 was a partial solution for my specific case, it improved 
>>>>>>>>>>> the situation but did not solve it. Also I could not use it in some 
>>>>>>>>>>> other 
>>>>>>>>>>> cases.
>>>>>>>>>>>
>>>>>>>>>>> The proper solution is very likely doing more training with more 
>>>>>>>>>>> data; some data augmentation might help if data is scarce.
>>>>>>>>>>> Also, doing less training might help if the training is not done 
>>>>>>>>>>> correctly.
>>>>>>>>>>>
>>>>>>>>>>> There are also similar issues on github:
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1465
>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>> The LSTM engine works like this: it scans the image and for each 
>>>>>>>>>>> "pixel column" does this:
>>>>>>>>>>>
>>>>>>>>>>> M M M M N M M M [BLANK] F F F F
>>>>>>>>>>>
>>>>>>>>>>> (here I report only the highest-probability characters)
>>>>>>>>>>>
>>>>>>>>>>> In the example above an M is partially seen as an N.  This is 
>>>>>>>>>>> normal, and another step of the algorithm (beam search, I think) 
>>>>>>>>>>> tries to aggregate the correct characters back together.
>>>>>>>>>>>
>>>>>>>>>>> I think cases like this:
>>>>>>>>>>>
>>>>>>>>>>> M M M N N N M M
>>>>>>>>>>>
>>>>>>>>>>> are what gives the phantom characters. More training should 
>>>>>>>>>>> reduce the source of the problem or a painful analysis of the 
>>>>>>>>>>> bounding 
>>>>>>>>>>> boxes might fix some cases.
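The per-column decoding Lorenzo describes can be illustrated with a minimal greedy CTC-style collapse (a toy sketch, not Tesseract's actual beam search): consecutive duplicate frames merge into one character and blanks are dropped, so a brief misread in the middle of a run, and especially a sustained one like N N N, survives the collapse as a phantom character.

```python
BLANK = "[BLANK]"

def greedy_collapse(frames):
    """Merge consecutive duplicate frames and drop blanks,
    a greedy stand-in for CTC decoding."""
    out = []
    prev = None
    for f in frames:
        if f != prev and f != BLANK:
            out.append(f)
        prev = f
    return out

# A stray N inside a run of M frames yields a phantom under greedy
# decoding; beam search over the full per-column distributions can
# often suppress a single-frame N, but a sustained N N N run cannot
# be recovered this way.
print(greedy_collapse("M M M M N M M M [BLANK] F F F F".split()))  # ['M', 'N', 'M', 'F']
print(greedy_collapse("M M M N N N M M".split()))                  # ['M', 'N', 'M']
```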
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I used the attached script for the boxes.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Lorenzo
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Il giorno ven 19 lug 2019 alle ore 07:25 ElGato ElMago <
>>>>>>>>>>> elmago...@gmail.com> ha scritto:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Let's call them phantom characters then.
>>>>>>>>>>>>
>>>>>>>>>>>> Was psm 7 the solution for issue 1778?  None of the psm 
>>>>>>>>>>>> options solved my problem, though I see different output.
>>>>>>>>>>>>
>>>>>>>>>>>> I use tesseract 5.0-alpha mostly, but 4.1 showed the same 
>>>>>>>>>>>> results anyway.  How did you get the bounding box for each 
>>>>>>>>>>>> character?  Alto and lstmbox only show a bbox for a group of 
>>>>>>>>>>>> characters.
>>>>>>>>>>>>
>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>
>>>>>>>>>>>> On Wednesday, July 17, 2019 at 6:58:31 PM UTC+9, Lorenzo Blz wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Phantom characters here for me too:
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1778
>>>>>>>>>>>>>
>>>>>>>>>>>>> Are you using 4.1? Bounding boxes were fixed in 4.1 maybe this 
>>>>>>>>>>>>> was also improved.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I wrote some code that uses the symbol iterator to discard 
>>>>>>>>>>>>> symbols that are clearly duplicated: too small, overlapping, 
>>>>>>>>>>>>> etc.  But it was not easy to make it work decently and it is not 
>>>>>>>>>>>>> 100% reliable, with false negatives and positives.  I cannot 
>>>>>>>>>>>>> share the code and it is quite ugly anyway.
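A post-filter along those lines might look like this (a hypothetical sketch; Lorenzo's actual code wasn't shared, and it assumes you already have per-symbol boxes, e.g. from the symbol iterator, as (text, x1, y1, x2, y2) tuples with the thresholds chosen by eye):

```python
def overlap_ratio(a, b):
    """Horizontal overlap of box b with box a, as a fraction of b's width."""
    ax1, ax2, bx1, bx2 = a[1], a[3], b[1], b[3]
    inter = min(ax2, bx2) - max(ax1, bx1)
    return max(0, inter) / max(1, bx2 - bx1)

def drop_duplicates(symbols, min_width=4, max_overlap=0.8):
    """Discard symbols that are too thin or that mostly overlap
    the previously kept symbol (likely duplicates)."""
    kept = []
    for s in sorted(symbols, key=lambda s: s[1]):  # left to right
        x1, x2 = s[1], s[3]
        if x2 - x1 < min_width:
            continue  # phantom-thin box
        if kept and overlap_ratio(kept[-1], s) > max_overlap:
            continue  # mostly covers the same region as the previous symbol
        kept.append(s)
    return kept
```

As the thread notes, width/overlap heuristics like these produce both false positives and false negatives, so the thresholds need tuning per font and image.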
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here there is another MRZ model with training data:
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://github.com/DoubangoTelecom/tesseractMRZ
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Lorenzo
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Il giorno mer 17 lug 2019 alle ore 11:26 Claudiu <
>>>>>>>>>>>>> csaf...@gmail.com> ha scritto:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm getting the "phantom character" issue as well using the 
>>>>>>>>>>>>>> OCRB that Shree trained on MRZ lines.  For example, for a 0 it 
>>>>>>>>>>>>>> will sometimes add both a 0 and an O to the output, thus 
>>>>>>>>>>>>>> outputting 45 characters total instead of 44.  I haven't looked 
>>>>>>>>>>>>>> at the bounding box output yet but I suspect a phantom thin 
>>>>>>>>>>>>>> character is added somewhere that I can discard, or maybe two 
>>>>>>>>>>>>>> chars will have the same bounding box.  If anyone else has 
>>>>>>>>>>>>>> fixed this issue further up (e.g. so the output doesn't contain 
>>>>>>>>>>>>>> the phantom characters in the first place) I'd be interested.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:01 AM ElGato ElMago <
>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'll go back to more training later.  Before doing so, 
>>>>>>>>>>>>>>> I'd like to investigate the results a little bit.  The hocr 
>>>>>>>>>>>>>>> and lstmbox options give some details of the positions of 
>>>>>>>>>>>>>>> characters.  The results show positions that perfectly 
>>>>>>>>>>>>>>> correspond to letters in the image.  But the text output 
>>>>>>>>>>>>>>> contains a character that obviously does not exist.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Then I found a config file 'lstmdebug' that generates far 
>>>>>>>>>>>>>>> more information.  I hope it explains what happened with each 
>>>>>>>>>>>>>>> character.  
>>>>>>>>>>>>>>> I'm yet to read the debug output but I'd appreciate it if 
>>>>>>>>>>>>>>> someone could 
>>>>>>>>>>>>>>> tell me how to read it because it's really complex.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Friday, June 14, 2019 at 7:58:49 PM UTC+9, shree wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> See https://github.com/Shreeshrii/tessdata_MICR
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have uploaded my files there. 
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh
>>>>>>>>>>>>>>>> is the bash script that runs the training.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You can modify as needed. Please note this is for 
>>>>>>>>>>>>>>>> legacy/base tesseract --oem 0.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago <
>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks a lot, shree.  It seems you know everything.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I tried the MICR0.traineddata and the first two 
>>>>>>>>>>>>>>>>> mcr.traineddata.  The last one was blocked by the browser.  
>>>>>>>>>>>>>>>>> Each of the traineddata had mixed results.  All of them get 
>>>>>>>>>>>>>>>>> the symbols fairly well but insert spaces randomly and read 
>>>>>>>>>>>>>>>>> some numbers wrong.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> MICR0 seems the best among them.  Did you suggest that 
>>>>>>>>>>>>>>>>> you'd be able to update it?  It gets triple D very often 
>>>>>>>>>>>>>>>>> where there's only one, and so on.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Also, I tried to fine-tune from MICR0 but I found that I 
>>>>>>>>>>>>>>>>> need to change language-specific.sh.  It specifies some 
>>>>>>>>>>>>>>>>> parameters for each language.  Do you have any guidance for 
>>>>>>>>>>>>>>>>> it?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Friday, June 14, 2019 at 1:48:40 AM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> see 
>>>>>>>>>>>>>>>>>> http://www.devscope.net/Content/ocrchecks.aspx 
>>>>>>>>>>>>>>>>>> https://github.com/BigPino67/Tesseract-MICR-OCR
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ
>>>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <
>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> That'll be nice if there's traineddata out there but I 
>>>>>>>>>>>>>>>>>>> didn't find any.  I see free fonts and commercial OCR 
>>>>>>>>>>>>>>>>>>> software but not 
>>>>>>>>>>>>>>>>>>> traineddata.  Tessdata repository obviously doesn't have 
>>>>>>>>>>>>>>>>>>> one, either.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Saturday, June 8, 2019 at 1:52:10 AM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Please also search for existing MICR traineddata files.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <
>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> So I did several tests from scratch.  In the last 
>>>>>>>>>>>>>>>>>>>>> attempt, I made a training text with 4,000 lines in the 
>>>>>>>>>>>>>>>>>>>>> following format,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 110004310510<   <02 :4002=0181:801= 0008752 <00039 
>>>>>>>>>>>>>>>>>>>>> ;0000001000;
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> and combined it with eng.digits.training_text in which 
>>>>>>>>>>>>>>>>>>>>> symbols are converted to E13B symbols.  This makes about 
>>>>>>>>>>>>>>>>>>>>> 12,000 lines of 
>>>>>>>>>>>>>>>>>>>>> training text.  It's amazing that this thing generates a 
>>>>>>>>>>>>>>>>>>>>> good reader out of 
>>>>>>>>>>>>>>>>>>>>> nowhere.  But then it is not very good.  For example:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> <01 :1901=1386:021= 1111001<10001< ;0000090134;
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> is a result on the attached image.  It's close, but the 
>>>>>>>>>>>>>>>>>>>>> last '<' in the result text doesn't exist in the image.  
>>>>>>>>>>>>>>>>>>>>> It's a small failure but it causes greater trouble in 
>>>>>>>>>>>>>>>>>>>>> parsing.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> What would you suggest from here to increase 
>>>>>>>>>>>>>>>>>>>>> accuracy?  
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>    - Increase the number of lines in the training text
>>>>>>>>>>>>>>>>>>>>>    - Mix up more variations in the training text
>>>>>>>>>>>>>>>>>>>>>    - Increase the number of iterations
>>>>>>>>>>>>>>>>>>>>>    - Investigate wrong reads one by one
>>>>>>>>>>>>>>>>>>>>>    - Or else?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Also, I referred to engrestrict*.* and could generate a 
>>>>>>>>>>>>>>>>>>>>> similar result with the fine-tuning-from-full method.  
>>>>>>>>>>>>>>>>>>>>> It seems a bit faster to get to the same level but it 
>>>>>>>>>>>>>>>>>>>>> also stops at a 'good' level.  I can go either way if it 
>>>>>>>>>>>>>>>>>>>>> takes me to the bright future.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 3:56:02 PM UTC+9, ElGato ElMago wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks a lot, Shree.  I'll look into it.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 2:39:52 PM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> See https://github.com/Shreeshrii/tessdata_shreetest
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Look at the files engrestrict*.* and also 
>>>>>>>>>>>>>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Create training text of about 100 lines and finetune 
>>>>>>>>>>>>>>>>>>>>>>> for 400 lines 
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <
>>>>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I had about 14 lines as attached.  How many lines 
>>>>>>>>>>>>>>>>>>>>>>>> would you recommend?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Fine-tuning gives a much better result, but it tends 
>>>>>>>>>>>>>>>>>>>>>>>> to pick characters outside E13B, which only has 14 
>>>>>>>>>>>>>>>>>>>>>>>> characters: 0 through 9 and 4 symbols.  I thought 
>>>>>>>>>>>>>>>>>>>>>>>> training from scratch would eliminate such confusion.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 10:43:08 AM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> For training from scratch a large training text 
>>>>>>>>>>>>>>>>>>>>>>>>> and hundreds of thousands of iterations are 
>>>>>>>>>>>>>>>>>>>>>>>>> recommended. 
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> If you are just fine tuning for a font try to 
>>>>>>>>>>>>>>>>>>>>>>>>> follow instructions for training for impact, with 
>>>>>>>>>>>>>>>>>>>>>>>>> your font.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago, <
>>>>>>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Thanks, Shree.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I saw the instructions.  The steps I took are 
>>>>>>>>>>>>>>>>>>>>>>>>>> as follows:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Using tesstrain.sh:
>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/tesstrain.sh --fonts_dir 
>>>>>>>>>>>>>>>>>>>>>>>>>> /usr/share/fonts --lang eng --linedata_only \
>>>>>>>>>>>>>>>>>>>>>>>>>>   --noextract_font_properties --langdata_dir 
>>>>>>>>>>>>>>>>>>>>>>>>>> ../langdata \
>>>>>>>>>>>>>>>>>>>>>>>>>>   --tessdata_dir ./tessdata \
>>>>>>>>>>>>>>>>>>>>>>>>>>   --fontlist "E13Bnsd" --output_dir 
>>>>>>>>>>>>>>>>>>>>>>>>>> ~/tesstutorial/e13beval \
>>>>>>>>>>>>>>>>>>>>>>>>>>   --training_text 
>>>>>>>>>>>>>>>>>>>>>>>>>> ../langdata/eng/eng.training_e13b_text
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Training from scratch:
>>>>>>>>>>>>>>>>>>>>>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata 
>>>>>>>>>>>>>>>>>>>>>>>>>> ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 
>>>>>>>>>>>>>>>>>>>>>>>>>> Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>>>>>>>>>>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/base 
>>>>>>>>>>>>>>>>>>>>>>>>>> --learning_rate 20e-4 \
>>>>>>>>>>>>>>>>>>>>>>>>>>   --train_listfile 
>>>>>>>>>>>>>>>>>>>>>>>>>> ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>>>>>>>>>>>>>   --eval_listfile 
>>>>>>>>>>>>>>>>>>>>>>>>>> ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>>>>>>>>>>>>>   --max_iterations 5000 
>>>>>>>>>>>>>>>>>>>>>>>>>> &>~/tesstutorial/e13boutput/basetrain.log
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Test with base_checkpoint:
>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmeval --model 
>>>>>>>>>>>>>>>>>>>>>>>>>> ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata 
>>>>>>>>>>>>>>>>>>>>>>>>>> ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>>   --eval_listfile 
>>>>>>>>>>>>>>>>>>>>>>>>>> ~/tesstutorial/e13beval/eng.training_files.txt
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Combining output files:
>>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --stop_training \
>>>>>>>>>>>>>>>>>>>>>>>>>>   --continue_from 
>>>>>>>>>>>>>>>>>>>>>>>>>> ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata 
>>>>>>>>>>>>>>>>>>>>>>>>>> ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>>   --model_output 
>>>>>>>>>>>>>>>>>>>>>>>>>> ~/tesstutorial/e13boutput/eng.traineddata
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Test with eng.traineddata:
>>>>>>>>>>>>>>>>>>>>>>>>>> tesseract e13b.png out --tessdata-dir 
>>>>>>>>>>>>>>>>>>>>>>>>>> /home/koichi/tesstutorial/e13boutput
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> The training from scratch ended as:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, 
>>>>>>>>>>>>>>>>>>>>>>>>>> delta=0%, char train=0%, word train=0%, skip 
>>>>>>>>>>>>>>>>>>>>>>>>>> ratio=0%,  New best char error 
>>>>>>>>>>>>>>>>>>>>>>>>>> = 0 wrote best 
>>>>>>>>>>>>>>>>>>>>>>>>>> model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint
>>>>>>>>>>>>>>>>>>>>>>>>>>  wrote 
>>>>>>>>>>>>>>>>>>>>>>>>>> checkpoint.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> The test with base_checkpoint returns nothing as:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> At iteration 0, stage 0, Eval Char error rate=0, 
>>>>>>>>>>>>>>>>>>>>>>>>>> Word error rate=0
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> The test with eng.traineddata and e13b.png 
>>>>>>>>>>>>>>>>>>>>>>>>>> returns out.txt.  Both files are attached.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Training seems to have worked fine.  I don't know 
>>>>>>>>>>>>>>>>>>>>>>>>>> how to interpret the test result from 
>>>>>>>>>>>>>>>>>>>>>>>>>> base_checkpoint.  The generated eng.traineddata 
>>>>>>>>>>>>>>>>>>>>>>>>>> obviously doesn't work well.  I suspect the choice 
>>>>>>>>>>>>>>>>>>>>>>>>>> of --traineddata in combining output files is bad 
>>>>>>>>>>>>>>>>>>>>>>>>>> but I have no clue.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> BTW, I referred to your tess4training in the 
>>>>>>>>>>>>>>>>>>>>>>>>>> process.  It helped a lot.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday, May 29, 2019 at 7:14:08 PM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> see 
>>>>>>>>>>>>>>>>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, May 29, 2019 at 3:18 PM ElGato ElMago <
>>>>>>>>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I wish to make a trained data for E13B font.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>>> I read the training tutorial and made a 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> base_checkpoint file according to the method in 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Training From Scratch.  
>>>>>>>>>>>>>>>>>>>>>>>>>>>> Now, how can I make a trained data from the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>> base_checkpoint file?
>>>>>>>>>>>>>>>>>>>>>>>>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/e6d8db44-a5cc-4a1f-b655-37c7750133a3%40googlegroups.com.
