Re: [tesseract-ocr] Trained data for E13B font

ElGato ElMago Tue, 06 Aug 2019 01:11:58 -0700

Hi,

FWIW, I got to the point where I'm happy with the accuracy. As the 
images in the previous post show, the symbols, especially the on-us symbol 
and the amount symbol, were being confused with each other or with other 
characters.  I added many more symbols to the training text and formed words 
that start with a symbol.  One example is as follows:


9;:;=;<;< <0<1<3<4;6;8;9;:;=;


I randomly generated 8,000 lines like this.  When fine-tuning from eng, 
5,000 iterations gave almost good results.  The amount symbol is still 
confused a little when it's followed by 0.  Fine-tuning tends to be thrown 
off by small particles.  I'll have to think of something to improve it further.
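
A minimal sketch of how such training lines could be generated at random. It assumes the four E13B symbols are written as the ASCII stand-ins ; : < = seen in the sample line above; the function names and the word and line lengths are hypothetical choices, not the values actually used.

```python
import random

DIGITS = "0123456789"
SYMBOLS = ";:<="  # assumed ASCII stand-ins for the four E13B symbols


def make_word(rng, max_pairs=8):
    """Build one word that starts with a symbol, then alternates
    symbol / any E13B character (hypothetical pattern)."""
    chars = []
    for _ in range(rng.randint(1, max_pairs)):
        chars.append(rng.choice(SYMBOLS))
        chars.append(rng.choice(DIGITS + SYMBOLS))
    return "".join(chars)


def make_line(rng, max_words=4):
    """Join a random number of words with single spaces."""
    return " ".join(make_word(rng) for _ in range(rng.randint(1, max_words)))


def make_training_text(n_lines=8000, seed=0):
    """Return n_lines random lines; the seed makes the text reproducible."""
    rng = random.Random(seed)
    return [make_line(rng) for _ in range(n_lines)]


lines = make_training_text()
```

Writing the list to a file with one line per entry would give a training text that can be passed to tesstrain.sh via --training_text.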

Training from scratch produced a somewhat more stable traineddata.  It 
doesn't confuse the symbols as often but tends to generate extra spaces.  By 
10,000 iterations, those spaces were gone and recognition became very solid.

I thought I might have to do image and box file training but I guess it's 
not needed this time.

ElMagoElGato

On Friday, July 26, 2019 at 14:08:06 UTC+9, ElGato ElMago wrote:
>
> Hi,
>
> Well, I read the description of ScrollView (
> https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging) and it 
> says:
>
> To show the characters, deselect DISPLAY/Bounding Boxes, select 
> DISPLAY/Polygonal Approx and then select OTHER/Uniform display.
>
>
> It basically works.  But for some reason, it doesn't work on my e13b image 
> and ends up with a blue screen.  Anyway, it shows each box separately when 
> a character consists of multiple boxes.  I'd like to show the box for the 
> whole character.  ScrollView doesn't do that, at least not yet.  I'll do it 
> on my own.
>
> ElMagoElGato
>
> On Wednesday, July 24, 2019 at 14:10:46 UTC+9, ElGato ElMago wrote:
>>
>> Hi,
>>
>>
>> I got this result from hocr.  This is where one of the phantom characters 
>> comes from.
>>
>> <span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 
>> 98.864532'>&lt;</span>
>> <span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; x_conf 
>> 99.018097'>;</span>
>>
>>
>> The first character is the phantom.  Its box starts at the same x position 
>> as the second character, which really exists, but it is only 3 pixels 
>> wide.  I attach ScrollView screenshots that visualize this.
>>
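A minimal sketch of how such narrow boxes could be flagged automatically from the hocr output. The regular expression only targets the ocrx_cinfo span layout shown above, and the 5-pixel width threshold is an assumption that would need tuning per image; parse_cinfo and flag_phantoms are hypothetical helper names.

```python
import html
import re

# Matches the per-character spans shown above; \s also tolerates line wraps.
CINFO = re.compile(
    r"<span class='ocrx_cinfo' title='x_bboxes\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+);"
    r"\s*x_conf\s+([\d.]+)'>(.*?)</span>",
    re.S,
)


def parse_cinfo(hocr):
    """Return one dict per character: text, (x0, y0, x1, y1) box, confidence."""
    chars = []
    for x0, y0, x1, y1, conf, text in CINFO.findall(hocr):
        chars.append({
            "char": html.unescape(text),
            "box": (int(x0), int(y0), int(x1), int(y1)),
            "conf": float(conf),
        })
    return chars


def flag_phantoms(chars, min_width=5):
    """Return characters whose box is narrower than min_width pixels
    (the threshold is an assumed value, not a Tesseract setting)."""
    return [c for c in chars if c["box"][2] - c["box"][0] < min_width]


sample = (
    "<span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; "
    "x_conf 98.864532'>&lt;</span>\n"
    "<span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; "
    "x_conf 99.018097'>;</span>"
)
phantoms = flag_phantoms(parse_cinfo(sample))
```

On the two spans above, only the 3-pixel-wide '<' is flagged. Note that its confidence is as high as that of the real character, so confidence alone would not separate them.
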
>> [image: 2019-07-24-132643_854x707_scrot.png][image: 
>> 2019-07-24-132800_854x707_scrot.png]
>>
>>
>> There seem to be some more cases that cause phantom characters.  I'll look 
>> into them.  But I have a trivial question now.  I made ScrollView show 
>> these displays by accidentally clicking the Display->Blamer menu.  There is 
>> a Bounding Boxes menu below it, but it ends up showing a blue screen though 
>> it briefly shows boxes on the way.  Can I use this menu at all?  It would 
>> be very useful.
>>
>> [image: 2019-07-24-140739_854x707_scrot.png]
>>
>>
>> On Tuesday, July 23, 2019 at 17:10:36 UTC+9, ElGato ElMago wrote:
>>>
>>> It's great! Perfect!  Thanks a lot!
>>>
>>> On Tuesday, July 23, 2019 at 10:56:58 UTC+9, shree wrote:
>>>>
>>>> See https://github.com/tesseract-ocr/tesseract/issues/2580
>>>>
>>>> On Tue, 23 Jul 2019, 06:23 ElGato ElMago, <elmago...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I read the output of hocr with lstm_choice_mode = 4 as described in 
>>>>> pull request 2554.  It shows the candidates for each character but 
>>>>> doesn't show the bounding box of each character.  It only shows the box 
>>>>> for a whole word.
>>>>>
>>>>> I see bounding boxes of each character in the comments of pull request 
>>>>> 2576.  How can I do that?  Do I have to look in the source code and 
>>>>> produce such output on my own?
>>>>>
>>>>> On Friday, July 19, 2019 at 18:40:49 UTC+9, ElGato ElMago wrote:
>>>>>
>>>>>> Lorenzo,
>>>>>>
>>>>>> I haven't been checking psm too much.  Will turn to those options 
>>>>>> after I see how it goes with bounding boxes.
>>>>>>
>>>>>> Shree,
>>>>>>
>>>>>> I see the merges in the git log and also see that the new 
>>>>>> option lstm_choice_amount works now.  I guess my executable is the 
>>>>>> latest, though I still see the phantom character.  Hocr produces huge 
>>>>>> and complex output.  I'll take some time to read it.
>>>>>>
>>>>>> On Friday, July 19, 2019 at 18:20:55 UTC+9, Claudiu wrote:
>>>>>>>
>>>>>>> Is there any way to pass bounding boxes to use to the LSTM? We have 
>>>>>>> an algorithm that cleanly gets bounding boxes of MRZ characters. 
>>>>>>> However, the results using psm 10 are worse than passing the whole 
>>>>>>> line in. Yet when we pass the whole line in we get these phantom 
>>>>>>> characters. 
>>>>>>>
>>>>>>> Should PSM 10 mode work? It often returns “no character” where there 
>>>>>>> clearly is one. I can supply a test case if it is expected to work 
>>>>>>> well. 
>>>>>>>
>>>>>>> On Fri, Jul 19, 2019 at 11:06 AM ElGato ElMago <elmago...@gmail.com> 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Lorenzo,
>>>>>>>>
>>>>>>>> We both have got the same case.  It seems a solution to this 
>>>>>>>> problem would save a lot of people.
>>>>>>>>
>>>>>>>> Shree,
>>>>>>>>
>>>>>>>> I pulled the current head of the master branch but it doesn't seem 
>>>>>>>> to contain the merges you pointed to, which were merged 3 to 4 days 
>>>>>>>> ago.  How can I get them?
>>>>>>>>
>>>>>>>> ElMagoElGato
>>>>>>>>
>>>>>>>> On Friday, July 19, 2019 at 17:02:53 UTC+9, Lorenzo Blz wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> PSM 7 was a partial solution for my specific case: it improved the 
>>>>>>>>> situation but did not solve it. Also, I could not use it in some 
>>>>>>>>> other cases.
>>>>>>>>>
>>>>>>>>> The proper solution is very likely doing more training with more 
>>>>>>>>> data; some data augmentation would probably help if data is scarce.
>>>>>>>>> Doing less training might also help if the training is not done 
>>>>>>>>> correctly.
>>>>>>>>>
>>>>>>>>> There are also similar issues on github:
>>>>>>>>>
>>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1465
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>> The LSTM engine works like this: it scans the image and for each 
>>>>>>>>> "pixel column" does this:
>>>>>>>>>
>>>>>>>>> M M M M N M M M [BLANK] F F F F
>>>>>>>>>
>>>>>>>>> (here I report only the highest-probability characters)
>>>>>>>>>
>>>>>>>>> In the example above an M is partially seen as an N. This is 
>>>>>>>>> normal, and another step of the algorithm (beam search, I think) 
>>>>>>>>> tries to aggregate the frames back into the correct characters.
>>>>>>>>>
>>>>>>>>> I think cases like this:
>>>>>>>>>
>>>>>>>>> M M M N N N M M
>>>>>>>>>
>>>>>>>>> are what give the phantom characters. More training should reduce 
>>>>>>>>> the source of the problem, or a painful analysis of the bounding 
>>>>>>>>> boxes might fix some cases.
>>>>>>>>>
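A minimal sketch of the idea described above, assuming a simplified greedy CTC-style collapse (merge repeated labels, drop blanks). This is not Tesseract's actual decoder, which the message above believes uses beam search; greedy_collapse and the min_run heuristic are hypothetical and only illustrate why a longer run of wrong frames survives as a phantom character.

```python
from itertools import groupby

BLANK = "[BLANK]"


def greedy_collapse(frames, min_run=1):
    """Collapse per-column labels into text.

    Consecutive identical labels form a run. Blanks separate characters.
    Runs shorter than min_run are treated as noise and skipped, so the
    runs on either side merge if they carry the same label.
    """
    out = []
    prev = None
    for label, group in groupby(frames):
        length = len(list(group))
        if label == BLANK:
            prev = None  # a blank lets the same label repeat afterwards
            continue
        if length < min_run:
            continue  # noise: do not emit and do not update prev
        if label != prev:
            out.append(label)
        prev = label
    return "".join(out)


first = "M M M M N M M M [BLANK] F F F F".split()
second = "M M M N N N M M".split()

plain = greedy_collapse(first)               # "MNMF": the single N leaks through
filtered = greedy_collapse(first, min_run=2)  # "MF": the one-frame N is dropped
stubborn = greedy_collapse(second, min_run=2)  # "MNM": a 3-frame N still survives
```

The second sequence shows why a simple run-length filter is not enough: once the wrong label persists for several columns, it looks like a real character, which matches the suggestion that more training is the proper fix.
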
>>>>>>>>>
>>>>>>>>> I used the attached script for the boxes.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Lorenzo
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, 19 Jul 2019 at 07:25, ElGato ElMago <
>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Let's call them phantom characters then.
>>>>>>>>>>
>>>>>>>>>> Was psm 7 the solution for issue 1778?  None of the psm options 
>>>>>>>>>> solved my problem, though I see different output.
>>>>>>>>>>
>>>>>>>>>> I use tesseract 5.0-alpha mostly but 4.1 showed the same results 
>>>>>>>>>> anyway.  How did you get the bounding box for each character?  
>>>>>>>>>> Alto and lstmbox only show the bbox for a group of characters.
>>>>>>>>>>
>>>>>>>>>> ElMagoElGato
>>>>>>>>>>
>>>>>>>>>> On Wednesday, July 17, 2019 at 18:58:31 UTC+9, Lorenzo Blz wrote:
>>>>>>>>>>
>>>>>>>>>>> Phantom characters here for me too:
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1778
>>>>>>>>>>>
>>>>>>>>>>> Are you using 4.1? Bounding boxes were fixed in 4.1, so maybe 
>>>>>>>>>>> this was also improved.
>>>>>>>>>>>
>>>>>>>>>>> I wrote some code that uses the symbol iterator to discard 
>>>>>>>>>>> symbols that are clearly duplicated: too small, overlapping, etc. 
>>>>>>>>>>> But it was not easy to make it work decently and it is not 100% 
>>>>>>>>>>> reliable, with false negatives and positives. I cannot share the 
>>>>>>>>>>> code and it is quite ugly anyway.
>>>>>>>>>>>
>>>>>>>>>>> Here is another MRZ model with training data:
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/DoubangoTelecom/tesseractMRZ
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Lorenzo
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 17 Jul 2019 at 11:26, Claudiu <
>>>>>>>>>>> csaf...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I’m getting the “phantom character” issue as well using the 
>>>>>>>>>>>> OCRB that Shree trained on MRZ lines. For example, for a 0 it 
>>>>>>>>>>>> will sometimes add both a 0 and an O to the output, thus 
>>>>>>>>>>>> outputting 45 characters total instead of 44. I haven’t looked 
>>>>>>>>>>>> at the bounding box output yet but I suspect a phantom thin 
>>>>>>>>>>>> character is added somewhere that I can discard, or maybe two 
>>>>>>>>>>>> chars will have the same bounding box. If anyone else has fixed 
>>>>>>>>>>>> this issue further up (e.g. so the output doesn’t contain the 
>>>>>>>>>>>> phantom characters in the first place) I’d be interested. 
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:01 AM ElGato ElMago <
>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'll go back to more training later.  Before doing so, I'd 
>>>>>>>>>>>>> like to investigate the results a little bit.  The hocr and 
>>>>>>>>>>>>> lstmbox options give some details of the positions of 
>>>>>>>>>>>>> characters.  The results show positions that perfectly 
>>>>>>>>>>>>> correspond to letters in the image.  But the text output 
>>>>>>>>>>>>> contains a character that obviously does not exist.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Then I found a config file 'lstmdebug' that generates far more 
>>>>>>>>>>>>> information.  I hope it explains what happened with each 
>>>>>>>>>>>>> character.  I have yet to read the debug output but I'd 
>>>>>>>>>>>>> appreciate it if someone could tell me how to read it because 
>>>>>>>>>>>>> it's really complex.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Friday, June 14, 2019 at 19:58:49 UTC+9, shree wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> See https://github.com/Shreeshrii/tessdata_MICR
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have uploaded my files there. 
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh
>>>>>>>>>>>>>> is the bash script that runs the training.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> You can modify as needed. Please note this is for legacy/base 
>>>>>>>>>>>>>> tesseract --oem 0.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago <
>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks a lot, shree.  It seems you know everything.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I tried the MICR0.traineddata and the first two 
>>>>>>>>>>>>>>> mcr.traineddata.  The last one was blocked by the browser.  
>>>>>>>>>>>>>>> Each of the traineddata had mixed results.  All of them get 
>>>>>>>>>>>>>>> the symbols fairly well but insert spaces randomly and read 
>>>>>>>>>>>>>>> some numbers wrong.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> MICR0 seems the best among them.  Did you suggest that you'd 
>>>>>>>>>>>>>>> be able to update it?  It very often gets a triple D where 
>>>>>>>>>>>>>>> there's only one, and so on.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also, I tried to fine tune from MICR0 but I found that I 
>>>>>>>>>>>>>>> need to change the language-specific.sh.  It specifies some 
>>>>>>>>>>>>>>> parameters for 
>>>>>>>>>>>>>>> each language.  Do you have any guidance for it?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Friday, June 14, 2019 at 01:48:40 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> see 
>>>>>>>>>>>>>>>> http://www.devscope.net/Content/ocrchecks.aspx 
>>>>>>>>>>>>>>>> https://github.com/BigPino67/Tesseract-MICR-OCR
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ
>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <
>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> It would be nice if there were traineddata out there but I 
>>>>>>>>>>>>>>>>> didn't find any.  I see free fonts and commercial OCR 
>>>>>>>>>>>>>>>>> software but not traineddata.  The tessdata repository 
>>>>>>>>>>>>>>>>> obviously doesn't have one, either.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Saturday, June 8, 2019 at 01:52:10 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Please also search for existing MICR traineddata files.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <
>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> So I did several tests from scratch.  In the last 
>>>>>>>>>>>>>>>>>>> attempt, I made a training text with 4,000 lines in the 
>>>>>>>>>>>>>>>>>>> following format,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 110004310510<   <02 :4002=0181:801= 0008752 <00039 
>>>>>>>>>>>>>>>>>>> ;0000001000;
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> and combined it with eng.digits.training_text in which 
>>>>>>>>>>>>>>>>>>> symbols are converted to E13B symbols.  This makes about 
>>>>>>>>>>>>>>>>>>> 12,000 lines of 
>>>>>>>>>>>>>>>>>>> training text.  It's amazing that this thing generates a 
>>>>>>>>>>>>>>>>>>> good reader out of 
>>>>>>>>>>>>>>>>>>> nowhere.  But then it is not very good.  For example:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> <01 :1901=1386:021= 1111001<10001< ;0000090134;
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> is the result for the attached image.  It's close, but 
>>>>>>>>>>>>>>>>>>> the last '<' in the result text doesn't exist in the 
>>>>>>>>>>>>>>>>>>> image.  It's a small failure, but it causes greater 
>>>>>>>>>>>>>>>>>>> trouble in parsing.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> What would you suggest from here to increase accuracy?  
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>    - Increase the number of lines in the training text
>>>>>>>>>>>>>>>>>>>    - Mix up more variations in the training text
>>>>>>>>>>>>>>>>>>>    - Increase the number of iterations
>>>>>>>>>>>>>>>>>>>    - Investigate wrong reads one by one
>>>>>>>>>>>>>>>>>>>    - Or else?
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Also, I referred to engrestrict*.* and could generate a 
>>>>>>>>>>>>>>>>>>> similar result with the fine-tuning-from-full method.  
>>>>>>>>>>>>>>>>>>> It seems a bit faster to get to the same level but it 
>>>>>>>>>>>>>>>>>>> also stops at a 'good' level.  I can go either way if it 
>>>>>>>>>>>>>>>>>>> takes me to a bright future.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 15:56:02 UTC+9, ElGato ElMago wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks a lot, Shree. I'll look into it.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 14:39:52 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> See https://github.com/Shreeshrii/tessdata_shreetest
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Look at the files engrestrict*.* and also 
>>>>>>>>>>>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Create training text of about 100 lines and finetune 
>>>>>>>>>>>>>>>>>>>>> for 400 iterations 
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <
>>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I had about 14 lines as attached.  How many lines 
>>>>>>>>>>>>>>>>>>>>>> would you recommend?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Fine-tuning gives a much better result but it tends to 
>>>>>>>>>>>>>>>>>>>>>> pick characters that are not in E13B, which has only 
>>>>>>>>>>>>>>>>>>>>>> 14 characters: 0 through 9 and 4 symbols.  I thought 
>>>>>>>>>>>>>>>>>>>>>> training from scratch would eliminate such confusion.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 10:43:08 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> For training from scratch a large training text and 
>>>>>>>>>>>>>>>>>>>>>>> hundreds of thousands of iterations are recommended. 
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> If you are just fine-tuning for a font, try to follow 
>>>>>>>>>>>>>>>>>>>>>>> the instructions for training for Impact, with your font.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago, <
>>>>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Thanks, Shree.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Yes, I saw the instruction.  The steps I made are 
>>>>>>>>>>>>>>>>>>>>>>>> as follows:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Using tesstrain.sh:
>>>>>>>>>>>>>>>>>>>>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng \
>>>>>>>>>>>>>>>>>>>>>>>>   --linedata_only --noextract_font_properties \
>>>>>>>>>>>>>>>>>>>>>>>>   --langdata_dir ../langdata \
>>>>>>>>>>>>>>>>>>>>>>>>   --tessdata_dir ./tessdata \
>>>>>>>>>>>>>>>>>>>>>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>>>>>>>>>>>>>>>>>>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Training from scratch:
>>>>>>>>>>>>>>>>>>>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>>>>>>>>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
>>>>>>>>>>>>>>>>>>>>>>>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>>>>>>>>>>>   --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Test with base_checkpoint:
>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Combining output files:
>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --stop_training \
>>>>>>>>>>>>>>>>>>>>>>>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Test with eng.traineddata:
>>>>>>>>>>>>>>>>>>>>>>>> tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> The training from scratch ended as:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%,
>>>>>>>>>>>>>>>>>>>>>>>> word train=0%, skip ratio=0%,  New best char error = 0 wrote best
>>>>>>>>>>>>>>>>>>>>>>>> model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint wrote checkpoint.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> The test with base_checkpoint returns nothing as:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> The test with eng.traineddata and e13b.png returns 
>>>>>>>>>>>>>>>>>>>>>>>> out.txt.  Both files are attached.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Training seems to have worked fine.  I don't know 
>>>>>>>>>>>>>>>>>>>>>>>> how to interpret the test result from base_checkpoint.  
>>>>>>>>>>>>>>>>>>>>>>>> The generated eng.traineddata obviously doesn't work 
>>>>>>>>>>>>>>>>>>>>>>>> well.  I suspect the choice of --traineddata in 
>>>>>>>>>>>>>>>>>>>>>>>> combining output files is bad, but I have no clue.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> BTW, I referred to your tess4training in the 
>>>>>>>>>>>>>>>>>>>>>>>> process.  It helped a lot.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday, May 29, 2019 at 19:14:08 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> see 
>>>>>>>>>>>>>>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, May 29, 2019 at 3:18 PM ElGato ElMago <
>>>>>>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> I would like to create traineddata for the E13B font.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> I read the training tutorial and made a 
>>>>>>>>>>>>>>>>>>>>>>>>>> base_checkpoint file according to the method in 
>>>>>>>>>>>>>>>>>>>>>>>>>> Training From Scratch.  
>>>>>>>>>>>>>>>>>>>>>>>>>> Now, how can I make a traineddata file from the 
>>>>>>>>>>>>>>>>>>>>>>>>>> base_checkpoint file?
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>>>>>>>>>>> You received this message because you are 
>>>>>>>>>>>>>>>>>>>>>>>>>> subscribed to the Google Groups "tesseract-ocr" 
>>>>>>>>>>>>>>>>>>>>>>>>>> group.
>>>>>>>>>>>>>>>>>>>>>>>>>> To unsubscribe from this group and stop receiving 
>>>>>>>>>>>>>>>>>>>>>>>>>> emails from it, send an email to 
>>>>>>>>>>>>>>>>>>>>>>>>>> tesser...@googlegroups.com.
>>>>>>>>>>>>>>>>>>>>>>>>>> To post to this group, send email to 
>>>>>>>>>>>>>>>>>>>>>>>>>> tesser...@googlegroups.com.
>>>>>>>>>>>>>>>>>>>>>>>>>> Visit this group at 
>>>>>>>>>>>>>>>>>>>>>>>>>> https://groups.google.com/group/tesseract-ocr.
>>>>>>>>>>>>>>>>>>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>>>>>>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/4848cfa5-ae2b-4be3-a771-686aa0fec702%40googlegroups.com
>>>>>>>>>>>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>>>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/4848cfa5-ae2b-4be3-a771-686aa0fec702%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>>>>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>>>>>>>>>>>> For more options, visit 
>>>>>>>>>>>>>>>>>>>>>>>>>> https://groups.google.com/d/optout.
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> ____________________________________________________________
>>>>>>>>>>>>>>>>>>>>>>>>> भजन - कीर्तन - आरती @ 
>>>>>>>>>>>>>>>>>>>>>>>>> http://bhajans.ramparivar.com
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/3cecc106-fbb9-4a4a-bd98-e992ec034cef%40googlegroups.com.
