Hello,
Are you planning to release the dataset or models?
I'm working on the same subject and plan to share both under BSD terms.

On Tuesday, August 6, 2019 at 10:11:40 AM UTC+2, ElGato ElMago wrote:
>
> Hi,
>
> FWIW, I got to the point where I'm happy with the accuracy. As the 
> images in the previous post show, the symbols, especially the on-us symbol 
> and the amount symbol, were being confused with each other or with other 
> characters.  I added many more symbols to the training text and formed 
> words that start with a symbol.  One example is as follows:
>
> 9;:;=;<;< <0<1<3<4;6;8;9;:;=;
>
>
> I randomly generated 8,000 lines like this.  When fine-tuning from eng, 
> 5,000 iterations gave almost good results.  The amount symbol is still 
> occasionally confused when it's followed by 0.  Fine-tuning tends to be 
> dragged down by small particles.  I'll have to think of something to 
> improve it further.
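
A minimal sketch of how training lines like the example above could be generated. The exact rule used for the 8,000 lines is not given in the thread, so the pairing scheme, the word lengths, and the names `make_word` and `make_lines` are assumptions made for illustration only.

```python
import random

DIGITS = "0123456789"
SYMBOLS = ";:<="  # ASCII stand-ins for the four E13B symbols, as seen in the example line


def make_word(rng, min_pairs=2, max_pairs=8):
    """Return a word built from symbol+character pairs, so it always starts with a symbol."""
    pairs = rng.randint(min_pairs, max_pairs)
    return "".join(
        rng.choice(SYMBOLS) + rng.choice(DIGITS + SYMBOLS) for _ in range(pairs)
    )


def make_lines(count, seed=0, max_words=3):
    """Return `count` lines, each holding 1..max_words space-separated words."""
    rng = random.Random(seed)  # fixed seed keeps the training text reproducible
    return [
        " ".join(make_word(rng) for _ in range(rng.randint(1, max_words)))
        for _ in range(count)
    ]


if __name__ == "__main__":
    # Print a few sample lines; write all of them to a file to use as training text.
    for line in make_lines(8000)[:3]:
        print(line)
```

Because every second character is a symbol, each symbol appears next to every digit and every other symbol many times, which is the kind of coverage the message describes.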
>
> Training from scratch produced somewhat more stable traineddata.  It 
> doesn't confuse symbols as often but tends to generate extra spaces.  By 
> 10,000 iterations, those spaces were gone and recognition had become very 
> solid.
>
> I thought I might have to do image and box file training but I guess it's 
> not needed this time.
>
> ElMagoElGato
>
> On Friday, July 26, 2019 at 2:08:06 PM UTC+9, ElGato ElMago wrote:
>>
>> Hi,
>>
>> Well, I read the description of ScrollView (
>> https://github.com/tesseract-ocr/tesseract/wiki/ViewerDebugging) and it 
>> says:
>>
>> To show the characters, deselect DISPLAY/Bounding Boxes, select 
>> DISPLAY/Polygonal Approx and then select OTHER/Uniform display.
>>
>>
>> It basically works.  But for some reason, it doesn't work on my e13b 
>> image and ends up with a blue screen.  Anyway, it shows each box separately 
>> when a character consists of multiple boxes.  I'd like to show the box 
>> for the whole character.  ScrollView doesn't do that, at least not yet.  
>> I'll do it on my own.
>>
>> ElMagoElGato
>>
>> On Wednesday, July 24, 2019 at 2:10:46 PM UTC+9, ElGato ElMago wrote:
>>>
>>> Hi,
>>>
>>>
>>> I got this result from hocr.  This is where one of the phantom 
>>> characters comes from.
>>>
>>> <span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 
>>> 98.864532'>&lt;</span>
>>> <span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; x_conf 
>>> 99.018097'>;</span>
>>>
>>>
>>> The first character is the phantom.  It starts at the same x position as 
>>> the second character, which really exists.  The first character is only 
>>> 3 pixels wide.  I attach ScrollView screenshots that visualize this.
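
A small sketch of how the two `ocrx_cinfo` spans quoted above could be checked automatically. It only handles spans shaped exactly like that excerpt, and the 5-pixel width threshold and the names `parse_cinfo` and `narrow_glyphs` are assumptions for illustration, not part of Tesseract.

```python
import html
import re

# Matches spans shaped like the hOCR excerpt in this message.
CINFO = re.compile(
    r"<span class='ocrx_cinfo' title='x_bboxes (\d+) (\d+) (\d+) (\d+);"
    r"\s*x_conf\s+([\d.]+)'>(.*?)</span>",
    re.S,
)

SAMPLE = """
<span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 98.864532'>&lt;</span>
<span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; x_conf 99.018097'>;</span>
"""


def parse_cinfo(hocr):
    """Return one dict per character span: decoded text, bounding box, confidence."""
    glyphs = []
    for x0, y0, x1, y1, conf, text in CINFO.findall(hocr):
        glyphs.append(
            {
                "text": html.unescape(text),
                "box": (int(x0), int(y0), int(x1), int(y1)),
                "conf": float(conf),
            }
        )
    return glyphs


def narrow_glyphs(glyphs, min_width=5):
    """Return glyphs whose box is narrower than min_width pixels (hypothetical threshold)."""
    return [g for g in glyphs if g["box"][2] - g["box"][0] < min_width]


if __name__ == "__main__":
    for g in narrow_glyphs(parse_cinfo(SAMPLE)):
        print("suspect:", g["text"], g["box"], g["conf"])
```

Note that the phantom here has a confidence above 98, so confidence alone would not catch it; the width of the box is what gives it away.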
>>>
>>> [image: 2019-07-24-132643_854x707_scrot.png][image: 
>>> 2019-07-24-132800_854x707_scrot.png]
>>>
>>>
>>> There seem to be more cases that cause phantom characters.  I'll look 
>>> into them.  But I have a trivial question now.  I made ScrollView show 
>>> these displays by accidentally clicking the Display->Blamer menu.  There 
>>> is a Bounding Boxes menu below it, but it ends up showing a blue screen, 
>>> though it briefly shows boxes on the way.  Can I use this menu at all?  
>>> It would be very useful.
>>>
>>> [image: 2019-07-24-140739_854x707_scrot.png]
>>>
>>>
>>> On Tuesday, July 23, 2019 at 5:10:36 PM UTC+9, ElGato ElMago wrote:
>>>>
>>>> It's great! Perfect!  Thanks a lot!
>>>>
>>>> On Tuesday, July 23, 2019 at 10:56:58 AM UTC+9, shree wrote:
>>>>>
>>>>> See https://github.com/tesseract-ocr/tesseract/issues/2580
>>>>>
>>>>> On Tue, 23 Jul 2019, 06:23 ElGato ElMago, <elmago...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I read the hocr output with lstm_choice_mode = 4, as described in 
>>>>>> pull request 2554.  It shows the candidates for each character but 
>>>>>> doesn't show the bounding box of each character.  It only shows the 
>>>>>> box for a whole word.
>>>>>>
>>>>>> I see bounding boxes for each character in the comments of pull 
>>>>>> request 2576.  How can I do that?  Do I have to look in the source 
>>>>>> code and produce such output on my own?
>>>>>>
>>>>>> On Friday, July 19, 2019 at 6:40:49 PM UTC+9, ElGato ElMago wrote:
>>>>>>
>>>>>>> Lorenzo,
>>>>>>>
>>>>>>> I haven't been checking psm too much.  Will turn to those options 
>>>>>>> after I see how it goes with bounding boxes.
>>>>>>>
>>>>>>> Shree,
>>>>>>>
>>>>>>> I see the merges in the git log and also see that the new 
>>>>>>> option lstm_choice_amount works now.  I guess my executable is the 
>>>>>>> latest, though I still see the phantom character.  Hocr produces huge 
>>>>>>> and complex output.  I'll take some time to read it.
>>>>>>>
>>>>>>> On Friday, July 19, 2019 at 6:20:55 PM UTC+9, Claudiu wrote:
>>>>>>>>
>>>>>>>> Is there any way to pass bounding boxes to use to the LSTM? We have 
>>>>>>>> an algorithm that cleanly gets bounding boxes of MRZ characters. 
>>>>>>>> However 
>>>>>>>> the results using psm 10 are worse than passing the whole line in. Yet 
>>>>>>>> when 
>>>>>>>> we pass the whole line in we get these phantom characters. 
>>>>>>>>
>>>>>>>> Should PSM 10 mode work? It often returns “no character” where 
>>>>>>>> there clearly is one. I can supply a test case if it is expected to 
>>>>>>>> work 
>>>>>>>> well. 
>>>>>>>>
>>>>>>>> On Fri, Jul 19, 2019 at 11:06 AM ElGato ElMago <elmago...@gmail.com> 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Lorenzo,
>>>>>>>>>
>>>>>>>>> We both have got the same case.  It seems a solution to this 
>>>>>>>>> problem would save a lot of people.
>>>>>>>>>
>>>>>>>>> Shree,
>>>>>>>>>
>>>>>>>>> I pulled the current head of the master branch but it doesn't seem to 
>>>>>>>>> contain the merges you pointed to, which were merged 3 to 4 days ago.  
>>>>>>>>> How can I get them?
>>>>>>>>>
>>>>>>>>> ElMagoElGato
>>>>>>>>>
>>>>>>>>> On Friday, July 19, 2019 at 5:02:53 PM UTC+9, Lorenzo Blz wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> PSM 7 was a partial solution for my specific case: it improved 
>>>>>>>>>> the situation but did not solve it. Also, I could not use it in 
>>>>>>>>>> some other cases.
>>>>>>>>>>
>>>>>>>>>> The proper solution is very likely more training with more 
>>>>>>>>>> data; some data augmentation might help if data is scarce.
>>>>>>>>>> Doing less training might also help if the training is not done 
>>>>>>>>>> correctly.
>>>>>>>>>>
>>>>>>>>>> There are also similar issues on github:
>>>>>>>>>>
>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1465
>>>>>>>>>> ...
>>>>>>>>>>
>>>>>>>>>> The LSTM engine works like this: it scans the image and for each 
>>>>>>>>>> "pixel column" does this:
>>>>>>>>>>
>>>>>>>>>> M M M M N M M M [BLANK] F F F F
>>>>>>>>>>
>>>>>>>>>> (here I report only the highest-probability character for each column)
>>>>>>>>>>
>>>>>>>>>> In the example above an M is partially seen as an N. This is 
>>>>>>>>>> normal, and another step of the algorithm (beam search, I think) 
>>>>>>>>>> tries to aggregate the columns back into the correct characters.
>>>>>>>>>>
>>>>>>>>>> I think cases like this:
>>>>>>>>>>
>>>>>>>>>> M M M N N N M M
>>>>>>>>>>
>>>>>>>>>> are what give the phantom characters. More training should 
>>>>>>>>>> reduce the source of the problem, or a painful analysis of the 
>>>>>>>>>> bounding boxes might fix some cases.
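
To make the per-column picture concrete, here is a sketch of the textbook CTC "best path" rule: merge consecutive repeats, then drop blanks. This is not Tesseract's decoder (the message above guesses that it uses a beam search); the function name `best_path_collapse` is made up for this illustration.

```python
from itertools import groupby

BLANK = "[BLANK]"


def best_path_collapse(frames):
    """Merge runs of identical per-column labels, then remove blank labels."""
    merged = (label for label, _run in groupby(frames))
    return "".join(label for label in merged if label != BLANK)


if __name__ == "__main__":
    first = "M M M M N M M M [BLANK] F F F F".split()
    second = "M M M N N N M M".split()
    print(best_path_collapse(first))   # MNMF
    print(best_path_collapse(second))  # MNM
```

Under this naive rule even a single stray N column splits the M in two, so both sequences produce a phantom N. A decoder that weighs whole paths can absorb a one-column glitch, but a three-column run like the second sequence is much harder to discount, which fits the explanation above.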
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I used the attached script for the boxes.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Lorenzo
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Jul 19, 2019 at 7:25 AM ElGato ElMago <
>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Let's call them phantom characters then.
>>>>>>>>>>>
>>>>>>>>>>> Was psm 7 the solution for issue 1778?  None of the psm 
>>>>>>>>>>> options solved my problem, though I see different output.
>>>>>>>>>>>
>>>>>>>>>>> I mostly use tesseract 5.0-alpha, but 4.1 showed the same results 
>>>>>>>>>>> anyway.  How did you get the bounding box for each character?  Alto 
>>>>>>>>>>> and lstmbox only show the bbox for a group of characters.
>>>>>>>>>>>
>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>
>>>>>>>>>>> On Wednesday, July 17, 2019 at 6:58:31 PM UTC+9, Lorenzo Blz wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Phantom characters here for me too:
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1778
>>>>>>>>>>>>
>>>>>>>>>>>> Are you using 4.1? Bounding boxes were fixed in 4.1; maybe this 
>>>>>>>>>>>> was also improved.
>>>>>>>>>>>>
>>>>>>>>>>>> I wrote some code that uses the symbol iterator to discard symbols 
>>>>>>>>>>>> that are clearly duplicated: too small, overlapping, etc. But it 
>>>>>>>>>>>> was not easy to make it work decently, and it is not 100% reliable: 
>>>>>>>>>>>> there are false negatives and positives. I cannot share the code, 
>>>>>>>>>>>> and it is quite ugly anyway.
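
The code mentioned above was not shared, so the following is only a guess at the general idea, working on plain `(text, box, confidence)` tuples rather than on Tesseract's symbol iterator. The thresholds and the names `overlap_ratio` and `drop_duplicates` are hypothetical.

```python
def overlap_ratio(a, b):
    """Horizontal overlap of boxes a and b, divided by the narrower box's width."""
    inter = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    narrow = min(a[2] - a[0], b[2] - b[0])
    return inter / narrow if narrow > 0 else 1.0


def drop_duplicates(symbols, min_width=5, max_overlap=0.5):
    """Filter left-to-right symbols given as (text, (x0, y0, x1, y1), conf).

    Symbols narrower than min_width are dropped. When a symbol overlaps the
    previously kept one by more than max_overlap, only the more confident
    of the two is kept.
    """
    kept = []
    for sym in symbols:
        _text, box, conf = sym
        if box[2] - box[0] < min_width:
            continue  # too thin to be a real character
        if kept and overlap_ratio(kept[-1][1], box) > max_overlap:
            if conf > kept[-1][2]:
                kept[-1] = sym  # replace the weaker duplicate
            continue
        kept.append(sym)
    return kept


if __name__ == "__main__":
    symbols = [
        ("0", (100, 10, 120, 40), 95.0),
        ("O", (101, 10, 121, 40), 80.0),  # same place as the 0, lower confidence
        ("1", (125, 10, 140, 40), 99.0),
    ]
    print("".join(s[0] for s in drop_duplicates(symbols)))
```

As the message warns, fixed thresholds like these will produce both false positives and false negatives, so they would need tuning against real output.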
>>>>>>>>>>>>
>>>>>>>>>>>> Here is another MRZ model with training data:
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/DoubangoTelecom/tesseractMRZ
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Lorenzo
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jul 17, 2019 at 11:26 AM Claudiu <
>>>>>>>>>>>> csaf...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I’m getting the “phantom character” issue as well using the 
>>>>>>>>>>>>> OCRB that Shree trained on MRZ lines. For example, for a 0 it 
>>>>>>>>>>>>> will sometimes add both a 0 and an O to the output, thus outputting 
>>>>>>>>>>>>> 45 characters in total instead of 44. I haven’t looked at the 
>>>>>>>>>>>>> bounding box output yet, but I suspect a thin phantom character is 
>>>>>>>>>>>>> added somewhere that I can discard, or maybe two chars will have 
>>>>>>>>>>>>> the same bounding box. If anyone else has fixed this issue further 
>>>>>>>>>>>>> up (e.g. so the output doesn’t contain the phantom characters in 
>>>>>>>>>>>>> the first place), I’d be interested.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jul 17, 2019 at 10:01 AM ElGato ElMago <
>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'll go back to more training later.  Before doing so, I'd 
>>>>>>>>>>>>>> like to investigate the results a little.  The hocr and lstmbox 
>>>>>>>>>>>>>> options give some details of the positions of characters.  The 
>>>>>>>>>>>>>> results show positions that correspond perfectly to letters in the 
>>>>>>>>>>>>>> image.  But the text output contains a character that obviously 
>>>>>>>>>>>>>> does not exist.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Then I found a config file 'lstmdebug' that generates far 
>>>>>>>>>>>>>> more information.  I hope it explains what happened with each 
>>>>>>>>>>>>>> character.  I have yet to read the debug output, but I'd appreciate 
>>>>>>>>>>>>>> it if someone could tell me how to read it because it's really 
>>>>>>>>>>>>>> complex.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Friday, June 14, 2019 at 7:58:49 PM UTC+9, shree wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> See https://github.com/Shreeshrii/tessdata_MICR
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have uploaded my files there. 
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh
>>>>>>>>>>>>>>> is the bash script that runs the training.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> You can modify as needed. Please note this is for 
>>>>>>>>>>>>>>> legacy/base tesseract --oem 0.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago <
>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks a lot, shree.  It seems you know everything.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I tried the MICR0.traineddata and the first two 
>>>>>>>>>>>>>>>> mcr.traineddata.  The last one was blocked by the browser.  
>>>>>>>>>>>>>>>> Each traineddata gave mixed results.  All of them read the 
>>>>>>>>>>>>>>>> symbols fairly well but insert spaces randomly and misread 
>>>>>>>>>>>>>>>> some digits.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> MICR0 seems the best among them.  Did you suggest that 
>>>>>>>>>>>>>>>> you'd be able to update it?  It very often outputs a triple D 
>>>>>>>>>>>>>>>> where there's only one, and so on.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Also, I tried to fine-tune from MICR0 but I found that I 
>>>>>>>>>>>>>>>> need to change language-specific.sh.  It specifies some 
>>>>>>>>>>>>>>>> parameters for each language.  Do you have any guidance for it?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Friday, June 14, 2019 at 1:48:40 AM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> see 
>>>>>>>>>>>>>>>>> http://www.devscope.net/Content/ocrchecks.aspx 
>>>>>>>>>>>>>>>>> https://github.com/BigPino67/Tesseract-MICR-OCR
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ
>>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <
>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> It would be nice if there were traineddata out there, but I 
>>>>>>>>>>>>>>>>>> didn't find any.  I see free fonts and commercial OCR 
>>>>>>>>>>>>>>>>>> software but not traineddata.  The tessdata repository 
>>>>>>>>>>>>>>>>>> obviously doesn't have one, either.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Saturday, June 8, 2019 at 1:52:10 AM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Please also search for existing MICR traineddata files.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <
>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> So I did several tests from scratch.  In the last 
>>>>>>>>>>>>>>>>>>>> attempt, I made a training text with 4,000 lines in the 
>>>>>>>>>>>>>>>>>>>> following format,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 110004310510<   <02 :4002=0181:801= 0008752 <00039 ;0000001000;
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> and combined it with eng.digits.training_text, in which 
>>>>>>>>>>>>>>>>>>>> symbols are converted to E13B symbols.  This makes about 
>>>>>>>>>>>>>>>>>>>> 12,000 lines of training text.  It's amazing that this 
>>>>>>>>>>>>>>>>>>>> generates a working reader out of nowhere.  But it is 
>>>>>>>>>>>>>>>>>>>> still not very good.  For example:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> <01 :1901=1386:021= 1111001<10001< ;0000090134;
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> is the result for the attached image.  It's close, but the 
>>>>>>>>>>>>>>>>>>>> last '<' in the result text doesn't exist in the image.  
>>>>>>>>>>>>>>>>>>>> It's a small failure but it causes bigger trouble in 
>>>>>>>>>>>>>>>>>>>> parsing.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> What would you suggest from here to increase accuracy?  
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>    - Increase the number of lines in the training text
>>>>>>>>>>>>>>>>>>>>    - Mix up more variations in the training text
>>>>>>>>>>>>>>>>>>>>    - Increase the number of iterations
>>>>>>>>>>>>>>>>>>>>    - Investigate wrong reads one by one
>>>>>>>>>>>>>>>>>>>>    - Or else?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Also, I referred to engrestrict*.* and could generate a 
>>>>>>>>>>>>>>>>>>>> similar result with the fine-tuning-from-full method.  It 
>>>>>>>>>>>>>>>>>>>> seems to reach the same level a bit faster, but it also 
>>>>>>>>>>>>>>>>>>>> stops at a 'good' level.  I can go either way if it takes 
>>>>>>>>>>>>>>>>>>>> me to a bright future.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 3:56:02 PM UTC+9, ElGato ElMago wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks a lot, Shree. I'll look into it.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 2:39:52 PM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> See https://github.com/Shreeshrii/tessdata_shreetest
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Look at the files engrestrict*.* and also 
>>>>>>>>>>>>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Create a training text of about 100 lines and fine-tune 
>>>>>>>>>>>>>>>>>>>>>> for 400 iterations.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <
>>>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I had about 14 lines as attached.  How many lines 
>>>>>>>>>>>>>>>>>>>>>>> would you recommend?
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Fine-tuning gives a much better result, but it tends to 
>>>>>>>>>>>>>>>>>>>>>>> pick characters outside E13B, which has only 14 
>>>>>>>>>>>>>>>>>>>>>>> characters: 0 through 9 and 4 symbols.  I thought 
>>>>>>>>>>>>>>>>>>>>>>> training from scratch would eliminate such confusion.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Thursday, May 30, 2019 at 10:43:08 AM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> For training from scratch a large training text and 
>>>>>>>>>>>>>>>>>>>>>>>> hundreds of thousands of iterations are recommended. 
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> If you are just fine-tuning for a font, try to 
>>>>>>>>>>>>>>>>>>>>>>>> follow the instructions for training for Impact, with 
>>>>>>>>>>>>>>>>>>>>>>>> your font.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago, <
>>>>>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks, Shree.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Yes, I saw the instruction.  The steps I made are 
>>>>>>>>>>>>>>>>>>>>>>>>> as follows:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Using tesstrain.sh:
>>>>>>>>>>>>>>>>>>>>>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>>>>>>>>>>>>>>>>>>>>>>>>>   --noextract_font_properties --langdata_dir ../langdata \
>>>>>>>>>>>>>>>>>>>>>>>>>   --tessdata_dir ./tessdata \
>>>>>>>>>>>>>>>>>>>>>>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>>>>>>>>>>>>>>>>>>>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Training from scratch:
>>>>>>>>>>>>>>>>>>>>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>>>>>>>>>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
>>>>>>>>>>>>>>>>>>>>>>>>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>>>>>>>>>>>>   --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Test with base_checkpoint:
>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Combining output files:
>>>>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --stop_training \
>>>>>>>>>>>>>>>>>>>>>>>>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Test with eng.traineddata:
>>>>>>>>>>>>>>>>>>>>>>>>> tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> The training from scratch ended as:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, word train=0%, skip ratio=0%,  New best char error = 0 wrote best model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint wrote checkpoint.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> The test with base_checkpoint returns nothing but this:
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> The test with eng.traineddata and e13b.png returns 
>>>>>>>>>>>>>>>>>>>>>>>>> out.txt.  Both files are attached.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Training seems to have worked fine.  I don't know 
>>>>>>>>>>>>>>>>>>>>>>>>> how to interpret the test result from base_checkpoint.  
>>>>>>>>>>>>>>>>>>>>>>>>> The generated eng.traineddata obviously doesn't work well. 
>>>>>>>>>>>>>>>>>>>>>>>>> I suspect the choice of --traineddata when combining the 
>>>>>>>>>>>>>>>>>>>>>>>>> output files is bad, but I have no clue.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> BTW, I referred to your tess4training in the 
>>>>>>>>>>>>>>>>>>>>>>>>> process.  It helped a lot.
>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>> On Wednesday, May 29, 2019 at 7:14:08 PM UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> see 
>>>>>>>>>>>>>>>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>> On Wed, May 29, 2019 at 3:18 PM ElGato ElMago <
>>>>>>>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I want to make traineddata for the E13B font.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> I read the training tutorial and made a 
>>>>>>>>>>>>>>>>>>>>>>>>>>> base_checkpoint file according to the method in 
>>>>>>>>>>>>>>>>>>>>>>>>>>> Training From Scratch.  Now, how can I make traineddata 
>>>>>>>>>>>>>>>>>>>>>>>>>>> from the base_checkpoint file?
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>>>>>>>>>>>> You received this message because you are 
>>>>>>>>>>>>>>>>>>>>>>>>>>> subscribed to the Google Groups "tesseract-ocr" 
>>>>>>>>>>>>>>>>>>>>>>>>>>> group.
>>>>>>>>>>>>>>>>>>>>>>>>>>> To unsubscribe from this group and stop 
>>>>>>>>>>>>>>>>>>>>>>>>>>> receiving emails from it, send an email to 
>>>>>>>>>>>>>>>>>>>>>>>>>>> tesser...@googlegroups.com.
>>>>>>>>>>>>>>>>>>>>>>>>>>> To post to this group, send email to 
>>>>>>>>>>>>>>>>>>>>>>>>>>> tesser...@googlegroups.com.
>>>>>>>>>>>>>>>>>>>>>>>>>>> Visit this group at 
>>>>>>>>>>>>>>>>>>>>>>>>>>> https://groups.google.com/group/tesseract-ocr.
>>>>>>>>>>>>>>>>>>>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>>>>>>>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/4848cfa5-ae2b-4be3-a771-686aa0fec702%40googlegroups.com
>>>>>>>>>>>>>>>>>>>>>>>>>>> For more options, visit 
>>>>>>>>>>>>>>>>>>>>>>>>>>> https://groups.google.com/d/optout.
>>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/454f8f94-6000-4bf5-9129-9682cf1c6f65%40googlegroups.com.
