Re: [tesseract-ocr] Trained data for E13B font

ElGato ElMago Tue, 23 Jul 2019 22:10:59 -0700

Hi,



I got this result from hocr.  This is where one of the phantom characters 
comes from.

<span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; x_conf 98.864532'>&lt;</span>
<span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; x_conf 99.018097'>;</span>


The first character is the phantom.  It starts at the same x coordinate as 
the second character, which really exists.  The first character is only 3 
pixels wide.  I attach ScrollView screenshots that visualize this.
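
The width check can be automated.  Below is a minimal sketch, not Tesseract 
code, that pulls the ocrx_cinfo spans out of hocr text with a regular 
expression and lists glyphs whose box is suspiciously narrow.  The function 
name narrow_glyphs and the 5-pixel threshold are my own assumptions; the 
regular expression only covers spans formatted like the two quoted above.

```python
import html
import re

# Matches one character-level span in the form shown in the hocr excerpt.
CINFO = re.compile(
    r"<span class='ocrx_cinfo' title='x_bboxes (\d+) (\d+) (\d+) (\d+); "
    r"x_conf\s+([\d.]+)'>(.*?)</span>"
)


def narrow_glyphs(hocr, min_width=5):
    """Return (char, width, conf) for glyph boxes narrower than min_width pixels.

    min_width is a hypothetical threshold and needs tuning per image resolution.
    """
    suspects = []
    for x0, _y0, x1, _y1, conf, text in CINFO.findall(hocr):
        width = int(x1) - int(x0)
        if width < min_width:
            suspects.append((html.unescape(text), width, float(conf)))
    return suspects


sample = (
    "<span class='ocrx_cinfo' title='x_bboxes 1259 902 1262 933; "
    "x_conf 98.864532'>&lt;</span>\n"
    "<span class='ocrx_cinfo' title='x_bboxes 1259 904 1281 933; "
    "x_conf 99.018097'>;</span>"
)
print(narrow_glyphs(sample))  # [('<', 3, 98.864532)]
```

Note that the phantom here has a confidence of almost 99, so confidence alone 
would not have caught it.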

[image: 2019-07-24-132643_854x707_scrot.png][image: 
2019-07-24-132800_854x707_scrot.png]


There seem to be other cases that cause phantom characters.  I'll look 
into them.  But I have a trivial question now.  I made ScrollView show these 
displays by accidentally clicking the Display->Blamer menu.  There is a 
Bounding Boxes menu below it, but it ends up showing a blue screen, though it 
briefly shows boxes on the way.  Can I use this menu at all?  It would be 
very useful.

[image: 2019-07-24-140739_854x707_scrot.png]


On Tuesday, 23 July 2019 at 17:10:36 UTC+9, ElGato ElMago wrote:
>
> It's great! Perfect!  Thanks a lot!
>
> On Tuesday, 23 July 2019 at 10:56:58 UTC+9, shree wrote:
>>
>> See https://github.com/tesseract-ocr/tesseract/issues/2580
>>
>> On Tue, 23 Jul 2019, 06:23 ElGato ElMago, <elmago...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I read the output of hocr with lstm_choice_mode = 4, as per pull 
>>> request 2554.  It shows the candidates for each character but doesn't show 
>>> the bounding box of each character.  It only shows the box for a whole word.
>>>
>>> I see bounding boxes of each character in comments of the pull request 
>>> 2576.  How can I do that?  Do I have to look in the source code and 
>>> manipulate such an output on my own?
>>>
>>> On Friday, 19 July 2019 at 18:40:49 UTC+9, ElGato ElMago wrote:
>>>
>>>> Lorenzo,
>>>>
>>>> I haven't been checking psm too much.  Will turn to those options after 
>>>> I see how it goes with bounding boxes.
>>>>
>>>> Shree,
>>>>
>>>> I see the merges in the git log and also see that the new 
>>>> option lstm_choice_amount works now.  I guess my executable is the latest, 
>>>> though I still see the phantom character.  Hocr makes huge and complex 
>>>> output.  I'll take some time to read it.
>>>>
>>>> On Friday, 19 July 2019 at 18:20:55 UTC+9, Claudiu wrote:
>>>>>
>>>>> Is there any way to pass bounding boxes to use to the LSTM? We have an 
>>>>> algorithm that cleanly gets bounding boxes of MRZ characters. However the 
>>>>> results using psm 10 are worse than passing the whole line in. Yet when 
>>>>> we 
>>>>> pass the whole line in we get these phantom characters. 
>>>>>
>>>>> Should PSM 10 mode work? It often returns “no character” where there 
>>>>> clearly is one. I can supply a test case if it is expected to work well. 
>>>>>
>>>>> On Fri, Jul 19, 2019 at 11:06 AM ElGato ElMago <elmago...@gmail.com> 
>>>>> wrote:
>>>>>
>>>>>> Lorenzo,
>>>>>>
>>>>>> We both have got the same case.  It seems a solution to this problem 
>>>>>> would save a lot of people.
>>>>>>
>>>>>> Shree,
>>>>>>
>>>>>> I pulled the current head of the master branch but it doesn't seem to 
>>>>>> contain the merges you pointed to, which were merged 3 to 4 days ago.  
>>>>>> How can I get them?
>>>>>>
>>>>>> ElMagoElGato
>>>>>>
>>>>>> On Friday, 19 July 2019 at 17:02:53 UTC+9, Lorenzo Blz wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> PSM 7 was a partial solution for my specific case, it improved the 
>>>>>>> situation but did not solve it. Also I could not use it in some other 
>>>>>>> cases.
>>>>>>>
>>>>>>> The proper solution is very likely doing more training with more 
>>>>>>> data; some data augmentation might help if data is scarce.
>>>>>>> Also, doing less training might help if the training is not done 
>>>>>>> correctly.
>>>>>>>
>>>>>>> There are also similar issues on github:
>>>>>>>
>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1465
>>>>>>> ...
>>>>>>>
>>>>>>> The LSTM engine works like this: it scans the image and for each 
>>>>>>> "pixel column" does this:
>>>>>>>
>>>>>>> M M M M N M M M [BLANK] F F F F
>>>>>>>
>>>>>>> (here I report only the highest probability characters)
>>>>>>>
>>>>>>> In the example above an M is partially seen as an N, this is normal, 
>>>>>>> and another step of the algorithm (beam search I think) tries to 
>>>>>>> aggregate 
>>>>>>> back the correct characters.
>>>>>>>
>>>>>>> I think cases like this:
>>>>>>>
>>>>>>> M M M N N N M M
>>>>>>>
>>>>>>> are what give the phantom characters. More training should reduce 
>>>>>>> the source of the problem, or a painful analysis of the bounding boxes 
>>>>>>> might fix some cases.
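
To make the mechanism described above concrete, here is a toy sketch of 
CTC-style greedy decoding: merge repeated per-column labels, then drop the 
blanks.  It is a deliberate simplification and not Tesseract's decoder, 
which is said above to use something like a beam search over the 
probabilities.  The names collapse and BLANK are made up for the 
illustration.

```python
from itertools import groupby

BLANK = "[BLANK]"


def collapse(frames):
    """Greedy CTC-style decoding: merge runs of equal labels, then drop blanks."""
    merged = [label for label, _run in groupby(frames)]
    return "".join(label for label in merged if label != BLANK)


clean = "M M M M M M M M [BLANK] F F F F".split()
noisy = "M M M N N N M M [BLANK] F F F F".split()

print(collapse(clean))  # MF
print(collapse(noisy))  # MNMF
```

With a purely greedy collapse, the run of N frames in the middle of the M 
splits it into three characters.  A decoder that weighs the probabilities 
can often merge a short run back into the M, which would explain why only 
some glyphs end up with a phantom.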
>>>>>>>
>>>>>>>
>>>>>>> I used the attached script for the boxes.
>>>>>>>
>>>>>>>
>>>>>>> Lorenzo
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, 19 Jul 2019 at 07:25, ElGato ElMago <
>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Let's call them phantom characters then.
>>>>>>>>
>>>>>>>> Was psm 7 the solution for issue 1778?  None of the psm options 
>>>>>>>> solved my problem, though I see different output.
>>>>>>>>
>>>>>>>> I use tesseract 5.0-alpha mostly but 4.1 showed the same results 
>>>>>>>> anyway.  How did you get bounding box for each character?  Alto and 
>>>>>>>> lstmbox 
>>>>>>>> only show bbox for a group of characters.
>>>>>>>>
>>>>>>>> ElMagoElGato
>>>>>>>>
>>>>>>>> On Wednesday, 17 July 2019 at 18:58:31 UTC+9, Lorenzo Blz wrote:
>>>>>>>>
>>>>>>>>> Phantom characters here for me too:
>>>>>>>>>
>>>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/1778
>>>>>>>>>
>>>>>>>>> Are you using 4.1? Bounding boxes were fixed in 4.1; maybe this was 
>>>>>>>>> also improved.
>>>>>>>>>
>>>>>>>>> I wrote some code that uses symbols iterator to discard symbols 
>>>>>>>>> that are clearly duplicated: too small, overlapping, etc. But it was 
>>>>>>>>> not 
>>>>>>>>> easy to make it work decently and it is not 100% reliable with false 
>>>>>>>>> negatives and positives. I cannot share the code and it is quite ugly 
>>>>>>>>> anyway.
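
The code mentioned above was not shared, so the following is only a rough 
sketch of that kind of filter, under my own assumptions.  It works on plain 
(char, x0, x1, conf) tuples rather than on the Tesseract symbol iterator, 
and the name drop_phantoms and both thresholds are hypothetical.

```python
def drop_phantoms(symbols, min_width=5, max_overlap=0.5):
    """Filter symbol boxes that are too thin or that duplicate a neighbour.

    symbols: list of (char, x0, x1, conf) tuples sorted left to right.
    min_width and max_overlap are hypothetical thresholds to tune per model.
    """
    kept = []
    for sym in symbols:
        _char, x0, x1, conf = sym
        if x1 - x0 < min_width:
            continue  # too thin to be a real glyph
        if kept:
            _pchar, px0, px1, pconf = kept[-1]
            overlap = min(x1, px1) - max(x0, px0)
            shorter = min(x1 - x0, px1 - px0)
            if overlap > max_overlap * shorter:
                # Two boxes cover the same ink: keep the more confident one.
                if conf > pconf:
                    kept[-1] = sym
                continue
        kept.append(sym)
    return kept


# Made-up boxes: a "0" read twice (as 0 and O) and a 3-pixel-wide "<".
symbols = [
    ("4", 100, 120, 99.0),
    ("0", 125, 145, 97.0),
    ("O", 126, 146, 60.0),
    ("<", 150, 153, 98.0),
    ("7", 155, 175, 99.0),
]
print("".join(s[0] for s in drop_phantoms(symbols)))  # 407
```

As said above, a filter like this gives both false positives and false 
negatives.  A genuinely narrow glyph would be dropped by the width test, for 
example.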
>>>>>>>>>
>>>>>>>>> Here is another MRZ model with training data:
>>>>>>>>>
>>>>>>>>> https://github.com/DoubangoTelecom/tesseractMRZ
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Lorenzo
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, 17 Jul 2019 at 11:26, Claudiu <
>>>>>>>>> csaf...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I’m getting the “phantom character” issue as well using the OCRB 
>>>>>>>>>> that Shree trained on MRZ lines. For example, for a 0 it will 
>>>>>>>>>> sometimes add both a 0 and an O to the output, thus outputting 45 
>>>>>>>>>> characters in total instead of 44. I haven’t looked at the bounding 
>>>>>>>>>> box output yet, but I suspect a thin phantom character is added 
>>>>>>>>>> somewhere that I can discard, or maybe two chars will have the same 
>>>>>>>>>> bounding box. If anyone else has fixed this issue further up (e.g. 
>>>>>>>>>> so the output doesn’t contain the phantom characters in the first 
>>>>>>>>>> place), I’d be interested.
>>>>>>>>>>
>>>>>>>>>> On Wed, Jul 17, 2019 at 10:01 AM ElGato ElMago <
>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I'll go back to more training later.  Before doing so, I'd 
>>>>>>>>>>> like to investigate results a little bit.  The hocr and lstmbox 
>>>>>>>>>>> options 
>>>>>>>>>>> give some details of positions of characters.  The results show 
>>>>>>>>>>> positions 
>>>>>>>>>>> that perfectly correspond to letters in the image.  But the text 
>>>>>>>>>>> output 
>>>>>>>>>>> contains a character that obviously does not exist.
>>>>>>>>>>>
>>>>>>>>>>> Then I found a config file 'lstmdebug' that generates far more 
>>>>>>>>>>> information.  I hope it explains what happened with each character.  
>>>>>>>>>>> I have yet to read the debug output, but I'd appreciate it if 
>>>>>>>>>>> someone could tell me how to read it because it's really complex.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>
>>>>>>>>>>> On Friday, 14 June 2019 at 19:58:49 UTC+9, shree wrote:
>>>>>>>>>>>
>>>>>>>>>>>> See https://github.com/Shreeshrii/tessdata_MICR
>>>>>>>>>>>>
>>>>>>>>>>>> I have uploaded my files there. 
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh
>>>>>>>>>>>> is the bash script that runs the training.
>>>>>>>>>>>>
>>>>>>>>>>>> You can modify as needed. Please note this is for legacy/base 
>>>>>>>>>>>> tesseract --oem 0.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago <
>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks a lot, shree.  It seems you know everything.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I tried the MICR0.traineddata and the first two 
>>>>>>>>>>>>> mcr.traineddata.  The last one was blocked by the browser.  Each 
>>>>>>>>>>>>> traineddata gave mixed results.  All of them read symbols fairly 
>>>>>>>>>>>>> well but insert spaces randomly and misread some numbers.
>>>>>>>>>>>>>
>>>>>>>>>>>>> MICR0 seems the best among them.  Did you suggest that you'd 
>>>>>>>>>>>>> be able to update it?  It often reads a triple D where there's 
>>>>>>>>>>>>> only one, and so on.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Also, I tried to fine tune from MICR0 but I found that I need 
>>>>>>>>>>>>> to change the language-specific.sh.  It specifies some parameters 
>>>>>>>>>>>>> for each 
>>>>>>>>>>>>> language.  Do you have any guidance for it?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Friday, 14 June 2019 at 1:48:40 UTC+9, shree wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> see 
>>>>>>>>>>>>>> http://www.devscope.net/Content/ocrchecks.aspx 
>>>>>>>>>>>>>> https://github.com/BigPino67/Tesseract-MICR-OCR
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ
>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago <
>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> That'd be nice if there's traineddata out there but I 
>>>>>>>>>>>>>>> didn't find any.  I see free fonts and commercial OCR software 
>>>>>>>>>>>>>>> but not 
>>>>>>>>>>>>>>> traineddata.  Tessdata repository obviously doesn't have one, 
>>>>>>>>>>>>>>> either.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Saturday, 8 June 2019 at 1:52:10 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please also search for existing MICR traineddata files.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <
>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So I did several tests from scratch.  In the last attempt, 
>>>>>>>>>>>>>>>>> I made a training text with 4,000 lines in the following 
>>>>>>>>>>>>>>>>> format,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 110004310510<   <02 :4002=0181:801= 0008752 <00039 
>>>>>>>>>>>>>>>>> ;0000001000;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> and combined it with eng.digits.training_text in which 
>>>>>>>>>>>>>>>>> symbols are converted to E13B symbols.  This makes about 
>>>>>>>>>>>>>>>>> 12,000 lines of 
>>>>>>>>>>>>>>>>> training text.  It's amazing that this thing generates a good 
>>>>>>>>>>>>>>>>> reader out of 
>>>>>>>>>>>>>>>>> nowhere.  But then it is not very good.  For example:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> <01 :1901=1386:021= 1111001<10001< ;0000090134;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> is a result on the image attached.  It's close, but the 
>>>>>>>>>>>>>>>>> last '<' in the result text doesn't exist in the image.  It's 
>>>>>>>>>>>>>>>>> a small failure but it causes greater trouble in parsing.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> What would you suggest from here to increase accuracy?  
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    - Increase the number of lines in the training text
>>>>>>>>>>>>>>>>>    - Mix up more variations in the training text
>>>>>>>>>>>>>>>>>    - Increase the number of iterations
>>>>>>>>>>>>>>>>>    - Investigate wrong reads one by one
>>>>>>>>>>>>>>>>>    - Or else?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Also, I referred to engrestrict*.* and could generate a 
>>>>>>>>>>>>>>>>> similar result with the fine-tuning-from-full method.  It 
>>>>>>>>>>>>>>>>> seems a bit faster to get to the same level but it also stops 
>>>>>>>>>>>>>>>>> at a 'good' level.  I can go either way if it takes me to the 
>>>>>>>>>>>>>>>>> bright future.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thursday, 30 May 2019 at 15:56:02 UTC+9, ElGato ElMago wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks a lot, Shree. I'll look into it.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thursday, 30 May 2019 at 14:39:52 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> See https://github.com/Shreeshrii/tessdata_shreetest
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Look at the files engrestrict*.* and also 
>>>>>>>>>>>>>>>>>>> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Create training text of about 100 lines and finetune for 
>>>>>>>>>>>>>>>>>>> 400 lines 
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <
>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I had about 14 lines as attached.  How many lines would 
>>>>>>>>>>>>>>>>>>>> you recommend?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Fine tuning gives a much better result but it tends to 
>>>>>>>>>>>>>>>>>>>> pick characters outside E13B, which has only 14 
>>>>>>>>>>>>>>>>>>>> characters: 0 through 9 and 4 symbols.  I thought 
>>>>>>>>>>>>>>>>>>>> training from scratch would eliminate such confusion.
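
One quick way to measure that confusion is to list every character in the 
OCR output that is outside the 14-character set.  The sketch below assumes 
the four E13B symbols are written as '<', ':', '=' and ';', as in the sample 
lines elsewhere in this thread.  The name foreign_chars is hypothetical.

```python
# Assumption: the four E13B symbols are mapped to these ASCII stand-ins.
E13B_ALPHABET = set("0123456789<:=;")


def foreign_chars(text):
    """Return the sorted set of non-space characters outside the E13B alphabet."""
    return sorted(
        {ch for ch in text if ch not in E13B_ALPHABET and not ch.isspace()}
    )


print(foreign_chars("<01 :1901=1386:021= 1111O01<10001 ;0000090134;"))  # ['O']
```

An empty list means the model stayed inside the alphabet.  It says nothing 
about whether the digits themselves were read correctly.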
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Thursday, 30 May 2019 at 10:43:08 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> For training from scratch a large training text and 
>>>>>>>>>>>>>>>>>>>>> hundreds of thousands of iterations are recommended. 
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> If you are just fine tuning for a font try to follow 
>>>>>>>>>>>>>>>>>>>>> instructions for training for impact, with your font.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago, <
>>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Thanks, Shree.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Yes, I saw the instruction.  The steps I made are as 
>>>>>>>>>>>>>>>>>>>>>> follows:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Using tesstrain.sh:
>>>>>>>>>>>>>>>>>>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>>>>>>>>>>>>>>>>>>>>>>   --noextract_font_properties --langdata_dir ../langdata \
>>>>>>>>>>>>>>>>>>>>>>   --tessdata_dir ./tessdata \
>>>>>>>>>>>>>>>>>>>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>>>>>>>>>>>>>>>>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Training from scratch:
>>>>>>>>>>>>>>>>>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>>>>>>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
>>>>>>>>>>>>>>>>>>>>>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>>>>>>>>>>>>>>>>   --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Test with base_checkpoint:
>>>>>>>>>>>>>>>>>>>>>> src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Combining output files:
>>>>>>>>>>>>>>>>>>>>>> src/training/lstmtraining --stop_training \
>>>>>>>>>>>>>>>>>>>>>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>>>>>>>>>>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>>>>>>>>>>>>>>>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Test with eng.traineddata:
>>>>>>>>>>>>>>>>>>>>>> tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The training from scratch ended as:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, word train=0%, skip ratio=0%,  New best char error = 0 wrote best model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint wrote checkpoint.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The test with base_checkpoint returns nothing as:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> The test with eng.traineddata and e13b.png returns 
>>>>>>>>>>>>>>>>>>>>>> out.txt.  Both files are attached.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Training seems to have worked fine.  I don't know how 
>>>>>>>>>>>>>>>>>>>>>> to interpret the test result from base_checkpoint.  The 
>>>>>>>>>>>>>>>>>>>>>> generated eng.traineddata obviously doesn't work well. 
>>>>>>>>>>>>>>>>>>>>>> I suspect the choice of --traineddata in combining 
>>>>>>>>>>>>>>>>>>>>>> output files is bad but I have no clue.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>>>>> ElMagoElGato
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> BTW, I referred to your tess4training in the 
>>>>>>>>>>>>>>>>>>>>>> process.  It helped a lot.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Wednesday, 29 May 2019 at 19:14:08 UTC+9, shree wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> see 
>>>>>>>>>>>>>>>>>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> On Wed, May 29, 2019 at 3:18 PM ElGato ElMago <
>>>>>>>>>>>>>>>>>>>>>>> elmago...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I wish to make traineddata for the E13B font.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> I read the training tutorial and made a 
>>>>>>>>>>>>>>>>>>>>>>>> base_checkpoint file according to the method in 
>>>>>>>>>>>>>>>>>>>>>>>> Training From Scratch.  
>>>>>>>>>>>>>>>>>>>>>>>> Now, how can I make traineddata from the 
>>>>>>>>>>>>>>>>>>>>>>>> base_checkpoint file?
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>>>>>>>>> You received this message because you are 
>>>>>>>>>>>>>>>>>>>>>>>> subscribed to the Google Groups "tesseract-ocr" group.
>>>>>>>>>>>>>>>>>>>>>>>> To unsubscribe from this group and stop receiving 
>>>>>>>>>>>>>>>>>>>>>>>> emails from it, send an email to 
>>>>>>>>>>>>>>>>>>>>>>>> tesser...@googlegroups.com.
>>>>>>>>>>>>>>>>>>>>>>>> To post to this group, send email to 
>>>>>>>>>>>>>>>>>>>>>>>> tesser...@googlegroups.com.
>>>>>>>>>>>>>>>>>>>>>>>> Visit this group at 
>>>>>>>>>>>>>>>>>>>>>>>> https://groups.google.com/group/tesseract-ocr.
>>>>>>>>>>>>>>>>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>>>>>>>>>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/4848cfa5-ae2b-4be3-a771-686aa0fec702%40googlegroups.com
>>>>>>>>>>>>>>>>>>>>>>>>  
>>>>>>>>>>>>>>>>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/4848cfa5-ae2b-4be3-a771-686aa0fec702%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>>>>>>>>>>>>>>>>> .
>>>>>>>>>>>>>>>>>>>>>>>> For more options, visit 
>>>>>>>>>>>>>>>>>>>>>>>> https://groups.google.com/d/optout.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> ____________________________________________________________
>>>>>>>>>>>>>>>>>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>>>>>>>>>>>>>>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/7f1fd2ea-3cd9-4d75-a037-2b2390c4271d%40googlegroups.com.
