Re: [tesseract-ocr] Trained data for E13B font

ElGato ElMago Sun, 09 Jun 2019 22:52:34 -0700

It would be nice if there were traineddata out there, but I didn't find any.  I 
see free fonts and commercial OCR software but no traineddata.  The tessdata 
repository obviously doesn't have one, either.


On Saturday, June 8, 2019 at 1:52:10 UTC+9, shree wrote:
>
> Please also search for existing MICR traineddata files.
>
> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <elmago...@gmail.com> wrote:
>
>> So I did several tests from scratch.  In the last attempt, I made a 
>> training text with 4,000 lines in the following format,
>>
>> 110004310510<   <02 :4002=0181:801= 0008752 <00039 ;0000001000;
>>
>>
>> and combined it with eng.digits.training_text, in which the symbols are 
>> converted to E13B symbols.  This makes about 12,000 lines of training 
>> text.  It's amazing that this produces a working reader out of 
>> nowhere, but it is still not accurate enough.  For example:
>>
>> <01 :1901=1386:021= 1111001<10001< ;0000090134;
>>
>> is the result for the attached image.  It's close, but the last '<' in the 
>> result text doesn't exist in the image.  It's a small failure, but it causes 
>> much bigger trouble in parsing.
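
Because one spurious symbol is enough to break downstream parsing, whole-line exact match can be a more telling score than character error rate. The following is a minimal sketch, assuming a ground-truth text file with one expected reading per line and an OCR output file in the same order. The function name `report_misreads` and both files are hypothetical and not part of Tesseract.

```shell
# Print every line where the OCR output differs from the ground truth.
# Returns 1 if at least one line differs, 0 otherwise.
# $1: ground-truth file, $2: OCR output file (one reading per line each).
report_misreads() {
  # Join the two files line by line with a tab, then compare the halves.
  paste -d '\t' "$1" "$2" | awk -F '\t' '
    $1 != $2 {
      printf("line %d\n  expected: %s\n  got:      %s\n", NR, $1, $2)
      bad++
    }
    END { exit (bad > 0) }'
}
```

Running `report_misreads truth.txt out.txt` after each training run lists exactly the readings that need to be investigated one by one.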
>>
>> What would you suggest from here to increase accuracy?  
>>
>>    - Increase the number of lines in the training text
>>    - Mix up more variations in the training text
>>    - Increase the number of iterations
>>    - Investigate wrong reads one by one
>>    - Or else?
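
One cheap way to try the first two options is to generate the training lines instead of writing them by hand. The following is a minimal sketch, assuming every line follows the layout of the single sample line quoted above (field widths are copied from that sample, and real documents may well vary). The function name `gen_e13b_lines` is made up for illustration.

```shell
# Emit random training lines shaped like the sample line in this thread.
# $1: number of lines to emit, $2: random seed (optional, default 1).
gen_e13b_lines() {
  awk -v n="$1" -v seed="${2:-1}" '
    # Return a string of w random decimal digits.
    function digits(w,   s, i) {
      s = ""
      for (i = 0; i < w; i++) s = s int(rand() * 10)
      return s
    }
    BEGIN {
      srand(seed)
      for (k = 0; k < n; k++) {
        # Layout assumed from the sample: symbols < : = ; stand in for E13B.
        printf("%s<   <%s :%s=%s:%s= %s <%s ;%s;\n",
               digits(12), digits(2), digits(4), digits(4), digits(3),
               digits(7), digits(5), digits(10))
      }
    }'
}
```

For example, `gen_e13b_lines 4000 42 > e13b_generated.txt` (file name hypothetical) writes 4,000 reproducible lines. Varying the layout itself, not only the digits, would need more format strings.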
>>
>> Also, I referred to engrestrict*.* and could get a similar result with 
>> the fine-tuning-from-full method.  It seems to reach the same level a bit 
>> faster, but it also stops at a 'good' level.  I can go either way if it 
>> takes me to a bright future.
>>
>> Regards,
>> ElMagoElGato
>>
>> On Thursday, May 30, 2019 at 15:56:02 UTC+9, ElGato ElMago wrote:
>>>
>>> Thanks a lot, Shree. I'll look into it.
>>>
>>> On Thursday, May 30, 2019 at 14:39:52 UTC+9, shree wrote:
>>>>
>>>> See https://github.com/Shreeshrii/tessdata_shreetest
>>>>
>>>> Look at the files engrestrict*.* and also 
>>>> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>>
>>>> Create training text of about 100 lines and finetune for 400 iterations.
>>>>
>>>>
>>>>
>>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <elmago...@gmail.com> 
>>>> wrote:
>>>>
>>>>> I had about 14 lines as attached.  How many lines would you recommend?
>>>>>
>>>>> Fine tuning gives a much better result, but it tends to pick characters 
>>>>> outside E13B, which has only 14 characters: 0 through 9 and 4 
>>>>> symbols.  I thought training from scratch would eliminate such confusion.
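
This kind of confusion can at least be detected mechanically. The following is a minimal sketch, assuming the four E13B symbols are mapped to `<`, `:`, `=` and `;` as in the sample lines elsewhere in this thread. The function name `non_e13b_chars` is made up for illustration.

```shell
# Read OCR text on stdin and print each distinct character that is not
# one of the 14 assumed E13B characters (digits plus < : = ;) or
# whitespace.  Prints nothing when the text is clean.
non_e13b_chars() {
  # Delete allowed characters and whitespace, then list what is left,
  # one character per line, de-duplicated and joined back together.
  tr -d '0-9<:=; \t\n\r' | fold -w 1 | sort -u | tr -d '\n'
}
```

For example, `non_e13b_chars < out.txt` prints nothing for a clean reading and prints the offending letters otherwise.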
>>>>>
>>>>> On Thursday, May 30, 2019 at 10:43:08 UTC+9, shree wrote:
>>>>>>
>>>>>> For training from scratch a large training text and hundreds of 
>>>>>> thousands of iterations are recommended. 
>>>>>>
>>>>>> If you are just fine tuning for a font, try to follow the instructions for 
>>>>>> fine tuning for Impact, with your font.
>>>>>>
>>>>>>
>>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago, <elmago...@gmail.com> 
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks, Shree.
>>>>>>>
>>>>>>> Yes, I saw the instructions.  The steps I took are as follows:
>>>>>>>
>>>>>>> Using tesstrain.sh:
>>>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng --linedata_only \
>>>>>>>   --noextract_font_properties --langdata_dir ../langdata \
>>>>>>>   --tessdata_dir ./tessdata \
>>>>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>>>>
>>>>>>> Training from scratch:
>>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>>>>   --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
>>>>>>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>>   --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log
>>>>>>>
>>>>>>> Test with base_checkpoint:
>>>>>>> src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>>>>>>
>>>>>>> Combining output files:
>>>>>>> src/training/lstmtraining --stop_training \
>>>>>>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>>>>>>
>>>>>>> Test with eng.traineddata:
>>>>>>> tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput
>>>>>>>
>>>>>>>
>>>>>>> The training from scratch ended as:
>>>>>>>
>>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%, word train=0%, skip ratio=0%,  New best char error = 0 wrote best model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint wrote checkpoint.
>>>>>>>
>>>>>>>
>>>>>>> The test with base_checkpoint returns nothing as:
>>>>>>>
>>>>>>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>>>>>>
>>>>>>>
>>>>>>> The test with eng.traineddata and e13b.png returns out.txt.  Both 
>>>>>>> files are attached.
>>>>>>>
>>>>>>> Training seems to have worked fine.  I don't know how to interpret 
>>>>>>> the test result from base_checkpoint.  The generated eng.traineddata 
>>>>>>> obviously doesn't work well.  I suspect the choice of --traineddata when 
>>>>>>> combining the output files is wrong, but I have no clue.
>>>>>>>
>>>>>>> Regards,
>>>>>>> ElMagoElGato
>>>>>>>
>>>>>>> BTW, I referred to your tess4training in the process.  It helped a 
>>>>>>> lot.
>>>>>>>
>>>>>>> On Wednesday, May 29, 2019 at 19:14:08 UTC+9, shree wrote:
>>>>>>>>
>>>>>>>> see 
>>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files
>>>>>>>>
>>>>>>>> On Wed, May 29, 2019 at 3:18 PM ElGato ElMago <elmago...@gmail.com> 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I wish to make traineddata for the E13B font.
>>>>>>>>>
>>>>>>>>> I read the training tutorial and made a base_checkpoint file 
>>>>>>>>> according to the method in Training From Scratch.  Now, how can I 
>>>>>>>>> make traineddata from the base_checkpoint file?
>>>>>>>>>
>>>>>>>>> -- 
>>>>>>>>> You received this message because you are subscribed to the Google 
>>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>>>>> send an email to tesser...@googlegroups.com.
>>>>>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>>>> To view this discussion on the web visit 
>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/4848cfa5-ae2b-4be3-a771-686aa0fec702%40googlegroups.com
>>>>>>>>>  
>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/4848cfa5-ae2b-4be3-a771-686aa0fec702%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>> .
>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> -- 
>>>>>>>>
>>>>>>>> ____________________________________________________________
>>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>>>

