Re: [tesseract-ocr] Trained data for E13B font

Shree Devi Kumar Fri, 07 Jun 2019 09:52:13 -0700

Please also search for existing MICR traineddata files.

On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago <elmagoelg...@gmail.com> wrote:


> So I did several tests from scratch.  In the last attempt, I made a
> training text with 4,000 lines in the following format,
>
> 110004310510<   <02 :4002=0181:801= 0008752 <00039 ;0000001000;
>
>
> and combined it with eng.digits.training_text in which symbols are
> converted to E13B symbols.  This makes about 12,000 lines of training
> text.  It's amazing that this thing generates a good reader out of
> nowhere.  But then it is not very good.  For example:
>
> <01 :1901=1386:021= 1111001<10001< ;0000090134;
>
> is the result for the attached image.  It's close, but the last '<' in the
> result text doesn't exist in the image.  It's a small error, but it causes
> bigger trouble in parsing.
>
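The training text described above (thousands of lines of digits mixed with the four stand-in symbols) could be produced by a small generator. The following is only a minimal sketch and not the script used in this thread: the function name gen_e13b_lines, the field count, and the field lengths are all assumptions, loosely modelled on the single sample line quoted above.

```shell
#!/bin/sh
# Hypothetical generator for E13B-style training lines.
# Assumption: a line is a few runs of digits, each wrapped in two of the
# four ASCII stand-ins for the E13B symbols (< : = ;), separated by spaces.
gen_e13b_lines() {
  # $1 = number of lines to print, $2 = random seed (for repeatable output)
  awk -v n="$1" -v seed="$2" '
    # Return a string of len random decimal digits.
    function digits(len,    out, i) {
      out = ""
      for (i = 0; i < len; i++) out = out sprintf("%d", int(rand() * 10))
      return out
    }
    BEGIN {
      srand(seed)
      split("< : = ;", sym, " ")
      for (l = 1; l <= n; l++) {
        line = ""
        fields = 3 + int(rand() * 4)            # 3 to 6 fields per line
        for (f = 1; f <= fields; f++) {
          lead  = sym[1 + int(rand() * 4)]
          trail = sym[1 + int(rand() * 4)]
          field = lead digits(2 + int(rand() * 9)) trail   # 2 to 10 digits
          line = (f == 1) ? field : line " " field
        }
        print line
      }
    }'
}

# Print five sample lines.
gen_e13b_lines 5 42
```

Real MICR lines follow a fixed field layout, so a generator that mirrors the actual cheque format would likely give more representative training text than purely random fields.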
> What would you suggest from here to increase accuracy?
>
>    - Increase the number of lines in the training text
>    - Mix up more variations in the training text
>    - Increase the number of iterations
>    - Investigate wrong reads one by one
>    - Or else?
>
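For the "investigate wrong reads one by one" option, one way to start is to list every line where the OCR output differs from the expected text. This is a minimal sketch, assuming two plain-text files with one line per image in the same order; the function name list_mismatches and the file names are hypothetical.

```shell
#!/bin/sh
# List the lines where OCR output differs from the ground truth.
# $1 = ground-truth file, $2 = OCR output file (both hypothetical),
# one line per image, in the same order. Lines must not contain tabs.
list_mismatches() {
  paste -d '\t' "$1" "$2" | awk -F '\t' '
    # Appending "" forces a string comparison, so that all-digit lines
    # such as 0010 and 10 are not treated as equal numbers.
    ($1 "") != ($2 "") {
      printf "line %d\n  expected: %s\n  got:      %s\n", NR, $1, $2
      bad++
    }
    END { printf "%d of %d lines differ\n", bad + 0, NR }'
}

# Small demonstration with throwaway files.
tmp=$(mktemp -d)
printf '%s\n' '<01 :1901=' ';0000090134;'  > "$tmp/gt.txt"
printf '%s\n' '<01 :1901=' '<;0000090134;' > "$tmp/ocr.txt"
list_mismatches "$tmp/gt.txt" "$tmp/ocr.txt"
rm -r "$tmp"
```

Grouping the mismatches by kind (inserted symbol, dropped digit, swapped symbol) may show whether more training text or more variation is the better next step.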
> Also, I referred to engrestrict*.* and could generate a similar result with
> the fine-tuning-from-full method.  It seems a bit faster to reach the same
> level, but it also stops at a 'good' level.  I'm happy to go either way if
> it gets me to better accuracy.
>
> Regards,
> ElMagoElGato
>
> On Thursday, May 30, 2019 at 15:56:02 UTC+9, ElGato ElMago wrote:
>>
>> Thanks a lot, Shree. I'll look into it.
>>
>> On Thursday, May 30, 2019 at 14:39:52 UTC+9, shree wrote:
>>>
>>> See https://github.com/Shreeshrii/tessdata_shreetest
>>>
>>> Look at the files engrestrict*.* and also
>>> https://github.com/Shreeshrii/tessdata_shreetest/blob/master/eng.digits.training_text
>>>
>>> Create training text of about 100 lines and finetune for 400 iterations
>>>
>>>
>>>
>>> On Thu, May 30, 2019 at 9:38 AM ElGato ElMago <elmago...@gmail.com>
>>> wrote:
>>>
>>>> I had about 14 lines, as attached.  How many lines would you recommend?
>>>>
>>>> Fine-tuning gives a much better result, but it tends to pick characters
>>>> outside E13B, which has only 14 characters: 0 through 9 and 4 symbols.
>>>> I thought training from scratch would eliminate such confusion.
>>>>
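The out-of-set reads mentioned above can be caught mechanically after recognition. This is a minimal sketch, assuming the four E13B symbols are represented by the ASCII stand-ins `<`, `:`, `=` and `;` as in the training text in this thread; the function name check_e13b_charset is hypothetical.

```shell
#!/bin/sh
# Flag OCR output lines that contain characters outside the assumed
# E13B set: digits 0-9, the stand-in symbols < : = ; and spaces.
check_e13b_charset() {
  # Reads text on stdin and prints offending lines with their line numbers.
  # Returns 0 when every line is clean, 1 when any stray character is found.
  if grep -n '[^0-9<:=; ]'; then
    return 1
  fi
  return 0
}

# A clean line passes; a line with a letter O in place of a zero is reported.
printf '%s\n' '<01 :1901=1386:021= ;0000090134;' | check_e13b_charset && echo "clean"
printf '%s\n' '<01 :19O1=1386:021=' | check_e13b_charset || echo "found stray characters"
```

After a run such as the tesseract command in this thread, the check could be applied as `check_e13b_charset < out.txt`. It only detects stray characters; it does not prevent the model from producing them.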
>>>> On Thursday, May 30, 2019 at 10:43:08 UTC+9, shree wrote:
>>>>>
>>>>> For training from scratch a large training text and hundreds of
>>>>> thousands of iterations are recommended.
>>>>>
>>>>> If you are just fine-tuning for a font, try following the instructions
>>>>> for fine-tuning for Impact, substituting your font.
>>>>>
>>>>>
>>>>> On Thu, 30 May 2019, 06:05 ElGato ElMago, <elmago...@gmail.com> wrote:
>>>>>
>>>>>> Thanks, Shree.
>>>>>>
>>>>>> Yes, I saw the instructions.  The steps I took are as follows:
>>>>>>
>>>>>> Using tesstrain.sh:
>>>>>> src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang eng \
>>>>>>   --linedata_only \
>>>>>>   --noextract_font_properties --langdata_dir ../langdata \
>>>>>>   --tessdata_dir ./tessdata \
>>>>>>   --fontlist "E13Bnsd" --output_dir ~/tesstutorial/e13beval \
>>>>>>   --training_text ../langdata/eng/eng.training_e13b_text
>>>>>>
>>>>>> Training from scratch:
>>>>>> mkdir -p ~/tesstutorial/e13boutput
>>>>>> src/training/lstmtraining --debug_interval 100 \
>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>   --net_spec '[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111]' \
>>>>>>   --model_output ~/tesstutorial/e13boutput/base --learning_rate 20e-4 \
>>>>>>   --train_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt \
>>>>>>   --max_iterations 5000 &>~/tesstutorial/e13boutput/basetrain.log
>>>>>>
>>>>>> Test with base_checkpoint:
>>>>>> src/training/lstmeval --model ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>   --eval_listfile ~/tesstutorial/e13beval/eng.training_files.txt
>>>>>>
>>>>>> Combining output files:
>>>>>> src/training/lstmtraining --stop_training \
>>>>>>   --continue_from ~/tesstutorial/e13boutput/base_checkpoint \
>>>>>>   --traineddata ~/tesstutorial/e13beval/eng/eng.traineddata \
>>>>>>   --model_output ~/tesstutorial/e13boutput/eng.traineddata
>>>>>>
>>>>>> Test with eng.traineddata:
>>>>>> tesseract e13b.png out --tessdata-dir /home/koichi/tesstutorial/e13boutput
>>>>>>
>>>>>>
>>>>>> The training from scratch ended as:
>>>>>>
>>>>>> At iteration 561/2500/2500, Mean rms=0.159%, delta=0%, char train=0%,
>>>>>> word train=0%, skip ratio=0%,  New best char error = 0 wrote best
>>>>>> model:/home/koichi/tesstutorial/e13boutput/base0_561.checkpoint wrote
>>>>>> checkpoint.
>>>>>>
>>>>>>
>>>>>> The test with base_checkpoint returns only the following:
>>>>>>
>>>>>> At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
>>>>>>
>>>>>>
>>>>>> The test with eng.traineddata and e13b.png returns out.txt.  Both
>>>>>> files are attached.
>>>>>>
>>>>>> Training seems to have worked fine.  I don't know how to interpret
>>>>>> the test result from base_checkpoint.  The generated eng.traineddata
>>>>>> obviously doesn't work well.  I suspect the choice of --traineddata when
>>>>>> combining the output files is wrong, but I have no clue.
>>>>>>
>>>>>> Regards,
>>>>>> ElMagoElGato
>>>>>>
>>>>>> BTW, I referred to your tess4training in the process.  It helped a
>>>>>> lot.
>>>>>>
>>>>>> On Wednesday, May 29, 2019 at 19:14:08 UTC+9, shree wrote:
>>>>>>>
>>>>>>> see
>>>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#combining-the-output-files
>>>>>>>
>>>>>>> On Wed, May 29, 2019 at 3:18 PM ElGato ElMago <elmago...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I want to create traineddata for the E13B font.
>>>>>>>>
>>>>>>>> I read the training tutorial and made a base_checkpoint file
>>>>>>>> following the method in Training From Scratch.  Now, how can I make
>>>>>>>> a traineddata file from the base_checkpoint file?
>>>>>>>>
>>>>>>>> --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>> send an email to tesser...@googlegroups.com.
>>>>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>>> To view this discussion on the web visit
>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/4848cfa5-ae2b-4be3-a771-686aa0fec702%40googlegroups.com
>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/4848cfa5-ae2b-4be3-a771-686aa0fec702%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>> .
>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> ____________________________________________________________
>>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>>
>>>
>>>

