Re: [tesseract-ocr] train more fonts on trained model fas in tesseract

ShreeDevi Kumar Tue, 15 May 2018 02:17:25 -0700

Please use the latest windows binaries from
https://github.com/UB-Mannheim/tesseract/wiki provided by @stweil


How do you run the bash script on Windows 10?

@stweil I have not tried training on Windows. Do you have feedback from
others who have tried it?
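
For anyone comparing setups, here is a small sketch (not taken from
finetune.sh) that reports which bash environment a script is running under
and the Tesseract version. `classify_shell` is a hypothetical helper name.
The last lines only illustrate how bash drops unquoted backslashes, which
may or may not be what produced `--fonts_dir=C:WindowsFonts` in the log
quoted below.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of finetune.sh): map `uname -s` output to
# the kind of bash environment the script is running under.
classify_shell() {
  case "$1" in
    MINGW*|MSYS*) echo "msys" ;;    # Git Bash or MSYS2
    CYGWIN*)      echo "cygwin" ;;
    Linux*)       echo "linux" ;;   # includes WSL
    *)            echo "unknown" ;;
  esac
}

echo "shell environment: $(classify_shell "$(uname -s)")"

# Report the tesseract version if the binary is on PATH.
if command -v tesseract >/dev/null 2>&1; then
  tesseract --version 2>&1 | head -n 1
else
  echo "tesseract not found on PATH"
fi

# Unquoted backslashes are consumed by bash; single quotes preserve them.
unquoted=C:\Windows\Fonts
quoted='C:\Windows\Fonts'
printf '%s\n' "unquoted: $unquoted"
printf '%s\n' "quoted:   $quoted"
```

If the unquoted form turns out to be the issue, quoting the path or using
forward slashes (C:/Windows/Fonts) in the script would be the usual fix.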

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Tue, May 15, 2018 at 2:41 PM, reza <reza6...@gmail.com> wrote:

> Windows 10
> Tesseract 4 alpha
>
>
> On Tuesday, May 15, 2018 at 1:12:20 PM UTC+4:30, shree wrote:
>>
>> What o/s are you running it on?
>>
>> Which version of tesseract?
>>
>> > ICU ERROR: U_FILE_ACCESS_ERROR
>> > ERROR: /tmp/tmp.6m4B2TUln1/eng/eng.unicharset does not exist or is not readable
>>
>> Which version of the ICU library?
>>
>> ShreeDevi
>>
>> On Tue, May 15, 2018 at 1:00 PM, reza <reza...@gmail.com> wrote:
>>
>>> I used the attached finetune.sh file, but it raised an error. Could you
>>> help me?
>>>
>>> thanks
>>>
>>>
>>>> ###### MAKING TRAINING DATA ######
>>>>
>>>>
>>>>> === Starting training for language 'eng'
>>>>
>>>> [Tue, May 15, 2018 11:42:36 AM] /c/Program Files
>>>>> (x86)/Tesseract-OCR/text2image --fonts_dir=C:WindowsFonts --font=Arial
>>>>> --outputbase=/tmp/font_tmp.CpgpM0lbxD/sample_text.txt
>>>>> --text=/tmp/font_tmp.CpgpM0lbxD/sample_text.txt
>>>>> --fontconfig_tmpdir=/tmp/font_tmp.CpgpM0lbxD
>>>>
>>>> Rendered page 0 to file C:/Users/asus/AppData/Local/Temp/font_tmp.CpgpM0lbxD/sample_text.txt.tif
>>>>
>>>>
>>>>> === Phase I: Generating training images ===
>>>>
>>>> Rendering using Arial
>>>>
>>>> Rendering using Corbel
>>>>
>>>> [Tue, May 15, 2018 11:42:37 AM] /c/Program Files
>>>>> (x86)/Tesseract-OCR/text2image 
>>>>> --fontconfig_tmpdir=/tmp/font_tmp.CpgpM0lbxD
>>>>> --fonts_dir=C:WindowsFonts --strip_unrenderable_words --leading=32
>>>>> --char_spacing=0.0 --exposure=0 
>>>>> --outputbase=/tmp/tmp.6m4B2TUln1/eng/eng.Arial.exp0
>>>>> --max_pages=3 --font=Arial --text=./langdata/eng/eng.training_text
>>>>
>>>> [Tue, May 15, 2018 11:42:37 AM] /c/Program Files
>>>>> (x86)/Tesseract-OCR/text2image 
>>>>> --fontconfig_tmpdir=/tmp/font_tmp.CpgpM0lbxD
>>>>> --fonts_dir=C:WindowsFonts --strip_unrenderable_words --leading=32
>>>>> --char_spacing=0.0 --exposure=0 
>>>>> --outputbase=/tmp/tmp.6m4B2TUln1/eng/eng.Corbel.exp0
>>>>> --max_pages=3 --font=Corbel --text=./langdata/eng/eng.training_text
>>>>
>>>> Stripped 2 unrenderable words
>>>>
>>>> Rendered page 0 to file C:/Users/asus/AppData/Local/Temp/tmp.6m4B2TUln1/eng/eng.Arial.exp0.tif
>>>>
>>>> Stripped 1 unrenderable words
>>>>
>>>> Rendered page 1 to file C:/Users/asus/AppData/Local/Temp/tmp.6m4B2TUln1/eng/eng.Arial.exp0.tif
>>>>
>>>> Stripped 2 unrenderable words
>>>>
>>>> Rendered page 0 to file C:/Users/asus/AppData/Local/Temp/tmp.6m4B2TUln1/eng/eng.Corbel.exp0.tif
>>>>
>>>> Stripped 1 unrenderable words
>>>>
>>>> Rendered page 1 to file C:/Users/asus/AppData/Local/Temp/tmp.6m4B2TUln1/eng/eng.Corbel.exp0.tif
>>>>
>>>>
>>>>> === Phase UP: Generating unicharset and unichar properties files ===
>>>>
>>>> [Tue, May 15, 2018 11:42:39 AM] /c/Program Files
>>>>> (x86)/Tesseract-OCR/unicharset_extractor --output_unicharset
>>>>> /tmp/tmp.6m4B2TUln1/eng/eng.unicharset --norm_mode 1
>>>>> /tmp/tmp.6m4B2TUln1/eng/eng.Arial.exp0.box
>>>>> /tmp/tmp.6m4B2TUln1/eng/eng.Corbel.exp0.box
>>>>
>>>> Extracting unicharset from box file C:/Users/asus/AppData/Local/Temp/tmp.6m4B2TUln1/eng/eng.Arial.exp0.box
>>>>
>>>> Extracting unicharset from box file C:/Users/asus/AppData/Local/Temp/tmp.6m4B2TUln1/eng/eng.Corbel.exp0.box
>>>>
>>>> ICU ERROR: U_FILE_ACCESS_ERROR
>>>> ERROR: /tmp/tmp.6m4B2TUln1/eng/eng.unicharset does not exist or is not readable
>>>>
>>>> ###### MAKING EVAL DATA ######
>>>>
>>>>
>>>>> === Starting training for language 'eng'
>>>>
>>>> [Tue, May 15, 2018 11:42:40 AM] /c/Program Files
>>>>> (x86)/Tesseract-OCR/text2image --fonts_dir=C:WindowsFonts --font=Calibri
>>>>> --outputbase=/tmp/font_tmp.n0qq4iJk4q/sample_text.txt
>>>>> --text=/tmp/font_tmp.n0qq4iJk4q/sample_text.txt
>>>>> --fontconfig_tmpdir=/tmp/font_tmp.n0qq4iJk4q
>>>>
>>>> Rendered page 0 to file C:/Users/asus/AppData/Local/Temp/font_tmp.n0qq4iJk4q/sample_text.txt.tif
>>>>
>>>>
>>>>> === Phase I: Generating training images ===
>>>>
>>>> Rendering using Calibri
>>>>
>>>> [Tue, May 15, 2018 11:42:40 AM] /c/Program Files
>>>>> (x86)/Tesseract-OCR/text2image 
>>>>> --fontconfig_tmpdir=/tmp/font_tmp.n0qq4iJk4q
>>>>> --fonts_dir=C:WindowsFonts --strip_unrenderable_words --leading=32
>>>>> --char_spacing=0.0 --exposure=0 
>>>>> --outputbase=/tmp/tmp.h0l64TAxEq/eng/eng.Calibri.exp0
>>>>> --max_pages=3 --font=Calibri --text=./langdata/eng/eng.training_text
>>>>
>>>> Stripped 2 unrenderable words
>>>>
>>>> Rendered page 0 to file C:/Users/asus/AppData/Local/Temp/tmp.h0l64TAxEq/eng/eng.Calibri.exp0.tif
>>>>
>>>> Stripped 1 unrenderable words
>>>>
>>>> Rendered page 1 to file C:/Users/asus/AppData/Local/Temp/tmp.h0l64TAxEq/eng/eng.Calibri.exp0.tif
>>>>
>>>>
>>>>> === Phase UP: Generating unicharset and unichar properties files ===
>>>>
>>>> [Tue, May 15, 2018 11:42:42 AM] /c/Program Files
>>>>> (x86)/Tesseract-OCR/unicharset_extractor --output_unicharset
>>>>> /tmp/tmp.h0l64TAxEq/eng/eng.unicharset --norm_mode 1
>>>>> /tmp/tmp.h0l64TAxEq/eng/eng.Calibri.exp0.box
>>>>
>>>> Extracting unicharset from box file C:/Users/asus/AppData/Local/Temp/tmp.h0l64TAxEq/eng/eng.Calibri.exp0.box
>>>>
>>>> ICU ERROR: U_FILE_ACCESS_ERROR
>>>> ERROR: /tmp/tmp.h0l64TAxEq/eng/eng.unicharset does not exist or is not readable
>>>>
>>>> #### combine_tessdata to extract lstm model from previous trained set ####
>>>>
>>>> Extracting tessdata components from ./tessdata_best/eng.traineddata
>>>>
>>>> Wrote ./trained_plus_chars/eng.lstm
>>>>
>>>> Version string:4.00.00alpha:eng:synth20170629
>>>>
>>>> 17:lstm:size=401636, offset=192
>>>>
>>>> 18:lstm-punc-dawg:size=4322, offset=401828
>>>>
>>>> 19:lstm-word-dawg:size=3694794, offset=406150
>>>>
>>>> 20:lstm-number-dawg:size=4738, offset=4100944
>>>>
>>>> 21:lstm-unicharset:size=6360, offset=4105682
>>>>
>>>> 22:lstm-recoder:size=1012, offset=4112042
>>>>
>>>> 23:version:size=30, offset=4113054
>>>>
>>>> #### training from previous optimum  #####
>>>>
>>>> finetune.sh: line 119: 11664 Segmentation fault      lstmtraining
>>>>> --model_output $train_output_dir/pluschars --continue_from
>>>>> $train_output_dir/$Lang.lstm --old_traineddata
>>>>> $tessdata_dir/$Lang.traineddata --traineddata
>>>>> $train_output_dir/$Lang/$Lang.traineddata --max_iterations
>>>>> $MaxIterations --debug_interval -1 --eval_listfile
>>>>> $eval_output_dir/$Lang.training_files.txt --train_listfile
>>>>> $train_output_dir/$Lang.training_files.txt
>>>>
>>>> #### Building final trained file ./trained_plus_chars/eng_NEW.traineddata
>>>>> d####
>>>>
>>>> finetune.sh: line 130: 11320 Segmentation fault      lstmtraining
>>>>> --stop_training --continue_from $train_output_dir/pluschars_checkpoint
>>>>> --traineddata $train_output_dir/$Lang/$Lang.traineddata
>>>>> --model_output $final_trained_data_file
>>>>
>>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>> To post to this group, send email to tesser...@googlegroups.com.
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/7c46c196-e08d-4541-9f3b-b8a768792c9a%40googlegroups.com.
>>>
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>

