Re: [tesseract-ocr] I tried to train a traineddata file myself, but encountered an [Error]

鹿青年 Thu, 05 Dec 2024 23:37:21 -0800

Thanks for the reply.
Yes, I also use jTessBoxEditor, but it is more of a tool for standardizing 
the data. Some of the font files have incomplete glyph sets. I want to use 
LSTM training to build a complete traineddata file of my own.
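
On the error itself: the quoted command passes --oem 2, and, as Zdenko's 
reply explains, the file I trained has no legacy model. The sketch below is 
only an illustration in Python of which engine modes need legacy components. 
The mode descriptions are the ones printed by `tesseract --help-oem`; the 
helper names `needs_legacy` and `build_command` are made up for this example 
and are not part of Tesseract.

```python
# Engine modes as listed by `tesseract --help-oem`.
OEM_MODES = {
    0: "Legacy engine only",
    1: "Neural nets LSTM engine only",
    2: "Legacy + LSTM engines",
    3: "Default, based on what is available",
}


def needs_legacy(oem: int) -> bool:
    """Return True if this mode explicitly asks for the legacy engine.

    Hypothetical helper: modes 0 and 2 name the legacy engine, so the
    traineddata file must contain a legacy model for them to load.
    """
    if oem not in OEM_MODES:
        raise ValueError(f"unknown --oem value: {oem}")
    return oem in (0, 2)


def build_command(image: str, lang: str, oem: int, psm: int = 6) -> list[str]:
    """Build (but do not run) a tesseract command line as an argument list."""
    return ["tesseract", image, "stdout", "-l", lang,
            "--psm", str(psm), "--oem", str(oem)]


if __name__ == "__main__":
    for mode, description in OEM_MODES.items():
        print(mode, description, "-> needs legacy model:", needs_legacy(mode))
    # File names taken from the quoted command; only the --oem value differs.
    print(" ".join(build_command("0791.tif", "my_chi_sim", 1)))
```

With the file names from the quoted command, the last line prints 
`tesseract 0791.tif stdout -l my_chi_sim --psm 6 --oem 1`, that is, the same 
call with the LSTM-only mode.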
On Friday, December 6, 2024 at 15:15:40 UTC+8, <mahmoud...@gmail.com> wrote:


> I think using jTessBoxEditor is good for the training process
>
> On Friday, December 6, 2024 at 11:07 AM, Zdenko Podobny <zde...@gmail.com> wrote:
>
>>
>> Error: Tesseract (legacy) engine requested, but components are not 
>> present in /usr/local/share/tessdata/my_chi_sim.traineddata!!
>>
>>
>> The message is clear. YOU explicitly require tesseract to use the legacy 
>> engine, but YOUR language data file (the one you created by training) does 
>> not contain a legacy model.
>>
>> Zdenko
>>
>>
>> On Fri, Dec 6, 2024 at 7:11, 鹿青年 <luqingn...@gmail.com> wrote:
>>
>>> Hello, I tried to train a traineddata file myself, but an [Error] 
>>> occurred during use. Could you please give me some guidance on how to 
>>> resolve this error? Thank you very much.
>>> Perform OCR
>>> ```
>>> tesseract 0791.tif stdout -l my_chi_sim --psm 6 --oem 2
>>> ```
>>> The error content is:
>>> ```
>>> Error: Tesseract (legacy) engine requested, but components are not 
>>> present in /usr/local/share/tessdata/my_chi_sim.traineddata!!
>>> Failed loading language 'my_chi_sim'
>>> Tesseract couldn't load any languages!
>>> Could not initialize tesseract.
>>> ```
>>>
>>> My training steps are as follows:
>>>
>>> Punctuation Dictionary:
>>> dawg2wordlist d:\tesseract\tessdata_best\chi_sim.lstm-unicharset 
>>> d:\tesseract\tessdata_best\chi_sim.lstm-punc-dawg 
>>> d:\tesseract\tessdata_best\punc.txt
>>>
>>>
>>> Let’s start with the key steps
>>> 2. Generate character set lstm-unicharset file
>>> 1. Generate character set txt file
>>>
>>> text2image --text d:\tesseract\chi_sim.txt --outputbase 
>>> d:\tesseract\chi_sim --fonts_dir C:\Windows\Fonts --font="simhei" 
>>> --fontconfig_tmpdir d:\tesseract\tmp
>>>
>>>
>>> 3. Generate character set lstm-unicharset file
>>>
>>> 1) Generate with box file
>>> unicharset_extractor --norm_mode 3 --output_unicharset 
>>> d:\tesseract\chi_sim.lstm-unicharset d:\tesseract\chi_sim.box
>>>
>>> 2) Generate with txt file
>>> unicharset_extractor --norm_mode 3 --output_unicharset 
>>> d:\tesseract\chi_sim.lstm-unicharset d:\tesseract\chi_sim.txt
>>>
>>>
>>> 3. Generate starter traineddata file
>>> 1. Generate dictionary text file
>>> Refer to the 3 dictionary files in the d:\tesseract\tessdata_best folder 
>>> (word = words, number = numbers, punc = punctuation marks)
>>> 2. Generate starter traineddata file
>>> combine_lang_model --input_unicharset 
>>> d:\tesseract\chi_sim.lstm-unicharset --lang chi_sim --script_dir 
>>> d:\tesseract\langdata_lstm --output_dir d:\tesseract --version_str 
>>> "CSDN:watt:2022.04[1,48,0,1C3,3Ft16Mp3,3TxyLfys64Lfx96RxLrx96Lfx512O1c4000]"
>>> --words d:\tesseract\word.txt --numbers d:\tesseract\number.txt --puncs 
>>> d:\tesseract\punc.txt --pass_through_recoder
>>>
>>>
>>> 3. View the newly generated starter trained data information
>>> combine_tessdata -d d:\tesseract\chi_sim\chi_sim.traineddata
>>>
>>> 4. Generate training files
>>> 1. Generate the training text file train.txt
>>>
>>> 2. Generate picture+box file
>>>
>>> text2image --text d:\tesseract\train.txt --outputbase d:\tesseract\train 
>>> --fonts_dir C:\Windows\Fonts --font="simhei" --ptsize 18 
>>> --fontconfig_tmpdir d:\tesseract\tmp
>>> 3. Generate training files:
>>> tesseract d:\tesseract\train.tif d:\tesseract\train -l chi_sim --psm 6 
>>> lstm.train
>>>
>>> 4. Create a new training list file
>>> Create a new d:\tesseract\train_listfile.txt file with the content 
>>> d:\tesseract\train.lstmf
>>> 5. Training
>>>
>>> 2. Start training:
>>> lstmtraining --traineddata d:\tesseract\chi_sim\chi_sim.traineddata 
>>> --net_spec "[1,48,0,1Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c4000]" 
>>> --model_output d:\tesseract\output\output --train_listfile 
>>> d:\tesseract\train_listfile.txt --max_iterations 0 --target_error_rate 0.01 
>>> --debug_interval -1
>>>
>>> 6. Evaluate the generated checkpoint file
>>> 1. Generate evaluation text eval.txt
>>> Write some evaluation text and save it to d:\tesseract\eval.txt. It 
>>> should cover the character set as comprehensively as possible and be 
>>> reasonably complex.
>>> 2. Generate picture+box file
>>> text2image --text d:\tesseract\eval.txt --outputbase d:\tesseract\eval 
>>> --fonts_dir C:\Windows\Fonts --font="simhei" --ptsize 18 
>>> --fontconfig_tmpdir d:\tesseract\tmp
>>> 3. Generate evaluation lstmf file
>>> tesseract d:\tesseract\eval.tif d:\tesseract\eval -l chi_sim --psm 6 
>>> lstm.train
>>> 4. Generate evaluation list file
>>> Create a new d:\tesseract\eval_listfile.txt file with the content 
>>> d:\tesseract\eval.lstmf
>>> 5. Start evaluating:
>>> lstmeval --model d:\tesseract\output\output_checkpoint --traineddata 
>>> d:\tesseract\chi_sim\chi_sim.traineddata --eval_listfile 
>>> d:\tesseract\eval_listfile.txt
>>> 7. Generate standard trained data
>>> 1. Generate a floating point (decimal) traineddata file (similar to 
>>> tessdata_best)
>>> lstmtraining --stop_training --continue_from 
>>> d:\tesseract\output\output_checkpoint --traineddata 
>>> d:\tesseract\chi_sim\chi_sim.traineddata --model_output 
>>> d:\tesseract\output\chi_sim.traineddata
>>> 2. Generate an integer traineddata file (similar to tessdata_fast)
>>> lstmtraining --stop_training --convert_to_int --continue_from 
>>> d:\tesseract\output\output_checkpoint --traineddata 
>>> d:\tesseract\chi_sim\chi_sim.traineddata --model_output 
>>> d:\tesseract\output\chi_sim.traineddata
>>>
>>> 3. View the generated traineddata information
>>> combine_tessdata -d d:\tesseract\output\chi_sim.traineddata
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to tesseract-oc...@googlegroups.com.
>>> To view this discussion visit 
>>> https://groups.google.com/d/msgid/tesseract-ocr/4f54b4ff-f1f4-4e44-9e49-11a70b759d68n%40googlegroups.com
>>>  
>>> <https://groups.google.com/d/msgid/tesseract-ocr/4f54b4ff-f1f4-4e44-9e49-11a70b759d68n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>

