Re: [tesseract-ocr] I tried to train a traineddata file myself, but encountered an [Error]

Zdenko Podobny Fri, 06 Dec 2024 02:55:12 -0800

Your question indicates that you have no clue what you are doing with
Tesseract or with training.
First, you need to invest time in learning Tesseract and reading the
documentation.
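
For reference, here is a minimal sketch of how one might check which engines
a traineddata file supports before choosing --oem. The function name pick_oem
is hypothetical. The sketch assumes that "combine_tessdata -d" lists one
component per line as index:name:size=..., with the legacy classifier stored
under the name inttemp and the LSTM network under the name lstm. The --oem
values are the ones printed by "tesseract --help-oem": 0 legacy only, 1 LSTM
only, 2 legacy + LSTM, 3 default.

```shell
#!/bin/sh
# pick_oem (hypothetical helper): read a `combine_tessdata -d` component
# listing on stdin and print an --oem value that the file can satisfy.
# Assumption: components appear as "<index>:<name>:size=...", the legacy
# classifier is named "inttemp" and the LSTM network is named "lstm".
pick_oem() {
    listing=$(cat)
    has_legacy=0
    has_lstm=0
    # Match the exact component name, so "lstm-unicharset" does not count
    # as the LSTM network itself.
    if printf '%s\n' "$listing" | grep -Eq '(^|:)inttemp:'; then
        has_legacy=1
    fi
    if printf '%s\n' "$listing" | grep -Eq '(^|:)lstm:'; then
        has_lstm=1
    fi

    if [ "$has_legacy" -eq 1 ] && [ "$has_lstm" -eq 1 ]; then
        echo 2      # both engines present: legacy + LSTM is possible
    elif [ "$has_lstm" -eq 1 ]; then
        echo 1      # LSTM only
    elif [ "$has_legacy" -eq 1 ]; then
        echo 0      # legacy only
    else
        echo "no recognizer component found" >&2
        return 1
    fi
}

# Possible usage (2>&1 because the listing may be written to stderr):
#   combine_tessdata -d /usr/local/share/tessdata/my_chi_sim.traineddata 2>&1 | pick_oem
#   tesseract 0791.tif stdout -l my_chi_sim --psm 6 --oem 1
```

A file produced by lstmtraining alone should list only lstm* components, so
the sketch would print 1. In that case, replacing --oem 2 with --oem 1 (or
leaving --oem unset) should avoid the error quoted below.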


Zdenko


On Fri, 6 Dec 2024 at 8:17, 鹿青年 <luqingnian1...@gmail.com> wrote:

> Thank you for your reply. How should I proceed to merge the old engine
> into my trained model?
> Or, are there any parameters that can specify that the OCR operation
> should not use the old engine?
>
> On Friday, 6 December 2024 at 15:07:14 UTC+8, <zdenop> wrote:
>
>>
>> Error: Tesseract (legacy) engine requested, but components are not
>> present in /usr/local/share/tessdata/my_chi_sim.traineddata!!
>>
>>
>> The message is clear. YOU explicitly required Tesseract to use the legacy
>> engine, but YOUR language data file (the one you created by training) does
>> not contain a legacy model.
>>
>> Zdenko
>>
>>
>> On Fri, 6 Dec 2024 at 7:11, 鹿青年 <luqingn...@gmail.com> wrote:
>>
>>> Hello, I tried to train a traineddata file myself, but an [Error]
>>> occurred during use. Could you please give me some guidance on how to
>>> resolve this error? Thank you very much.
>>> Perform OCR
>>> ···
>>> tesseract 0791.tif stdout -l my_chi_sim --psm 6 --oem 2
>>> ···
>>> The error content is:
>>> ····
>>> Error: Tesseract (legacy) engine requested, but components are not
>>> present in /usr/local/share/tessdata/my_chi_sim.traineddata!!
>>> Failed loading language 'my_chi_sim'
>>> Tesseract couldn't load any languages!
>>> Could not initialize tesseract.
>>> ····
>>>
>>> My training steps are as follows:
>>>
>>> Punctuation Dictionary:
>>> dawg2wordlist d:\tesseract\tessdata_best\chi_sim.lstm-unicharset
>>> d:\tesseract\tessdata_best\chi_sim.lstm-punc-dawg
>>> d:\tesseract\tessdata_best\punc.txt
>>>
>>>
>>> Let’s start with the key steps
>>> 2. Generate character set lstm-unicharset file
>>> 1. Generate character set txt file
>>>
>>> text2image --text d:\tesseract\chi_sim.txt --outputbase
>>> d:\tesseract\chi_sim --fonts_dir C:\Windows\Fonts --font="simhei"
>>> --fontconfig_tmpdir d:\tesseract\tmp
>>>
>>>
>>> 2. Generate character set lstm-unicharset file
>>>
>>> 1) Generate with box file
>>> unicharset_extractor --norm_mode 3 --output_unicharset
>>> d:\tesseract\chi_sim.lstm-unicharset d:\tesseract\chi_sim.box
>>>
>>> 2) Generate with txt file
>>> unicharset_extractor --norm_mode 3 --output_unicharset
>>> d:\tesseract\chi_sim.lstm-unicharset d:\tesseract\chi_sim.txt
>>>
>>>
>>> 3. Generate starter traineddata file
>>> 1. Generate dictionary text file
>>> Refer to the 3 dictionary files in the d:\tesseract\tessdata_best folder
>>> (word text, number numbers, punc punctuation marks)
>>> 2. Generate starter traineddata file
>>> combine_lang_model --input_unicharset
>>> d:\tesseract\chi_sim.lstm-unicharset --lang chi_sim --script_dir
>>> d:\tesseract\langdata_lstm --output_dir d:\tesseract --version_str
>>> "CSDN:watt:2022.04[1,48,0,1C3,3Ft16Mp3,3TxyLfys64Lfx96RxLrx96Lfx512O1c4000]"
>>> --words d:\tesseract\word.txt --numbers d:\tesseract\number.txt --puncs
>>> d:\tesseract\punc.txt --pass_through_recoder
>>>
>>>
>>> 3. View the newly generated starter trained data information
>>> combine_tessdata -d d:\tesseract\chi_sim\chi_sim.traineddata
>>>
>>> 4. Generate training files
>>> 1. Generate the training text file train.txt
>>>
>>> 2. Generate picture+box file
>>>
>>> text2image --text d:\tesseract\train.txt --outputbase d:\tesseract\train
>>> --fonts_dir C:\Windows\Fonts --font="simhei" --ptsize 18
>>> --fontconfig_tmpdir d:\tesseract\tmp
>>> 3. Generate training files:
>>> tesseract d:\tesseract\train.tif d:\tesseract\train -l chi_sim --psm 6
>>> lstm.train
>>>
>>> 4. Create a new training list file
>>> Create a new d:\tesseract\train_listfile.txt file with the content
>>> d:\tesseract\train.lstmf
>>> 5. Training
>>>
>>> 2. Start training:
>>> lstmtraining --traineddata d:\tesseract\chi_sim\chi_sim.traineddata
>>> --net_spec "[1,48,0,1Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c4000]"
>>> --model_output d:\tesseract\output\output --train_listfile
>>> d:\tesseract\train_listfile.txt --max_iterations 0 --target_error_rate 0.01
>>> --debug_interval -1
>>>
>>> 6. Evaluate the generated checkpoint file
>>> 1. Generate evaluation text eval.txt
>>> Edit some evaluation text and save it to d:\tesseract\eval.txt, so as to
>>> cover it as comprehensively as possible and with a certain degree of
>>> complexity.
>>> 2. Generate picture+box file
>>> text2image --text d:\tesseract\eval.txt --outputbase d:\tesseract\eval
>>> --fonts_dir C:\Windows\Fonts --font="simhei" --ptsize 18
>>> --fontconfig_tmpdir d:\tesseract\tmp
>>> 3. Generate evaluation lstmf file
>>> tesseract d:\tesseract\eval.tif d:\tesseract\eval -l chi_sim --psm 6
>>> lstm.train
>>> 4. Generate evaluation list file
>>> Create a new d:\tesseract\eval_listfile.txt file with the content
>>> d:\tesseract\eval.lstmf
>>> 5. Start evaluating:
>>> lstmeval --model d:\tesseract\output\output_checkpoint --traineddata
>>> d:\tesseract\chi_sim\chi_sim.traineddata --eval_listfile
>>> d:\tesseract\eval_listfile.txt
>>> 7. Generate standard trained data
>>> 1. Generate a floating point (decimal) traineddata file (similar to
>>> tessdata_best)
>>> lstmtraining --stop_training --continue_from
>>> d:\tesseract\output\output_checkpoint --traineddata
>>> d:\tesseract\chi_sim\chi_sim.traineddata --model_output
>>> d:\tesseract\output\chi_sim.traineddata
>>> 2. Generate an integer traineddata file (similar to tessdata_fast)
>>> lstmtraining --stop_training --convert_to_int --continue_from
>>> d:\tesseract\output\output_checkpoint --traineddata
>>> d:\tesseract\chi_sim\chi_sim.traineddata --model_output
>>> d:\tesseract\output\chi_sim.traineddata
>>>
>>> 3. View the generated traineddata information
>>> combine_tessdata -d d:\tesseract\output\chi_sim.traineddata
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>> To view this discussion visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/4f54b4ff-f1f4-4e44-9e49-11a70b759d68n%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/4f54b4ff-f1f4-4e44-9e49-11a70b759d68n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
