I think jTessBoxEditor is a good tool for the training process.

On Friday, 6 December 2024 at 11:07 AM, Zdenko Podobny <zde...@gmail.com> wrote:

>
> Error: Tesseract (legacy) engine requested, but components are not present
> in /usr/local/share/tessdata/my_chi_sim.traineddata!!
>
>
> The message is clear. YOU explicitly require Tesseract to use the legacy
> engine (with --oem 2), but YOUR language data file (the one you created by
> training) does not contain a legacy model.
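
A minimal sketch of the point above, not part of the original reply: `--oem 2` asks for the legacy and LSTM engines together, so a traineddata file produced only by `lstmtraining` cannot satisfy it, while `--oem 1` (LSTM only) can. The helper names `pick_oem` and `build_ocr_command` are hypothetical, and treating the `inttemp` and `lstm` entries listed by `combine_tessdata -d` as the markers for each engine is an assumption.

```python
# OEM values accepted by `tesseract --oem` (see `tesseract --help-oem`).
OEM_LEGACY_ONLY = 0
OEM_LSTM_ONLY = 1
OEM_LEGACY_AND_LSTM = 2


def pick_oem(components):
    """Pick an --oem value that the given traineddata components can satisfy.

    `components` is a set of component names such as those listed by
    `combine_tessdata -d`. Assumption: "inttemp" marks a legacy model and
    "lstm" marks an LSTM model.
    """
    has_legacy = "inttemp" in components
    has_lstm = "lstm" in components
    if has_legacy and has_lstm:
        return OEM_LEGACY_AND_LSTM
    if has_lstm:
        return OEM_LSTM_ONLY
    if has_legacy:
        return OEM_LEGACY_ONLY
    raise ValueError("no recognizer components found in traineddata")


def build_ocr_command(image, lang, components, psm=6):
    """Build the tesseract argument list with a compatible --oem value."""
    return [
        "tesseract", image, "stdout",
        "-l", lang,
        "--psm", str(psm),
        "--oem", str(pick_oem(components)),
    ]


# Components typical of an LSTM-only model (illustrative, not read from a file).
lstm_only = {"lstm", "lstm-unicharset", "lstm-recoder", "lstm-punc-dawg", "version"}
command = build_ocr_command("0791.tif", "my_chi_sim", lstm_only)
print(" ".join(command))
# prints: tesseract 0791.tif stdout -l my_chi_sim --psm 6 --oem 1
```

In practice this amounts to rerunning the original command with `--oem 1` instead of `--oem 2`.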
>
> Zdenko
>
>
> On Fri, 6 Dec 2024 at 7:11, 鹿青年 <luqingnian1...@gmail.com> wrote:
>
>> Hello, I tried to train a traineddata file myself, but an [Error]
>> occurred during use. Could you please give me some guidance on how to
>> resolve this error? Thank you very much.
>> Perform OCR:
>> ```
>> tesseract 0791.tif stdout -l my_chi_sim --psm 6 --oem 2
>> ```
>> The error content is:
>> ```
>> Error: Tesseract (legacy) engine requested, but components are not
>> present in /usr/local/share/tessdata/my_chi_sim.traineddata!!
>> Failed loading language 'my_chi_sim'
>> Tesseract couldn't load any languages!
>> Could not initialize tesseract.
>> ```
>>
>> My training steps are as follows:
>>
>> Punctuation Dictionary:
>> dawg2wordlist d:\tesseract\tessdata_best\chi_sim.lstm-unicharset
>> d:\tesseract\tessdata_best\chi_sim.lstm-punc-dawg
>> d:\tesseract\tessdata_best\punc.txt
>>
>>
>> Let’s start with the key steps
>> 2. Generate character set lstm-unicharset file
>> 1. Generate character set txt file
>>
>> text2image --text d:\tesseract\chi_sim.txt --outputbase
>> d:\tesseract\chi_sim --fonts_dir C:\Windows\Fonts --font="simhei"
>> --fontconfig_tmpdir d:\tesseract\tmp
>>
>>
>> 3. Generate character set lstm-unicharset file
>>
>> 1) Generate with box file
>> unicharset_extractor --norm_mode 3 --output_unicharset
>> d:\tesseract\chi_sim.lstm-unicharset d:\tesseract\chi_sim.box
>>
>> 2) Generate with txt file
>> unicharset_extractor --norm_mode 3 --output_unicharset
>> d:\tesseract\chi_sim.lstm-unicharset d:\tesseract\chi_sim.txt
>>
>>
>> 3. Generate starter traineddata file
>> 1. Generate dictionary text file
>> Refer to the 3 dictionary files in the d:\tesseract\tessdata_best folder
>> (word text, number numbers, punc punctuation marks)
>> 2. Generate starter traineddata file
>> combine_lang_model --input_unicharset
>> d:\tesseract\chi_sim.lstm-unicharset --lang chi_sim --script_dir
>> d:\tesseract\langdata_lstm --output_dir d:\tesseract --version_str
>> "CSDN:watt:2022.04[1,48,0,1C3,3Ft16Mp3,3TxyLfys64Lfx96RxLrx96Lfx512O1c4000]"
>> --words d:\tesseract\word.txt --numbers d:\tesseract\number.txt --puncs
>> d:\tesseract\punc.txt --pass_through_recoder
>>
>>
>> 3. View the newly generated starter trained data information
>> combine_tessdata -d d:\tesseract\chi_sim\chi_sim.traineddata
>>
>> 4. Generate training files
>> 1. Generate the training text file train.txt
>>
>> 2. Generate picture+box file
>>
>> text2image --text d:\tesseract\train.txt --outputbase d:\tesseract\train
>> --fonts_dir C:\Windows\Fonts --font="simhei" --ptsize 18
>> --fontconfig_tmpdir d:\tesseract\tmp
>> 3. Generate training files:
>> tesseract d:\tesseract\train.tif d:\tesseract\train -l chi_sim --psm 6
>> lstm.train
>>
>> 4. Create a new training list file
>> Create a new d:\tesseract\train_listfile.txt file with the content
>> d:\tesseract\train.lstmf
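
The list file in step 4 is plain text with one `.lstmf` path per line. A small sketch of generating it instead of typing it by hand, assuming all `.lstmf` files sit in one directory; `collect_lstmf` and `write_listfile` are hypothetical helper names, and the demo runs in a temporary directory rather than in `d:\tesseract`.

```python
import tempfile
from pathlib import Path


def collect_lstmf(directory):
    """Return every .lstmf file directly inside `directory`, sorted by name."""
    return sorted(Path(directory).glob("*.lstmf"))


def write_listfile(listfile, lstmf_paths):
    """Write one .lstmf path per line and return the lines written."""
    lines = [str(path) for path in lstmf_paths]
    Path(listfile).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


# Demo in a throwaway directory so no real training data is touched.
with tempfile.TemporaryDirectory() as workdir:
    for name in ("train.lstmf", "extra.lstmf", "notes.txt"):
        (Path(workdir) / name).touch()
    listfile = Path(workdir) / "train_listfile.txt"
    written = write_listfile(listfile, collect_lstmf(workdir))
    demo_names = [Path(line).name for line in written]
    demo_line_count = len(listfile.read_text(encoding="utf-8").splitlines())

print(demo_names)
```

The resulting file can then be passed to `lstmtraining --train_listfile` as in the next step.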
>> 5. Training
>>
>> 2. Start training:
>> lstmtraining --traineddata d:\tesseract\chi_sim\chi_sim.traineddata
>> --net_spec "[1,48,0,1Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c4000]"
>> --model_output d:\tesseract\output\output --train_listfile
>> d:\tesseract\train_listfile.txt --max_iterations 0 --target_error_rate 0.01
>> --debug_interval -1
>>
>> 6. Evaluate the generated checkpoint file
>> 1. Generate evaluation text eval.txt
>> Edit some evaluation text and save it to d:\tesseract\eval.txt, so as to
>> cover it as comprehensively as possible and with a certain degree of
>> complexity.
>> 2. Generate picture+box file
>> text2image --text d:\tesseract\eval.txt --outputbase d:\tesseract\eval
>> --fonts_dir C:\Windows\Fonts --font="simhei" --ptsize 18
>> --fontconfig_tmpdir d:\tesseract\tmp
>> 3. Generate evaluation lstmf file
>> tesseract d:\tesseract\eval.tif d:\tesseract\eval -l chi_sim --psm 6
>> lstm.train
>> 4. Generate evaluation list file
>> Create a new d:\tesseract\eval_listfile.txt file with the content
>> d:\tesseract\eval.lstmf
>> 5. Start evaluating:
>> lstmeval --model d:\tesseract\output\output_checkpoint --traineddata
>> d:\tesseract\chi_sim\chi_sim.traineddata --eval_listfile
>> d:\tesseract\eval_listfile.txt
>> 7. Generate standard trained data
>> 1. Generate a floating point (decimal) traineddata file (similar to
>> tessdata_best)
>> lstmtraining --stop_training --continue_from
>> d:\tesseract\output\output_checkpoint --traineddata
>> d:\tesseract\chi_sim\chi_sim.traineddata --model_output
>> d:\tesseract\output\chi_sim.traineddata
>> 2. Generate an integer traineddata file (similar to tessdata_fast)
>> lstmtraining --stop_training --convert_to_int --continue_from
>> d:\tesseract\output\output_checkpoint --traineddata
>> d:\tesseract\chi_sim\chi_sim.traineddata --model_output
>> d:\tesseract\output\chi_sim.traineddata
>>
>> 3. View the generated traineddata information
>> combine_tessdata -d d:\tesseract\output\chi_sim.traineddata
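
One detail in step 7 worth noting: both export commands use the same `--model_output` path, so the integer model would replace the floating-point one. The sketch below builds the two `lstmtraining --stop_training` commands with separate output names. The function name `build_export_command` and the file names `chi_sim_float.traineddata` and `chi_sim_int.traineddata` are assumptions for illustration; the flags are the ones used in the quoted message.

```python
def build_export_command(checkpoint, starter, output, as_integer=False):
    """Build the lstmtraining argument list that turns a checkpoint into traineddata."""
    command = ["lstmtraining", "--stop_training"]
    if as_integer:
        # Integer conversion gives a smaller model, similar to tessdata_fast.
        command.append("--convert_to_int")
    command += [
        "--continue_from", checkpoint,
        "--traineddata", starter,
        "--model_output", output,
    ]
    return command


checkpoint = r"d:\tesseract\output\output_checkpoint"
starter = r"d:\tesseract\chi_sim\chi_sim.traineddata"

# Hypothetical output names, chosen so the two exports do not overwrite each other.
float_command = build_export_command(
    checkpoint, starter, r"d:\tesseract\output\chi_sim_float.traineddata")
int_command = build_export_command(
    checkpoint, starter, r"d:\tesseract\output\chi_sim_int.traineddata",
    as_integer=True)

for command in (float_command, int_command):
    print(" ".join(command))
```

Either file would still need to be copied into the tessdata directory under the language name passed to `-l`.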
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-ocr+unsubscr...@googlegroups.com.
>> To view this discussion visit
>> https://groups.google.com/d/msgid/tesseract-ocr/4f54b4ff-f1f4-4e44-9e49-11a70b759d68n%40googlegroups.com
>> <https://groups.google.com/d/msgid/tesseract-ocr/4f54b4ff-f1f4-4e44-9e49-11a70b759d68n%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
