Hello, I trained a traineddata file myself, but an error occurs when I use
it. Could you please give me some guidance on how to resolve it? Thank you
very much.
The OCR command I ran:
```
tesseract 0791.tif stdout -l my_chi_sim --psm 6 --oem 2
```
The error output is:
```
Error: Tesseract (legacy) engine requested, but components are not present in /usr/local/share/tessdata/my_chi_sim.traineddata!!
Failed loading language 'my_chi_sim'
Tesseract couldn't load any languages!
Could not initialize tesseract.
```
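For reference, `--oem 2` requests the legacy and LSTM engines together, while the training steps below appear to produce only LSTM components, which may be what the error is complaining about. A minimal sketch, assuming a POSIX shell and assuming the `combine_tessdata -d` listing prints one component per line in the form `17:lstm:size=..., offset=...`; `pick_oem` is a hypothetical helper, not part of Tesseract:

```shell
#!/bin/sh
# pick_oem (hypothetical helper): read a `combine_tessdata -d` listing on
# stdin and print a matching --oem value.
# Assumption: each component is listed as "<index>:<name>:size=..., offset=...".
pick_oem() {
  has_lstm=0
  has_legacy=0
  while IFS=: read -r _idx name _rest; do
    case "$name" in
      lstm) has_lstm=1 ;;        # LSTM network
      inttemp) has_legacy=1 ;;   # legacy classifier templates
    esac
  done
  if [ "$has_lstm" -eq 1 ] && [ "$has_legacy" -eq 1 ]; then
    echo 2    # legacy + LSTM
  elif [ "$has_lstm" -eq 1 ]; then
    echo 1    # LSTM only
  elif [ "$has_legacy" -eq 1 ]; then
    echo 0    # legacy only
  else
    echo "no recognizer components found" >&2
    return 1
  fi
}
```

If the listing shows `lstm` but no `inttemp`, rerunning the command above with `--oem 1` instead of `--oem 2` might avoid the error. The listing can be fed in with `combine_tessdata -d /usr/local/share/tessdata/my_chi_sim.traineddata 2>&1 | pick_oem` (the `2>&1` is there because the listing may go to stderr).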

My training steps are as follows:

Extract the punctuation dictionary:
dawg2wordlist d:\tesseract\tessdata_best\chi_sim.lstm-unicharset d:\tesseract\tessdata_best\chi_sim.lstm-punc-dawg d:\tesseract\tessdata_best\punc.txt
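The same extraction could be repeated for the word and number dictionaries. A minimal sketch, assuming a POSIX shell and assuming the model has already been unpacked so that the `lstm-word-dawg`, `lstm-number-dawg` and `lstm-punc-dawg` components sit next to `chi_sim.lstm-unicharset`; `dawg_commands` is a hypothetical helper that only prints the commands so they can be checked before running:

```shell
#!/bin/sh
# dawg_commands (hypothetical helper): print one dawg2wordlist command per
# dictionary kind. $1 is the unpacked model prefix, $2 the output directory.
# Nothing is executed; the commands are only printed for review.
dawg_commands() {
  base=$1
  out=$2
  for kind in word number punc; do
    printf 'dawg2wordlist %s.lstm-unicharset %s.lstm-%s-dawg %s/%s.txt\n' \
      "$base" "$base" "$kind" "$out" "$kind"
  done
}
```

Calling `dawg_commands tessdata_best/chi_sim .` prints three lines, one each for word.txt, number.txt and punc.txt.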


Let’s start with the key steps
2. Generate character set lstm-unicharset file
1. Generate character set txt file

2. Generate picture+box file
text2image --text d:\tesseract\chi_sim.txt --outputbase d:\tesseract\chi_sim --fonts_dir C:\Windows\Fonts --font="simhei" --fontconfig_tmpdir d:\tesseract\tmp


3. Generate character set lstm-unicharset file

1) Generate from the box file
unicharset_extractor --norm_mode 3 --output_unicharset d:\tesseract\chi_sim.lstm-unicharset d:\tesseract\chi_sim.box

2) Generate from the txt file
unicharset_extractor --norm_mode 3 --output_unicharset d:\tesseract\chi_sim.lstm-unicharset d:\tesseract\chi_sim.txt


3. Generate starter traineddata file
1. Generate the dictionary text files
Refer to the three dictionary files in the d:\tesseract\tessdata_best folder
(word for the word list, number for number patterns, punc for punctuation
patterns).
2. Generate the starter traineddata file
combine_lang_model --input_unicharset d:\tesseract\chi_sim.lstm-unicharset --lang chi_sim --script_dir d:\tesseract\langdata_lstm --output_dir d:\tesseract --version_str "CSDN:watt:2022.04[1,48,0,1C3,3Ft16Mp3,3TxyLfys64Lfx96RxLrx96Lfx512O1c4000]" --words d:\tesseract\word.txt --numbers d:\tesseract\number.txt --puncs d:\tesseract\punc.txt --pass_through_recoder


3. View the contents of the newly generated starter traineddata file
combine_tessdata -d d:\tesseract\chi_sim\chi_sim.traineddata

4. Generate training files
1. Generate the training text file train.txt

2. Generate picture+box file

text2image --text d:\tesseract\train.txt --outputbase d:\tesseract\train --fonts_dir C:\Windows\Fonts --font="simhei" --ptsize 18 --fontconfig_tmpdir d:\tesseract\tmp
3. Generate the training lstmf file:
tesseract d:\tesseract\train.tif d:\tesseract\train -l chi_sim --psm 6 lstm.train

4. Create a new training list file
Create a new d:\tesseract\train_listfile.txt file with the content d:\tesseract\train.lstmf
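
Step 4 can also be scripted. A minimal sketch, assuming a POSIX shell (for example Git Bash or WSL on Windows); `write_listfile` is a hypothetical helper that writes one .lstmf path per line:

```shell
#!/bin/sh
# write_listfile (hypothetical helper): collect every .lstmf file in a
# directory into a list file, one path per line.
# $1 is the directory to scan, $2 the list file to (re)create.
write_listfile() {
  dir=$1
  out=$2
  : > "$out"                       # start from an empty list file
  for f in "$dir"/*.lstmf; do
    [ -e "$f" ] || continue        # skip the unexpanded pattern if nothing matches
    printf '%s\n' "$f" >> "$out"
  done
}
```

For a single training file the result is the same one-line list described in step 4.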
5. Training

2. Start training:
lstmtraining --traineddata d:\tesseract\chi_sim\chi_sim.traineddata --net_spec "[1,48,0,1Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c4000]" --model_output d:\tesseract\output\output --train_listfile d:\tesseract\train_listfile.txt --max_iterations 0 --target_error_rate 0.01 --debug_interval -1

6. Evaluate the generated checkpoint file
1. Generate evaluation text eval.txt
Write some evaluation text and save it to d:\tesseract\eval.txt. It should
cover the character set as comprehensively as possible and be reasonably
complex.
2. Generate picture+box file
text2image --text d:\tesseract\eval.txt --outputbase d:\tesseract\eval --fonts_dir C:\Windows\Fonts --font="simhei" --ptsize 18 --fontconfig_tmpdir d:\tesseract\tmp
3. Generate evaluation lstmf file
tesseract d:\tesseract\eval.tif d:\tesseract\eval -l chi_sim --psm 6 lstm.train
4. Generate evaluation list file
Create a new d:\tesseract\eval_listfile.txt file with the content d:\tesseract\eval.lstmf
5. Start evaluating:
lstmeval --model d:\tesseract\output\output_checkpoint --traineddata d:\tesseract\chi_sim\chi_sim.traineddata --eval_listfile d:\tesseract\eval_listfile.txt
7. Generate the standard traineddata file
1. Generate a floating-point traineddata file (similar to tessdata_best)
lstmtraining --stop_training --continue_from d:\tesseract\output\output_checkpoint --traineddata d:\tesseract\chi_sim\chi_sim.traineddata --model_output d:\tesseract\output\chi_sim.traineddata
2. Generate an integer traineddata file (similar to tessdata_fast). Note that
this writes to the same path as the previous step, so it overwrites the
floating-point file.
lstmtraining --stop_training --convert_to_int --continue_from d:\tesseract\output\output_checkpoint --traineddata d:\tesseract\chi_sim\chi_sim.traineddata --model_output d:\tesseract\output\chi_sim.traineddata

3. View the generated traineddata information
combine_tessdata -d d:\tesseract\output\chi_sim.traineddata
