Thank you for your reply. How do I merge the legacy engine components into my trained model? Alternatively, is there a parameter that tells Tesseract not to use the legacy engine during OCR?
On Friday, December 6, 2024 at 15:07:14 UTC+8, zdenop wrote:
>
> Error: Tesseract (legacy) engine requested, but components are not present
> in /usr/local/share/tessdata/my_chi_sim.traineddata!!
>
> The message is clear. YOU require tesseract to use legacy engine
> explicitly but YOUR language datafile (you created by training) does not
> contain legacy model.
>
> Zdenko
>
> On Fri, Dec 6, 2024 at 7:11, 鹿青年 <luqingn...@gmail.com> wrote:
>
>> Hello, I tried to train a traineddata file myself, but an [Error]
>> occurred during use. Could you please give me some guidance on how to
>> resolve this error? Thank you very much.
>> Perform OCR
>> ···
>> tesseract 0791.tif stdout -l my_chi_sim --psm 6 --oem 2
>> ···
>> The error content is:
>> ····
>> Error: Tesseract (legacy) engine requested, but components are not present in /usr/local/share/tessdata/my_chi_sim.traineddata!!
>> Failed loading language 'my_chi_sim'
>> Tesseract couldn't load any languages!
>> Could not initialize tesseract.
>> ····
>>
>> My training steps are as follows:
>>
>> Punctuation Dictionary:
>> dawg2wordlist d:\tesseract\tessdata_best\chi_sim.lstm-unicharset d:\tesseract\tessdata_best\chi_sim.lstm-punc-dawg d:\tesseract\tessdata_best\punc.txt
>>
>> Let’s start with the key steps
>> 2. Generate character set lstm-unicharset file
>> 1. Generate character set txt file
>>
>> text2image --text d:\tesseract\chi_sim.txt --outputbase d:\tesseract\chi_sim --fonts_dir C:\Windows\Fonts --font="simhei" --fontconfig_tmpdir d:\tesseract\tmp
>>
>> 3. Generate character set lstm-unicharset file
>>
>> 1) Generate with box file
>> unicharset_extractor --norm_mode 3 --output_unicharset d:\tesseract\chi_sim.lstm-unicharset d:\tesseract\chi_sim.box
>>
>> 2) Generate with txt file
>> unicharset_extractor --norm_mode 3 --output_unicharset d:\tesseract\chi_sim.lstm-unicharset d:\tesseract\chi_sim.txt
>>
>> 3. Generate starter traineddata file
>> 1. Generate dictionary text file
>> Refer to the 3 dictionary files in the d:\tesseract\tessdata_best folder (word text, number numbers, punc punctuation marks)
>> 2. Generate starter traineddata file
>> combine_lang_model --input_unicharset d:\tesseract\chi_sim.lstm-unicharset --lang chi_sim --script_dir d:\tesseract\langdata_lstm --output_dir d:\tesseract --version_str "CSDN:watt:2022.04[1,48,0,1C3,3Ft16Mp3,3TxyLfys64Lfx96RxLrx96Lfx512O1c4000]" --words d:\tesseract\word.txt --numbers d:\tesseract\number.txt --puncs d:\tesseract\punc.txt --pass_through_recoder
>>
>> 3. View the newly generated starter trained data information
>> combine_tessdata -d d:\tesseract\chi_sim\chi_sim.traineddata
>>
>> 4. Generate training files
>> 1. Generate the training text file train.txt
>>
>> 2. Generate picture+box file
>>
>> text2image --text d:\tesseract\train.txt --outputbase d:\tesseract\train --fonts_dir C:\Windows\Fonts --font="simhei" --ptsize 18 --fontconfig_tmpdir d:\tesseract\tmp
>> 3. Generate training files:
>> tesseract d:\tesseract\train.tif d:\tesseract\train -l chi_sim --psm 6 lstm.train
>>
>> 4. Create a new training list file
>> Create a new d:\tesseract\train_listfile.txt file with the content d:\tesseract\train.lstmf
>> 5. Training
>>
>> 2. Start training:
>> lstmtraining --traineddata d:\tesseract\chi_sim\chi_sim.traineddata --net_spec "[1,48,0,1Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c4000]" --model_output d:\tesseract\output\output --train_listfile d:\tesseract\train_listfile.txt --max_iterations 0 --target_error_rate 0.01 --debug_interval -1
>>
>> 6. Evaluate the generated checkpoint file
>> 1. Generate evaluation text eval.txt
>> Edit some evaluation text and save it to d:\tesseract\eval.txt, so as to cover it as comprehensively as possible and with a certain degree of complexity.
>> 2. Generate picture+box file
>> text2image --text d:\tesseract\eval.txt --outputbase d:\tesseract\eval --fonts_dir C:\Windows\Fonts --font="simhei" --ptsize 18 --fontconfig_tmpdir d:\tesseract\tmp
>> 3. Generate evaluation lstmf file
>> tesseract d:\tesseract\eval.tif d:\tesseract\eval -l chi_sim --psm 6 lstm.train
>> 4. Generate evaluation list file
>> Create a new d:\tesseract\eval_listfile.txt file with the content d:\tesseract\eval.lstmf
>> 5. Start evaluating
>>
>> Start evaluating:
>> lstmeval --model d:\tesseract\output\output_checkpoint --traineddata d:\tesseract\chi_sim\chi_sim.traineddata --eval_listfile d:\tesseract\eval_listfile.txt
>> 7. Generate standard trained data
>> 1. Generate a floating point (decimal) traineddata file (similar to tessdata_best)
>> lstmtraining --stop_training --continue_from d:\tesseract\output\output_checkpoint --traineddata d:\tesseract\chi_sim\chi_sim.traineddata --model_output d:\tesseract\output\chi_sim.traineddata
>> 2. Generate an integer traineddata file (similar to tessdata_fast)
>> lstmtraining --stop_training --convert_to_int --continue_from d:\tesseract\output\output_checkpoint --traineddata d:\tesseract\chi_sim\chi_sim.traineddata --model_output d:\tesseract\output\chi_sim.traineddata
>>
>> 3. View the generated traineddata information
>> combine_tessdata -d d:\tesseract\output\chi_sim.traineddata
>>
>> --
>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-oc...@googlegroups.com.
>> To view this discussion visit https://groups.google.com/d/msgid/tesseract-ocr/4f54b4ff-f1f4-4e44-9e49-11a70b759d68n%40googlegroups.com
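On the question of a parameter that avoids the legacy engine: the OCR command in the quoted message passes `--oem 2`, which asks for the legacy and LSTM engines together. A minimal sketch of the alternative, assuming the `--oem` numbering of Tesseract 4/5 (0 = legacy only, 1 = LSTM only, 2 = legacy + LSTM, 3 = default based on what the traineddata contains). The variable names are only illustrative, and this has not been run against the `my_chi_sim` model from this thread:

```shell
# Sketch: request the LSTM-only engine instead of legacy + LSTM.
# Assumed --oem values (Tesseract 4/5):
#   0 = legacy only, 1 = LSTM only, 2 = legacy + LSTM, 3 = default.
OEM=1                  # was 2 in the failing command
IMAGE=0791.tif         # input image from the original post
LANG_NAME=my_chi_sim   # custom traineddata name from the original post

# Same command as in the original post, with only --oem changed.
OCR_CMD="tesseract $IMAGE stdout -l $LANG_NAME --psm 6 --oem $OEM"
echo "$OCR_CMD"

# Run it only when the tesseract binary and the image are both available.
# $OCR_CMD is left unquoted on purpose so the shell splits it into arguments.
if command -v tesseract >/dev/null 2>&1 && [ -f "$IMAGE" ]; then
  $OCR_CMD
fi
```

Leaving `--oem` out altogether should behave like `--oem 3`, which picks the engine from the components actually present in the traineddata file.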