Hello, I would like to create, with your help, a simple traineddata file for Tesseract using jTessBoxEditor and then test it. Could you let me know when you have time to discuss the steps? Thanks.
On Friday, 6 December 2024 at 10:11 AM, 鹿青年 <luqingnian1...@gmail.com> wrote:

> Hello, I tried to train a traineddata file myself, but an [Error] occurred
> during use. Could you please give me some guidance on how to resolve this
> error? Thank you very much.
>
> Perform OCR
> ···
> tesseract 0791.tif stdout -l my_chi_sim --psm 6 --oem 2
> ···
> The error content is:
> ····
> Error: Tesseract (legacy) engine requested, but components are not present in /usr/local/share/tessdata/my_chi_sim.traineddata!!
> Failed loading language 'my_chi_sim'
> Tesseract couldn't load any languages!
> Could not initialize tesseract.
> ····
>
> My training steps are as follows:
>
> Punctuation Dictionary:
> dawg2wordlist d:\tesseract\tessdata_best\chi_sim.lstm-unicharset d:\tesseract\tessdata_best\chi_sim.lstm-punc-dawg d:\tesseract\tessdata_best\punc.txt
>
> Let’s start with the key steps
> 2. Generate character set lstm-unicharset file
> 1. Generate character set txt file
>
> text2image --text d:\tesseract\chi_sim.txt --outputbase d:\tesseract\chi_sim --fonts_dir C:\Windows\Fonts --font="simhei" --fontconfig_tmpdir d:\tesseract\tmp
>
> 3. Generate character set lstm-unicharset file
>
>    1) Generate with box file
>    unicharset_extractor --norm_mode 3 --output_unicharset d:\tesseract\chi_sim.lstm-unicharset d:\tesseract\chi_sim.box
>
>    2) Generate with txt file
>    unicharset_extractor --norm_mode 3 --output_unicharset d:\tesseract\chi_sim.lstm-unicharset d:\tesseract\chi_sim.txt
>
> 3. Generate starter traineddata file
>    1. Generate dictionary text file
>    Refer to the 3 dictionary files in the d:\tesseract\tessdata_best folder (word: text, number: numbers, punc: punctuation marks)
>    2. Generate starter traineddata file
>    combine_lang_model --input_unicharset d:\tesseract\chi_sim.lstm-unicharset --lang chi_sim --script_dir d:\tesseract\langdata_lstm --output_dir d:\tesseract --version_str "CSDN:watt:2022.04[1,48,0,1C3,3Ft16Mp3,3TxyLfys64Lfx96RxLrx96Lfx512O1c4000]" --words d:\tesseract\word.txt --numbers d:\tesseract\number.txt --puncs d:\tesseract\punc.txt --pass_through_recoder
>
>    3. View the newly generated starter traineddata information
>    combine_tessdata -d d:\tesseract\chi_sim\chi_sim.traineddata
>
> 4. Generate training files
>    1. Generate the training text file train.txt
>
>    2. Generate picture+box file
>    text2image --text d:\tesseract\train.txt --outputbase d:\tesseract\train --fonts_dir C:\Windows\Fonts --font="simhei" --ptsize 18 --fontconfig_tmpdir d:\tesseract\tmp
>    3. Generate training files:
>    tesseract d:\tesseract\train.tif d:\tesseract\train -l chi_sim --psm 6 lstm.train
>
>    4. Create a new training list file
>    Create a new d:\tesseract\train_listfile.txt file with the content d:\tesseract\train.lstmf
> 5. Training
>
>    2. Start training:
>    lstmtraining --traineddata d:\tesseract\chi_sim\chi_sim.traineddata --net_spec "[1,48,0,1Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c4000]" --model_output d:\tesseract\output\output --train_listfile d:\tesseract\train_listfile.txt --max_iterations 0 --target_error_rate 0.01 --debug_interval -1
>
> 6. Evaluate the generated checkpoint file
>    1. Generate evaluation text eval.txt
>    Write some evaluation text and save it to d:\tesseract\eval.txt; it should be as comprehensive as possible and reasonably complex.
>    2. Generate picture+box file
>    text2image --text d:\tesseract\eval.txt --outputbase d:\tesseract\eval --fonts_dir C:\Windows\Fonts --font="simhei" --ptsize 18 --fontconfig_tmpdir d:\tesseract\tmp
>    3. Generate evaluation lstmf file
>    tesseract d:\tesseract\eval.tif d:\tesseract\eval -l chi_sim --psm 6 lstm.train
>    4. Generate evaluation list file
>    Create a new d:\tesseract\eval_listfile.txt file with the content d:\tesseract\eval.lstmf
>    5. Start evaluating
>    lstmeval --model d:\tesseract\output\output_checkpoint --traineddata d:\tesseract\chi_sim\chi_sim.traineddata --eval_listfile d:\tesseract\eval_listfile.txt
> 7. Generate standard traineddata
>    1. Generate a floating point (decimal) traineddata file (similar to tessdata_best)
>    lstmtraining --stop_training --continue_from d:\tesseract\output\output_checkpoint --traineddata d:\tesseract\chi_sim\chi_sim.traineddata --model_output d:\tesseract\output\chi_sim.traineddata
>    2. Generate an integer traineddata file (similar to tessdata_fast)
>    lstmtraining --stop_training --convert_to_int --continue_from d:\tesseract\output\output_checkpoint --traineddata d:\tesseract\chi_sim\chi_sim.traineddata --model_output d:\tesseract\output\chi_sim.traineddata
>
>    3. View the generated traineddata information
>    combine_tessdata -d d:\tesseract\output\chi_sim.traineddata
>
> --
> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion visit
> https://groups.google.com/d/msgid/tesseract-ocr/4f54b4ff-f1f4-4e44-9e49-11a70b759d68n%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/4f54b4ff-f1f4-4e44-9e49-11a70b759d68n%40googlegroups.com?utm_medium=email&utm_source=footer>.
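
A note on the quoted error, offered as a sketch rather than a confirmed diagnosis: `--oem 2` asks Tesseract for the legacy engine combined with LSTM, while a traineddata file produced with `lstmtraining` normally carries only LSTM components, so `--oem 1` (LSTM only) or leaving `--oem` out is the likely fix. The small Python helper below reads the component listing printed by `combine_tessdata -d` and reports which `--oem` values look usable. The entry format (`index:name:size=..., offset=...`), the rule "legacy is present if there is a `unicharset` entry", the function names, and the sample listing are all my assumptions for illustration, not real output from the file in this thread.

```python
import re

# Assumed entry format of `combine_tessdata -d`, e.g. "17:lstm:size=1000, offset=192".
ENTRY = re.compile(r"^\s*\d+:([A-Za-z-]+):", re.MULTILINE)


def components(listing):
    """Return the set of component names found in a directory listing."""
    return set(ENTRY.findall(listing))


def usable_oem_modes(listing):
    """Guess which --oem values the traineddata can serve.

    Heuristic (an assumption): the legacy engine needs a 'unicharset'
    component, and the LSTM engine needs an 'lstm' component.
    """
    names = components(listing)
    has_legacy = "unicharset" in names
    has_lstm = "lstm" in names
    modes = []
    if has_legacy:
        modes.append(0)  # legacy engine only
    if has_lstm:
        modes.append(1)  # LSTM engine only
    if has_legacy and has_lstm:
        modes.append(2)  # legacy + LSTM combined
    return modes


# Hypothetical listing for an LSTM-only file (made up for this example).
SAMPLE = """\
Version string:example
17:lstm:size=1000, offset=192
21:lstm-unicharset:size=200, offset=1192
22:lstm-recoder:size=100, offset=1392
23:version:size=7, offset=1492
"""

if __name__ == "__main__":
    print(usable_oem_modes(SAMPLE))
```

If the listing for my_chi_sim.traineddata turns out to contain only `lstm*` entries, the OCR command from the quoted message would become `tesseract 0791.tif stdout -l my_chi_sim --psm 6 --oem 1`.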