After you finish training on each image, click Save and point the program to the folder that contains the box files. Then select the option to create a training file using external boxes. When the training process completes, your .traineddata file will have been created. Name it and add it to your Tesseract-OCR installation, in the tessdata folder.
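The last step above (naming the file and adding it to the tessdata folder) can also be done from the command line. The following is only a minimal sketch, not part of the GUI workflow described above: the helper name `install_traineddata` and the example paths in the comments are hypothetical, and it assumes a POSIX shell and write access to the tessdata directory.

```shell
#!/bin/sh
# install_traineddata SRC NAME TESSDATA_DIR
# Hypothetical helper: copies a trained model into a tessdata directory
# under the language name NAME, so that `tesseract ... -l NAME` can find it.
install_traineddata() {
    src=$1
    name=$2
    dest_dir=$3
    # Refuse to continue if the model file or the target directory is missing.
    [ -f "$src" ] || { echo "missing file: $src" >&2; return 1; }
    [ -d "$dest_dir" ] || { echo "missing directory: $dest_dir" >&2; return 1; }
    cp -- "$src" "$dest_dir/$name.traineddata"
}

# Example (placeholder paths, adjust to your own installation):
#   install_traineddata output/chi_sim.traineddata my_chi_sim /usr/local/share/tessdata
#   tesseract 0791.tif stdout -l my_chi_sim --psm 6
```

The language name passed to `-l` is simply the file name without the `.traineddata` extension, which is why the copy step also renames the file.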
On Fri, Dec 6, 2024 at 11:44 AM محمود محمد <mahmoudmm55...@gmail.com> wrote:

> You can collect images, put them in a folder, and then use jTessBoxEditor
> to create a training file for your model from that collection of images. To
> start, first select the language, then create the box files by
> specifying the path of the images and clicking the Create box files
> button. Then start training, creating, and improving your model.
>
> On Fri, Dec 6, 2024 at 11:37 AM 鹿青年 <luqingnian1...@gmail.com> wrote:
>
>> Thanks for the reply.
>> Yes, I also use jTessBoxEditor at the same time, but jTessBoxEditor is
>> more for data standardization. Some of the font files are incomplete.
>> I want to use LSTM training to train a complete library file of my own.
>>
>> On Friday, December 6, 2024 at 15:15:40 UTC+8, <mahmoud...@gmail.com> wrote:
>>
>>> I think using jTessBoxEditor is good for the training process.
>>>
>>> On Fri, Dec 6, 2024 at 11:07 AM Zdenko Podobny <zde...@gmail.com> wrote:
>>>
>>>> Error: Tesseract (legacy) engine requested, but components are not
>>>> present in /usr/local/share/tessdata/my_chi_sim.traineddata!!
>>>>
>>>> The message is clear. YOU explicitly require tesseract to use the
>>>> legacy engine, but YOUR language data file (the one you created by
>>>> training) does not contain a legacy model.
>>>>
>>>> Zdenko
>>>>
>>>> On Fri, Dec 6, 2024 at 7:11 AM 鹿青年 <luqingn...@gmail.com> wrote:
>>>>
>>>>> Hello, I tried to train a traineddata file myself, but an error
>>>>> occurred during use. Could you please give me some guidance on how to
>>>>> resolve this error? Thank you very much.
>>>>> Perform OCR:
>>>>> ···
>>>>> tesseract 0791.tif stdout -l my_chi_sim --psm 6 --oem 2
>>>>> ···
>>>>> The error content is:
>>>>> ····
>>>>> Error: Tesseract (legacy) engine requested, but components are not
>>>>> present in /usr/local/share/tessdata/my_chi_sim.traineddata!!
>>>>> Failed loading language 'my_chi_sim'
>>>>> Tesseract couldn't load any languages!
>>>>> Could not initialize tesseract.
>>>>> ····
>>>>>
>>>>> My training steps are as follows:
>>>>>
>>>>> Punctuation dictionary:
>>>>> dawg2wordlist d:\tesseract\tessdata_best\chi_sim.lstm-unicharset d:\tesseract\tessdata_best\chi_sim.lstm-punc-dawg d:\tesseract\tessdata_best\punc.txt
>>>>>
>>>>> Let’s start with the key steps.
>>>>>
>>>>> 2. Generate character set lstm-unicharset file
>>>>>    1. Generate character set txt file
>>>>>
>>>>>    text2image --text d:\tesseract\chi_sim.txt --outputbase d:\tesseract\chi_sim --fonts_dir C:\Windows\Fonts --font="simhei" --fontconfig_tmpdir d:\tesseract\tmp
>>>>>
>>>>>    3. Generate character set lstm-unicharset file
>>>>>
>>>>>    1) Generate with box file
>>>>>    unicharset_extractor --norm_mode 3 --output_unicharset d:\tesseract\chi_sim.lstm-unicharset d:\tesseract\chi_sim.box
>>>>>
>>>>>    2) Generate with txt file
>>>>>    unicharset_extractor --norm_mode 3 --output_unicharset d:\tesseract\chi_sim.lstm-unicharset d:\tesseract\chi_sim.txt
>>>>>
>>>>> 3. Generate starter traineddata file
>>>>>    1. Generate dictionary text files
>>>>>    Refer to the 3 dictionary files in the d:\tesseract\tessdata_best
>>>>>    folder (word: text, number: numbers, punc: punctuation marks)
>>>>>    2. Generate starter traineddata file
>>>>>    combine_lang_model --input_unicharset d:\tesseract\chi_sim.lstm-unicharset --lang chi_sim --script_dir d:\tesseract\langdata_lstm --output_dir d:\tesseract --version_str "CSDN:watt:2022.04[1,48,0,1C3,3Ft16Mp3,3TxyLfys64Lfx96RxLrx96Lfx512O1c4000]" --words d:\tesseract\word.txt --numbers d:\tesseract\number.txt --puncs d:\tesseract\punc.txt --pass_through_recoder
>>>>>
>>>>>    3. View the newly generated starter traineddata information
>>>>>    combine_tessdata -d d:\tesseract\chi_sim\chi_sim.traineddata
>>>>>
>>>>> 4. Generate training files
>>>>>    1. Generate the training text file train.txt
>>>>>
>>>>>    2. Generate picture + box file
>>>>>
>>>>>    text2image --text d:\tesseract\train.txt --outputbase d:\tesseract\train --fonts_dir C:\Windows\Fonts --font="simhei" --ptsize 18 --fontconfig_tmpdir d:\tesseract\tmp
>>>>>    3. Generate training files:
>>>>>    tesseract d:\tesseract\train.tif d:\tesseract\train -l chi_sim --psm 6 lstm.train
>>>>>
>>>>>    4. Create a new training list file
>>>>>    Create a new d:\tesseract\train_listfile.txt file with the content
>>>>>    d:\tesseract\train.lstmf
>>>>> 5. Training
>>>>>
>>>>>    2. Start training:
>>>>>    lstmtraining --traineddata d:\tesseract\chi_sim\chi_sim.traineddata --net_spec "[1,48,0,1Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c4000]" --model_output d:\tesseract\output\output --train_listfile d:\tesseract\train_listfile.txt --max_iterations 0 --target_error_rate 0.01 --debug_interval -1
>>>>>
>>>>> 6. Evaluate the generated checkpoint file
>>>>>    1. Generate evaluation text eval.txt
>>>>>    Edit some evaluation text and save it to d:\tesseract\eval.txt. It
>>>>>    should be as comprehensive as possible and reasonably complex.
>>>>>    2. Generate picture + box file
>>>>>    text2image --text d:\tesseract\eval.txt --outputbase d:\tesseract\eval --fonts_dir C:\Windows\Fonts --font="simhei" --ptsize 18 --fontconfig_tmpdir d:\tesseract\tmp
>>>>>    3. Generate evaluation lstmf file
>>>>>    tesseract d:\tesseract\eval.tif d:\tesseract\eval -l chi_sim --psm 6 lstm.train
>>>>>    4. Generate evaluation list file
>>>>>    Create a new d:\tesseract\eval_listfile.txt file with the content
>>>>>    d:\tesseract\eval.lstmf
>>>>>    5. Start evaluating:
>>>>>    lstmeval --model d:\tesseract\output\output_checkpoint --traineddata d:\tesseract\chi_sim\chi_sim.traineddata --eval_listfile d:\tesseract\eval_listfile.txt
>>>>> 7. Generate standard traineddata
>>>>>    1. Generate a floating point (decimal) traineddata file (similar to
>>>>>    tessdata_best)
>>>>>    lstmtraining --stop_training --continue_from d:\tesseract\output\output_checkpoint --traineddata d:\tesseract\chi_sim\chi_sim.traineddata --model_output d:\tesseract\output\chi_sim.traineddata
>>>>>    2. Generate an integer traineddata file (similar to tessdata_fast)
>>>>>    lstmtraining --stop_training --convert_to_int --continue_from d:\tesseract\output\output_checkpoint --traineddata d:\tesseract\chi_sim\chi_sim.traineddata --model_output d:\tesseract\output\chi_sim.traineddata
>>>>>
>>>>>    3. View the generated traineddata information
>>>>>    combine_tessdata -d d:\tesseract\output\chi_sim.traineddata
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to tesseract-oc...@googlegroups.com.
>>>>> To view this discussion visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/4f54b4ff-f1f4-4e44-9e49-11a70b759d68n%40googlegroups.com
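Zdenko's diagnosis in this thread comes down to the `--oem` value: `--oem 0` and `--oem 2` both need legacy components, while a model produced only by `lstmtraining` carries LSTM components and should be run with `--oem 1`. The sketch below shows one way to pick the value from a `combine_tessdata -d` listing. It rests on assumptions that should be checked against your own output: the helper name `pick_oem` is hypothetical, and it assumes the listing contains lines of the form `17:lstm:size=..., offset=...` for an LSTM model and `3:inttemp:size=..., offset=...` for a legacy model.

```shell
#!/bin/sh
# pick_oem: hypothetical helper. Reads a `combine_tessdata -d` listing on
# stdin and prints a suitable --oem value:
#   0 = legacy only, 1 = LSTM only, 2 = legacy + LSTM.
# Assumption: component lines look like "<index>:<name>:size=..., offset=...".
pick_oem() {
    listing=$(cat)
    has_lstm=no
    has_legacy=no
    # "lstm" is the LSTM network; "inttemp" is one of the legacy classifier files.
    printf '%s\n' "$listing" | grep -Eq '^[0-9]+:lstm:' && has_lstm=yes
    printf '%s\n' "$listing" | grep -Eq '^[0-9]+:inttemp:' && has_legacy=yes
    case "$has_legacy/$has_lstm" in
        yes/yes) echo 2 ;;
        yes/no)  echo 0 ;;
        no/yes)  echo 1 ;;
        *)       echo "no recognizable engine components" >&2; return 1 ;;
    esac
}

# Example (stderr is merged in because the listing may be printed there):
#   oem=$(combine_tessdata -d /usr/local/share/tessdata/my_chi_sim.traineddata 2>&1 | pick_oem)
#   tesseract 0791.tif stdout -l my_chi_sim --psm 6 --oem "$oem"
```

For the model in this thread, which was trained with `lstmtraining` only, the simplest fix is to run the original command with `--oem 1` instead of `--oem 2`.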