You can collect your images in a folder and then use jTessBoxEditor to create training files for your model from that collection. First select the language, then specify the path to the images and click Create box files to generate the box files. After that you can start training, and then keep correcting and improving your model.
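If you would rather script the box-file step than click through the GUI, something like the sketch below might be a starting point. It is not what jTessBoxEditor runs internally; it only builds one `tesseract ... batch.nochop makebox` command per image in a folder. The function name `makebox_commands`, the suffix list and the `chi_sim` default are my own placeholders, and I am assuming the `makebox` config file is present in your tessdata `configs` directory.

```python
from pathlib import Path

# Image types to pick up; this list is an assumption, adjust it to your data.
IMAGE_SUFFIXES = {".tif", ".tiff", ".png", ".jpg", ".jpeg"}


def makebox_commands(image_dir, lang="chi_sim"):
    """Build one tesseract command (as an argument list) per image in image_dir.

    Nothing is executed here; the caller decides whether to run the commands.
    """
    commands = []
    for image in sorted(Path(image_dir).iterdir()):
        if image.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip text files, existing .box files, and so on
        # tesseract appends ".box" to the output base on its own
        output_base = image.with_suffix("")
        commands.append([
            "tesseract", str(image), str(output_base),
            "-l", lang, "--psm", "6",
            "batch.nochop", "makebox",
        ])
    return commands


if __name__ == "__main__":
    # Print the commands for the current folder so they can be reviewed first.
    for command in makebox_commands("."):
        print(" ".join(command))
```

Each argument list can be passed to `subprocess.run` once the printed commands look right. The generated box files still need manual correction in the editor before training.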
On Friday, 6 December 2024 at 11:37 AM, 鹿青年 <luqingnian1...@gmail.com> wrote:
>
> Thanks for the reply.
> Yes, I also use jTessBoxEditor at the same time, but jTessBoxEditor is
> more like data standardization. Some of the font files are incomplete.
> I want to use LSTM training to train a complete library file of my own.
>
> On Friday, 6 December 2024 at 15:15:40 UTC+8, <mahmoud...@gmail.com> wrote:
>
>> I think using jTessBoxEditor is good for the training process
>>
>> On Friday, 6 December 2024 at 11:07 AM, Zdenko Podobny <zde...@gmail.com> wrote:
>>
>>> Error: Tesseract (legacy) engine requested, but components are not
>>> present in /usr/local/share/tessdata/my_chi_sim.traineddata!!
>>>
>>> The message is clear. YOU explicitly require tesseract to use the
>>> legacy engine, but YOUR language data file (the one you created by
>>> training) does not contain a legacy model.
>>>
>>> Zdenko
>>>
>>> On Friday, 6 December 2024 at 7:11, 鹿青年 <luqingn...@gmail.com> wrote:
>>>
>>>> Hello, I tried to train a traineddata file myself, but an [Error]
>>>> occurred during use. Could you please give me some guidance on how to
>>>> resolve this error? Thank you very much.
>>>> Perform OCR
>>>> ···
>>>> tesseract 0791.tif stdout -l my_chi_sim --psm 6 --oem 2
>>>> ···
>>>> The error content is:
>>>> ····
>>>> Error: Tesseract (legacy) engine requested, but components are not
>>>> present in /usr/local/share/tessdata/my_chi_sim.traineddata!!
>>>> Failed loading language 'my_chi_sim'
>>>> Tesseract couldn't load any languages!
>>>> Could not initialize tesseract.
>>>> ····
>>>>
>>>> My training steps are as follows:
>>>>
>>>> Punctuation Dictionary:
>>>> dawg2wordlist d:\tesseract\tessdata_best\chi_sim.lstm-unicharset d:\tesseract\tessdata_best\chi_sim.lstm-punc-dawg d:\tesseract\tessdata_best\punc.txt
>>>>
>>>> Let’s start with the key steps
>>>> 2. Generate character set lstm-unicharset file
>>>> 1. Generate character set txt file
>>>>
>>>> text2image --text d:\tesseract\chi_sim.txt --outputbase d:\tesseract\chi_sim --fonts_dir C:\Windows\Fonts --font="simhei" --fontconfig_tmpdir d:\tesseract\tmp
>>>>
>>>> 3. Generate character set lstm-unicharset file
>>>>
>>>> 1) Generate with box file
>>>> unicharset_extractor --norm_mode 3 --output_unicharset d:\tesseract\chi_sim.lstm-unicharset d:\tesseract\chi_sim.box
>>>>
>>>> 2) Generate with txt file
>>>> unicharset_extractor --norm_mode 3 --output_unicharset d:\tesseract\chi_sim.lstm-unicharset d:\tesseract\chi_sim.txt
>>>>
>>>> 3. Generate starter traineddata file
>>>> 1. Generate dictionary text file
>>>> Refer to the 3 dictionary files in the d:\tesseract\tessdata_best
>>>> folder (word: words, number: numbers, punc: punctuation marks)
>>>> 2. Generate starter traineddata file
>>>> combine_lang_model --input_unicharset d:\tesseract\chi_sim.lstm-unicharset --lang chi_sim --script_dir d:\tesseract\langdata_lstm --output_dir d:\tesseract --version_str "CSDN:watt:2022.04[1,48,0,1C3,3Ft16Mp3,3TxyLfys64Lfx96RxLrx96Lfx512O1c4000]" --words d:\tesseract\word.txt --numbers d:\tesseract\number.txt --puncs d:\tesseract\punc.txt --pass_through_recoder
>>>>
>>>> 3. View the newly generated starter traineddata information
>>>> combine_tessdata -d d:\tesseract\chi_sim\chi_sim.traineddata
>>>>
>>>> 4. Generate training files
>>>> 1. Generate the training text file train.txt
>>>>
>>>> 2. Generate picture+box file
>>>>
>>>> text2image --text d:\tesseract\train.txt --outputbase d:\tesseract\train --fonts_dir C:\Windows\Fonts --font="simhei" --ptsize 18 --fontconfig_tmpdir d:\tesseract\tmp
>>>> 3. Generate training files:
>>>> tesseract d:\tesseract\train.tif d:\tesseract\train -l chi_sim --psm 6 lstm.train
>>>>
>>>> 4. Create a new training list file
>>>> Create a new d:\tesseract\train_listfile.txt file with the content
>>>> d:\tesseract\train.lstmf
>>>> 5. Training
>>>>
>>>> 2. Start training:
>>>> lstmtraining --traineddata d:\tesseract\chi_sim\chi_sim.traineddata --net_spec "[1,48,0,1Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c4000]" --model_output d:\tesseract\output\output --train_listfile d:\tesseract\train_listfile.txt --max_iterations 0 --target_error_rate 0.01 --debug_interval -1
>>>>
>>>> 6. Evaluate the generated checkpoint file
>>>> 1. Generate evaluation text eval.txt
>>>> Write some evaluation text and save it to d:\tesseract\eval.txt. It
>>>> should be as comprehensive as possible and reasonably complex.
>>>> 2. Generate picture+box file
>>>> text2image --text d:\tesseract\eval.txt --outputbase d:\tesseract\eval --fonts_dir C:\Windows\Fonts --font="simhei" --ptsize 18 --fontconfig_tmpdir d:\tesseract\tmp
>>>> 3. Generate evaluation lstmf file
>>>> tesseract d:\tesseract\eval.tif d:\tesseract\eval -l chi_sim --psm 6 lstm.train
>>>> 4. Generate evaluation list file
>>>> Create a new d:\tesseract\eval_listfile.txt file with the content
>>>> d:\tesseract\eval.lstmf
>>>> 5. Start evaluating:
>>>> lstmeval --model d:\tesseract\output\output_checkpoint --traineddata d:\tesseract\chi_sim\chi_sim.traineddata --eval_listfile d:\tesseract\eval_listfile.txt
>>>> 7. Generate standard traineddata
>>>> 1. Generate a floating point (decimal) traineddata file (similar to
>>>> tessdata_best)
>>>> lstmtraining --stop_training --continue_from d:\tesseract\output\output_checkpoint --traineddata d:\tesseract\chi_sim\chi_sim.traineddata --model_output d:\tesseract\output\chi_sim.traineddata
>>>> 2. Generate an integer traineddata file (similar to tessdata_fast)
>>>> lstmtraining --stop_training --convert_to_int --continue_from d:\tesseract\output\output_checkpoint --traineddata d:\tesseract\chi_sim\chi_sim.traineddata --model_output d:\tesseract\output\chi_sim.traineddata
>>>>
>>>> 3. View the generated traineddata information
>>>> combine_tessdata -d d:\tesseract\output\chi_sim.traineddata
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to tesseract-oc...@googlegroups.com.
>>>> To view this discussion visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/4f54b4ff-f1f4-4e44-9e49-11a70b759d68n%40googlegroups.com

--
To view this discussion visit https://groups.google.com/d/msgid/tesseract-ocr/CAB5aXsm-t9Cz6G1c1giBn_YqoR8BH4svUT7QdpSJWer9yrcyoQ%40mail.gmail.com.
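
To make Zdenko's point concrete: `--oem 2` asks for the legacy engine plus LSTM, while a file produced by `lstmtraining` carries only LSTM components, so `--oem 1` (LSTM only) is the mode such a file can serve. The sketch below is one way to check this from the text that `combine_tessdata -d` prints. The helper names are hypothetical, the `N:name:size=...` line shape is my assumption about that output, and treating `inttemp`, `pffmtable` and `normproto` as the marker of a legacy model is a simplification.

```python
import re

# Components taken here as evidence of a legacy model (a simplifying assumption).
LEGACY_COMPONENTS = {"inttemp", "pffmtable", "normproto"}


def component_names(listing):
    """Collect component names from text shaped like `combine_tessdata -d` output.

    Assumes lines of the form "17:lstm:size=401636, offset=192".
    """
    names = set()
    for line in listing.splitlines():
        match = re.match(r"\s*\d+:([\w-]+):size=", line)
        if match:
            names.add(match.group(1))
    return names


def usable_oem_modes(names):
    """Return the --oem values (0 legacy, 1 LSTM, 2 both) the components allow."""
    has_legacy = LEGACY_COMPONENTS <= names
    has_lstm = "lstm" in names
    modes = []
    if has_legacy:
        modes.append(0)
    if has_lstm:
        modes.append(1)
    if has_legacy and has_lstm:
        modes.append(2)
    return modes


if __name__ == "__main__":
    # Illustrative listing with made-up sizes, not real output from the thread.
    sample = (
        "17:lstm:size=401636, offset=192\n"
        "21:lstm-unicharset:size=7477, offset=401828\n"
    )
    print(usable_oem_modes(component_names(sample)))
```

For an LSTM-only file like the one trained in this thread, the command would then become `tesseract 0791.tif stdout -l my_chi_sim --psm 6 --oem 1`.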