Re: [tesseract-ocr] I tried to train a traineddata file myself, but encountered an [Error]

محمود محمد Thu, 05 Dec 2024 23:47:05 -0800

After you finish working on each image, click Save and point the program at
the path of the box files. Then select the option to create a training file
from the existing box files. When the training process completes, your
.traineddata file will have been created. Name it and add it to your
Tesseract-OCR installation, in the tessdata folder.
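
As a rough illustration of that last step, the sketch below copies a finished model into a tessdata directory under a new name. The helper name install_traineddata and its arguments are hypothetical (it is not part of Tesseract), and the tessdata location varies between installations.

```shell
#!/bin/sh
# install_traineddata SRC LANG_NAME TESSDATA_DIR
# Copies SRC to TESSDATA_DIR/LANG_NAME.traineddata.
# Hypothetical helper for illustration only; not shipped with Tesseract.
install_traineddata() {
    src=$1
    lang=$2
    tessdata_dir=$3
    if [ ! -f "$src" ]; then
        echo "no such file: $src" >&2
        return 1
    fi
    # Create the target directory if it does not exist yet.
    mkdir -p "$tessdata_dir" || return 1
    cp "$src" "$tessdata_dir/$lang.traineddata"
}
```

For example, install_traineddata output/chi_sim.traineddata my_chi_sim /usr/local/share/tessdata, after which tesseract --list-langs should include my_chi_sim if that directory is the one your installation reads.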


On Friday, 6 December 2024 at 11:44 AM, محمود محمد <mahmoudmm55...@gmail.com> wrote:

> You can collect images, put them in a folder, and then use jTessBoxEditor
> to create a training file for your model from that collection of images. To
> start, first select the language, then create the box files by specifying
> the path of the images and clicking Create box files. After that you can
> start training, creating, and improving your model.
>
> On Friday, 6 December 2024 at 11:37 AM, 鹿青年 <luqingnian1...@gmail.com> wrote:
>
>>
>> Thanks for the reply.
>> Yes, I also use jTessBoxEditor, but it is more of a data standardization
>> tool. Some of my font files are incomplete. I want to use LSTM training to
>> train a complete library file of my own.
>> On Friday, 6 December 2024 at 15:15:40 UTC+8, <mahmoud...@gmail.com> wrote:
>>
>>> I think jTessBoxEditor is a good tool for the training process.
>>>
>>> On Friday, 6 December 2024 at 11:07 AM, Zdenko Podobny <zde...@gmail.com> wrote:
>>>
>>>>
>>>> Error: Tesseract (legacy) engine requested, but components are not
>>>> present in /usr/local/share/tessdata/my_chi_sim.traineddata!!
>>>>
>>>>
>>>> The message is clear. YOU explicitly require tesseract to use the legacy
>>>> engine (--oem 2 means legacy + LSTM), but YOUR language data file (the one
>>>> you created by training) does not contain a legacy model.
>>>>
>>>> Zdenko
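
To make the point above concrete: --oem 0 selects the legacy engine, 1 the LSTM engine, 2 both, and 3 the default based on what the file contains. The sketch below suggests a value from a component listing such as the one printed by combine_tessdata -d. The function name suggest_oem is hypothetical, and the listing format it parses (lines like "17:lstm:size=1489196, offset=192") is an assumption that may differ between Tesseract versions.

```shell
#!/bin/sh
# suggest_oem: reads a traineddata component listing on stdin and prints
# an --oem value that matches the components it finds.
# Assumption: one component per line, shaped like
#   "17:lstm:size=1489196, offset=192"
suggest_oem() {
    listing=$(cat)
    has_lstm=no
    has_legacy=no
    # "lstm" is the LSTM network; "inttemp" is one of the legacy classifier parts.
    if printf '%s\n' "$listing" | grep -Eq '^[0-9]+:lstm:'; then has_lstm=yes; fi
    if printf '%s\n' "$listing" | grep -Eq '^[0-9]+:inttemp:'; then has_legacy=yes; fi
    case "$has_lstm/$has_legacy" in
        yes/no)  echo 1 ;;  # LSTM only
        no/yes)  echo 0 ;;  # legacy only
        yes/yes) echo 3 ;;  # both present: let Tesseract use its default
        *)       echo "no recognizer components found" >&2; return 1 ;;
    esac
}
```

A possible use is combine_tessdata -d /usr/local/share/tessdata/my_chi_sim.traineddata 2>&1 | suggest_oem. For a model produced only by lstmtraining this would be expected to print 1, which corresponds to running tesseract 0791.tif stdout -l my_chi_sim --psm 6 --oem 1.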
>>>>
>>>>
>>>> On Fri, 6 Dec 2024 at 7:11, 鹿青年 <luqingn...@gmail.com> wrote:
>>>>
>>>>> Hello, I tried to train a traineddata file myself, but an [Error]
>>>>> occurred during use. Could you please give me some guidance on how to
>>>>> resolve this error? Thank you very much.
>>>>> Perform OCR
>>>>> ···
>>>>> tesseract 0791.tif stdout -l my_chi_sim --psm 6 --oem 2
>>>>> ···
>>>>> The error content is:
>>>>> ····
>>>>> Error: Tesseract (legacy) engine requested, but components are not
>>>>> present in /usr/local/share/tessdata/my_chi_sim.traineddata!!
>>>>> Failed loading language 'my_chi_sim'
>>>>> Tesseract couldn't load any languages!
>>>>> Could not initialize tesseract.
>>>>> ····
>>>>>
>>>>> My training steps are as follows:
>>>>>
>>>>> Punctuation Dictionary:
>>>>> dawg2wordlist d:\tesseract\tessdata_best\chi_sim.lstm-unicharset
>>>>> d:\tesseract\tessdata_best\chi_sim.lstm-punc-dawg
>>>>> d:\tesseract\tessdata_best\punc.txt
>>>>>
>>>>>
>>>>> Let’s start with the key steps
>>>>> 2. Generate character set lstm-unicharset file
>>>>> 1. Generate character set txt file
>>>>>
>>>>> text2image --text d:\tesseract\chi_sim.txt --outputbase
>>>>> d:\tesseract\chi_sim --fonts_dir C:\Windows\Fonts --font="simhei"
>>>>> --fontconfig_tmpdir d:\tesseract\tmp
>>>>>
>>>>>
>>>>> 2. Generate character set lstm-unicharset file
>>>>>
>>>>> 1) Generate with box file
>>>>> unicharset_extractor --norm_mode 3 --output_unicharset
>>>>> d:\tesseract\chi_sim.lstm-unicharset d:\tesseract\chi_sim.box
>>>>>
>>>>> 2) Generate with txt file
>>>>> unicharset_extractor --norm_mode 3 --output_unicharset
>>>>> d:\tesseract\chi_sim.lstm-unicharset d:\tesseract\chi_sim.txt
>>>>>
>>>>>
>>>>> 3. Generate starter traineddata file
>>>>> 1. Generate dictionary text file
>>>>> Refer to the three dictionary files in the d:\tesseract\tessdata_best
>>>>> folder (word = words, number = numbers, punc = punctuation marks)
>>>>> 2. Generate starter traineddata file
>>>>> combine_lang_model --input_unicharset
>>>>> d:\tesseract\chi_sim.lstm-unicharset --lang chi_sim --script_dir
>>>>> d:\tesseract\langdata_lstm --output_dir d:\tesseract --version_str
>>>>> "CSDN:watt:2022.04[1,48,0,1C3,3Ft16Mp3,3TxyLfys64Lfx96RxLrx96Lfx512O1c4000]"
>>>>> --words d:\tesseract\word.txt --numbers d:\tesseract\number.txt --puncs
>>>>> d:\tesseract\punc.txt --pass_through_recoder
>>>>>
>>>>>
>>>>> 3. View the newly generated starter trained data information
>>>>> combine_tessdata -d d:\tesseract\chi_sim\chi_sim.traineddata
>>>>>
>>>>> 4. Generate training files
>>>>> 1. Generate the training text file train.txt
>>>>>
>>>>> 2. Generate picture+box file
>>>>>
>>>>> text2image --text d:\tesseract\train.txt --outputbase
>>>>> d:\tesseract\train --fonts_dir C:\Windows\Fonts --font="simhei"
>>>>> --ptsize 18 --fontconfig_tmpdir d:\tesseract\tmp
>>>>> 3. Generate training files:
>>>>> tesseract d:\tesseract\train.tif d:\tesseract\train -l chi_sim --psm 6
>>>>> lstm.train
>>>>>
>>>>> 4. Create a new training list file
>>>>> Create a new d:\tesseract\train_listfile.txt file with the content
>>>>> d:\tesseract\train.lstmf
>>>>> 5. Training
>>>>>
>>>>> 2. Start training:
>>>>> lstmtraining --traineddata d:\tesseract\chi_sim\chi_sim.traineddata
>>>>> --net_spec "[1,48,0,1Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c4000]"
>>>>> --model_output d:\tesseract\output\output --train_listfile
>>>>> d:\tesseract\train_listfile.txt --max_iterations 0
>>>>> --target_error_rate 0.01 --debug_interval -1
>>>>>
>>>>> 6. Evaluate the generated checkpoint file
>>>>> 1. Generate evaluation text eval.txt
>>>>> Write some evaluation text and save it to d:\tesseract\eval.txt. It
>>>>> should be as comprehensive as possible and have a certain degree of
>>>>> complexity.
>>>>> 2. Generate picture+box file
>>>>> text2image --text d:\tesseract\eval.txt --outputbase d:\tesseract\eval
>>>>> --fonts_dir C:\Windows\Fonts --font="simhei" --ptsize 18
>>>>> --fontconfig_tmpdir d:\tesseract\tmp
>>>>> 3. Generate evaluation lstmf file
>>>>> tesseract d:\tesseract\eval.tif d:\tesseract\eval -l chi_sim --psm 6
>>>>> lstm.train
>>>>> 4. Generate evaluation list file
>>>>> Create a new d:\tesseract\eval_listfile.txt file with the content
>>>>> d:\tesseract\eval.lstmf
>>>>> 5. Start evaluating:
>>>>> lstmeval --model d:\tesseract\output\output_checkpoint --traineddata
>>>>> d:\tesseract\chi_sim\chi_sim.traineddata --eval_listfile
>>>>> d:\tesseract\eval_listfile.txt
>>>>> 7. Generate standard trained data
>>>>> 1. Generate a floating point (decimal) traineddata file (similar to
>>>>> tessdata_best)
>>>>> lstmtraining --stop_training --continue_from
>>>>> d:\tesseract\output\output_checkpoint --traineddata
>>>>> d:\tesseract\chi_sim\chi_sim.traineddata --model_output
>>>>> d:\tesseract\output\chi_sim.traineddata
>>>>> 2. Generate an integer traineddata file (similar to tessdata_fast)
>>>>> lstmtraining --stop_training --convert_to_int --continue_from
>>>>> d:\tesseract\output\output_checkpoint --traineddata
>>>>> d:\tesseract\chi_sim\chi_sim.traineddata --model_output
>>>>> d:\tesseract\output\chi_sim.traineddata
>>>>>
>>>>> 3. View the generated traineddata information
>>>>> combine_tessdata -d d:\tesseract\output\chi_sim.traineddata
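
One detail in step 7 worth noting: both lstmtraining commands use the same --model_output path, so the integer model written second replaces the float model written first. The sketch below only prints the two commands with distinct output names. The function name print_finalize_cmds and the _best/_fast suffixes are hypothetical choices for illustration; the flags are the ones used in the steps above.

```shell
#!/bin/sh
# print_finalize_cmds CHECKPOINT STARTER OUT_DIR LANG
# Prints (does not run) the two lstmtraining commands from step 7, giving the
# float and integer models different file names so neither overwrites the other.
# Hypothetical helper; the _best/_fast suffixes are an illustrative choice.
print_finalize_cmds() {
    checkpoint=$1
    starter=$2
    out_dir=$3
    lang=$4
    # Float model (similar to tessdata_best).
    printf 'lstmtraining --stop_training --continue_from %s --traineddata %s --model_output %s\n' \
        "$checkpoint" "$starter" "$out_dir/${lang}_best.traineddata"
    # Integer model (similar to tessdata_fast).
    printf 'lstmtraining --stop_training --convert_to_int --continue_from %s --traineddata %s --model_output %s\n' \
        "$checkpoint" "$starter" "$out_dir/${lang}_fast.traineddata"
}
```

Calling print_finalize_cmds output/output_checkpoint chi_sim/chi_sim.traineddata output my_chi_sim prints two command lines that can be reviewed and then run by hand, with paths adapted to your own layout.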
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to tesseract-oc...@googlegroups.com.
>>>>> To view this discussion visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/4f54b4ff-f1f4-4e44-9e49-11a70b759d68n%40googlegroups.com
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/4f54b4ff-f1f4-4e44-9e49-11a70b759d68n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>>
