Re: [tesseract-ocr] I tried to train a traineddata file myself, but encountered an [Error]

محمود محمد Wed, 11 Dec 2024 05:21:19 -0800

Hello, I would like to work with you to generate a simple traineddata file for
Tesseract with jTessBoxEditor and test it. Could you tell me when you have time
to discuss the steps? Thanks.


On Friday, 6 December 2024 at 10:11 AM, 鹿青年 <luqingnian1...@gmail.com> wrote:

> Hello, I tried to train a traineddata file myself, but an [Error] occurred
> during use. Could you please give me some guidance on how to resolve this
> error? Thank you very much.
> Perform OCR
> ···
> tesseract 0791.tif stdout -l my_chi_sim --psm 6 --oem 2
> ···
> The error content is:
> ····
> Error: Tesseract (legacy) engine requested, but components are not present
> in /usr/local/share/tessdata/my_chi_sim.traineddata!!
> Failed loading language 'my_chi_sim'
> Tesseract couldn't load any languages!
> Could not initialize tesseract.
> ····
>
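
A note on the error quoted above: --oem 2 asks Tesseract to run the legacy
engine together with the LSTM engine, while the training steps quoted below
only produce LSTM components. That matches the "legacy engine requested, but
components are not present" message. A minimal sketch, assuming the file is
otherwise valid and installed as my_chi_sim.traineddata, is to request the
LSTM engine only with --oem 1. The helper name build_ocr_command is
hypothetical; the image name, language and --psm value are taken from the
original command.

```python
def build_ocr_command(image, lang, psm=6, oem=1):
    """Return the tesseract argument list.

    oem=1 selects the LSTM engine only; oem=2 (used in the failing
    command) selects legacy + LSTM and needs legacy components.
    """
    return ["tesseract", image, "stdout", "-l", lang,
            "--psm", str(psm), "--oem", str(oem)]


if __name__ == "__main__":
    # Prints the command line to run by hand; nothing is executed here.
    print(" ".join(build_ocr_command("0791.tif", "my_chi_sim")))
```

The sketch only builds and prints the command, so it can be checked without
Tesseract installed.
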
> My training steps are as follows:
>
> Punctuation Dictionary:
> dawg2wordlist d:\tesseract\tessdata_best\chi_sim.lstm-unicharset
> d:\tesseract\tessdata_best\chi_sim.lstm-punc-dawg
> d:\tesseract\tessdata_best\punc.txt
>
>
> Let’s start with the key steps
> 2. Generate character set lstm-unicharset file
> 1. Generate character set txt file
>
> text2image --text d:\tesseract\chi_sim.txt --outputbase
> d:\tesseract\chi_sim --fonts_dir C:\Windows\Fonts --font="simhei"
> --fontconfig_tmpdir d:\tesseract\tmp
>
>
> 3. Generate character set lstm-unicharset file
>
> 1) Generate with box file
> unicharset_extractor --norm_mode 3 --output_unicharset
> d:\tesseract\chi_sim.lstm-unicharset d:\tesseract\chi_sim.box
>
> 2) Generate with txt file
> unicharset_extractor --norm_mode 3 --output_unicharset
> d:\tesseract\chi_sim.lstm-unicharset d:\tesseract\chi_sim.txt
>
>
> 3. Generate starter traineddata file
> 1. Generate dictionary text file
> Use the 3 dictionary files in the d:\tesseract\tessdata_best folder as a
> reference (word: words, number: numbers, punc: punctuation marks)
> 2. Generate starter traineddata file
> combine_lang_model --input_unicharset d:\tesseract\chi_sim.lstm-unicharset
> --lang chi_sim --script_dir d:\tesseract\langdata_lstm --output_dir
> d:\tesseract --version_str
> "CSDN:watt:2022.04[1,48,0,1C3,3Ft16Mp3,3TxyLfys64Lfx96RxLrx96Lfx512O1c4000]"
> --words d:\tesseract\word.txt --numbers d:\tesseract\number.txt --puncs
> d:\tesseract\punc.txt --pass_through_recoder
>
>
> 3. View the newly generated starter trained data information
> combine_tessdata -d d:\tesseract\chi_sim\chi_sim.traineddata
>
> 4. Generate training files
> 1. Generate the training text file train.txt
>
> 2. Generate picture+box file
>
> text2image --text d:\tesseract\train.txt --outputbase d:\tesseract\train
> --fonts_dir C:\Windows\Fonts --font="simhei" --ptsize 18
> --fontconfig_tmpdir d:\tesseract\tmp
> 3. Generate training files:
> tesseract d:\tesseract\train.tif d:\tesseract\train -l chi_sim --psm 6
> lstm.train
>
> 4. Create a new training list file
> Create a new d:\tesseract\train_listfile.txt file with the content
> d:\tesseract\train.lstmf
> 5. Training
>
> 2. Start training:
> lstmtraining --traineddata d:\tesseract\chi_sim\chi_sim.traineddata
> --net_spec "[1,48,0,1Ct3,3,16 Mp3,3 Lfys64 Lfx96 Lrx96 Lfx512 O1c4000]"
> --model_output d:\tesseract\output\output --train_listfile
> d:\tesseract\train_listfile.txt --max_iterations 0 --target_error_rate 0.01
> --debug_interval -1
>
> 6. Evaluate the generated checkpoint file
> 1. Generate evaluation text eval.txt
> Write some evaluation text and save it to d:\tesseract\eval.txt. It should
> be as comprehensive as possible and have a certain degree of complexity.
> 2. Generate picture+box file
> text2image --text d:\tesseract\eval.txt --outputbase d:\tesseract\eval
> --fonts_dir C:\Windows\Fonts --font="simhei" --ptsize 18
> --fontconfig_tmpdir d:\tesseract\tmp
> 3. Generate evaluation lstmf file
> tesseract d:\tesseract\eval.tif d:\tesseract\eval -l chi_sim --psm 6
> lstm.train
> 4. Generate evaluation list file
> Create a new d:\tesseract\eval_listfile.txt file with the content
> d:\tesseract\eval.lstmf
> 5. Start evaluating
>
> Start evaluating:
> lstmeval --model d:\tesseract\output\output_checkpoint --traineddata
> d:\tesseract\chi_sim\chi_sim.traineddata --eval_listfile
> d:\tesseract\eval_listfile.txt
> 7. Generate standard trained data
> 1. Generate a floating point (decimal) traineddata file (similar to
> tessdata_best)
> lstmtraining --stop_training --continue_from
> d:\tesseract\output\output_checkpoint --traineddata
> d:\tesseract\chi_sim\chi_sim.traineddata --model_output
> d:\tesseract\output\chi_sim.traineddata
> 2. Generate an integer traineddata file (similar to tessdata_fast)
> lstmtraining --stop_training --convert_to_int --continue_from
> d:\tesseract\output\output_checkpoint --traineddata
> d:\tesseract\chi_sim\chi_sim.traineddata --model_output
> d:\tesseract\output\chi_sim.traineddata
>
> 3. View the generated traineddata information
> combine_tessdata -d d:\tesseract\output\chi_sim.traineddata
>
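
One detail in step 7 above: both lstmtraining commands write to
d:\tesseract\output\chi_sim.traineddata, so the integer model would overwrite
the floating-point one. A minimal sketch, assuming you want to keep both,
builds the two commands with separate output names. The file names
chi_sim_float.traineddata and chi_sim_int.traineddata and the helper
build_export_command are hypothetical; the flags and the other paths are the
ones used in the original commands.

```python
def build_export_command(checkpoint, starter, output, convert_to_int=False):
    """Return the lstmtraining argument list that exports a traineddata file."""
    cmd = ["lstmtraining", "--stop_training"]
    if convert_to_int:
        # Integer model, similar to tessdata_fast.
        cmd.append("--convert_to_int")
    cmd += ["--continue_from", checkpoint,
            "--traineddata", starter,
            "--model_output", output]
    return cmd


CHECKPOINT = r"d:\tesseract\output\output_checkpoint"
STARTER = r"d:\tesseract\chi_sim\chi_sim.traineddata"

# Hypothetical output names, chosen only so the two files do not collide.
float_cmd = build_export_command(
    CHECKPOINT, STARTER, r"d:\tesseract\output\chi_sim_float.traineddata")
int_cmd = build_export_command(
    CHECKPOINT, STARTER, r"d:\tesseract\output\chi_sim_int.traineddata",
    convert_to_int=True)

for cmd in (float_cmd, int_cmd):
    print(" ".join(cmd))
```

Whichever file is used for OCR then has to be copied into the tessdata folder
under the name passed to -l, for example my_chi_sim.traineddata.
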
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion visit
> https://groups.google.com/d/msgid/tesseract-ocr/4f54b4ff-f1f4-4e44-9e49-11a70b759d68n%40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/4f54b4ff-f1f4-4e44-9e49-11a70b759d68n%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
>

