Hi,

*I get some errors when I follow this tutorial to retrain Tesseract:*

I followed this guide to retrain Tesseract with my own image dataset (I am 
training on real images, not on text rendered by tesstrain.sh):
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#creating-starter-traineddata

These are my steps to retrain the Tesseract LSTM model:


*Step 1: I create my training data (.tif image + .box file) from my images.*
I generated the box files with this command: tesseract 
[lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop 
makebox
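
For more than one image, the Step 1 command can be wrapped in a loop. This is only a minimal sketch: make_boxes is a hypothetical helper name (not part of Tesseract), and it assumes that all .tif files sit in one directory and that an existing .box file should never be overwritten, so hand-edited boxes are kept.

```shell
# Sketch: run the Step 1 makebox command for every .tif in a directory.
# make_boxes is a hypothetical helper name, not a Tesseract tool.
make_boxes() {
  dir=${1:-.}
  for tif in "$dir"/*.tif; do
    [ -e "$tif" ] || continue            # the glob matched nothing
    base=${tif%.tif}                     # output base name without extension
    if [ -e "$base.box" ]; then
      # Assumption: an existing box file may be hand-edited, so keep it.
      echo "skip (box exists): $base.box" >&2
      continue
    fi
    tesseract "$tif" "$base" batch.nochop makebox
  done
}

# Usage: make_boxes path/to/images
```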


*Step 2: I manually edit the box files with Qt-box-editor.* (I followed this link: 
https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-Make-Box-Files
)
So now I have these files:
.tif file
.box file
.lstmf file (generated by the command: tesseract [lang].[fontname].exp[num].tif 
[lang].[fontname].exp[num] lstm.train)
unicharset file
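
The lstm.train command above can be run for every image and its result checked, because a missing or empty .lstmf file is easy to overlook. A minimal sketch, assuming each .tif has its .box file next to it under the same base name; make_lstmf is a hypothetical helper name, not part of Tesseract.

```shell
# Sketch: generate one .lstmf per .tif/.box pair and report problems.
# make_lstmf is a hypothetical helper name, not a Tesseract tool.
make_lstmf() {
  dir=${1:-.}
  for tif in "$dir"/*.tif; do
    [ -e "$tif" ] || continue            # the glob matched nothing
    base=${tif%.tif}
    if [ ! -s "$base.box" ]; then
      echo "missing or empty box file: $base.box" >&2
      continue
    fi
    # Same command as in Step 2 of the post.
    tesseract "$tif" "$base" lstm.train
    # Warn when no usable training file came out of the run.
    [ -s "$base.lstmf" ] || echo "no .lstmf produced for: $tif" >&2
  done
}

# Usage: make_lstmf path/to/images
```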


*Step 3: I create the starter .traineddata with this command:*
combine_lang_model --input_unicharset unicharset --script_dir langdata 
--output_dir output --lang "eng"
I downloaded langdata from here: 
https://github.com/tesseract-ocr/langdata


*Step 4: I extract the existing LSTM model from the existing traineddata with this command:*
combine_tessdata -e /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata 
eng.lstm


*Step 5: I retrain Tesseract* (Fine Tuning for ± a few characters: 
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters)
with this command:
lstmtraining --model_output output_model --continue_from eng.lstm 
--traineddata output_basic/eng/eng.traineddata --old_traineddata /usr/share 
tesseract-ocr/4.00/tessdata/eng.traineddata --train_listfile 
eng.training_files.txt --debug_interval -1 --max_iterations 400
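
Since the command refers to several files produced in earlier steps, it may help to confirm that each of them exists and is not empty before starting lstmtraining. A minimal sketch; check_inputs is a hypothetical helper name, and which paths to pass is up to the caller.

```shell
# Sketch: verify that every file given as an argument exists and is non-empty.
# check_inputs is a hypothetical helper name, not a Tesseract tool.
check_inputs() {
  missing=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      missing=1
    fi
  done
  return "$missing"                      # 0 only when all files were found
}

# Usage, with the paths exactly as they appear in the Step 5 command:
#   check_inputs eng.lstm output_basic/eng/eng.traineddata eng.training_files.txt
```

The function only tests for presence; it does not check that the files are valid Tesseract data.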

   - This is the format of my eng.training_files.txt:
   path/to/lstmf
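
A list file in this format (one .lstmf path per line) can be generated instead of typed by hand. A minimal sketch, assuming all .lstmf files are in one directory; write_listfile is a hypothetical helper name. Writing absolute paths and skipping empty files are choices of this sketch, made so that the list does not depend on the working directory.

```shell
# Sketch: write one absolute .lstmf path per line into a list file.
# write_listfile is a hypothetical helper name, not a Tesseract tool.
write_listfile() {
  dir=$1
  out=$2
  : > "$out"                             # start with an empty list
  for f in "$dir"/*.lstmf; do
    [ -s "$f" ] || continue              # skip missing or empty files
    abs_dir=$(cd "$(dirname "$f")" && pwd)
    printf '%s\n' "$abs_dir/$(basename "$f")" >> "$out"
  done
  [ -s "$out" ]                          # fail when nothing was listed
}

# Usage: write_listfile path/to/images eng.training_files.txt
```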

*I get an error like the following:*

[image: Screenshot from 2018-10-19 21-49-00.png]
*This is an example of my training images:*
[image: eng.centurygothic.exp0.png]





*I am trying to retrain Tesseract on real images (not on text rendered by 
tesstrain.sh).*

Please share any ideas you have on how to fix this.


Thank you in advance!

