I want my application to be able to recognize characters like 'Φ'.

On Saturday, October 20, 2018 at 00:56:01 UTC+7, tu tonquang wrote:
>
> Hi,
>
> *I get some errors when I follow this tutorial to retrain Tesseract:*
>
> I followed this link to retrain Tesseract with my own image dataset (I am
> retraining with real images, not from a text file via tesstrain.sh):
>
> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#creating-starter-traineddata
>
> These are my steps to retrain the Tesseract LSTM model:
>
> *Step 1: I create my training data (.tif image + .box file) from my images.*
> I generated the box files with this command line:
> tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox
>
> *Step 2: I manually edit the box files with Qt-box-editor* (following this link:
> https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract-%E2%80%93-Make-Box-Files).
> So now I have these files:
> .tif file
> .box file
> .lstmf file (generated by the command:
> tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] lstm.train)
> unicharset file
>
> *Step 3: I create the starter .traineddata with this command:*
> combine_lang_model --input_unicharset unicharset --script_dir langdata --output_dir output --lang "eng"
> The langdata directory is downloaded from here:
> https://github.com/tesseract-ocr/langdata
>
> *Step 4: I extract the existing model from the existing traineddata with this command:*
> combine_tessdata -e /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata eng.lstm
>
> *Step 5: I retrain Tesseract* (Fine Tuning for ± a few characters:
> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters)
> with this command:
> lstmtraining --model_output output_model --continue_from eng.lstm --traineddata output_basic/eng/eng.traineddata --old_traineddata /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata --train_listfile eng.training_files.txt --debug_interval -1 --max_iterations 400
>
> - This is the format of my eng.training_files.txt:
> path/to/lstmf
>
> *I get an error like the following:*
>
> [image: Screenshot from 2018-10-19 21-49-00.png]
>
> *This is an example of my training images:*
>
> [image: eng.centurygothic.exp0.png]
>
> *I am trying to retrain Tesseract from real images (not from a text file via
> tesstrain.sh).*
>
> Please let me know if you have any idea how to fix it.
>
> Thank you in advance!
>
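
The list file passed to --train_listfile in Step 5 holds one .lstmf path per line. Below is a minimal sketch of how such a file could be generated, assuming all .lstmf files sit in a single directory. The function name write_training_list and its two arguments are made up for illustration and are not part of Tesseract.

```shell
#!/usr/bin/env bash
# Sketch only: write one .lstmf path per line into a training list file.
# write_training_list is a hypothetical helper, not a Tesseract tool.
write_training_list() {
    local lstmf_dir="$1"   # directory holding the .lstmf files (assumed layout)
    local list_file="$2"   # e.g. eng.training_files.txt

    : > "$list_file"       # start from an empty list file
    local f
    for f in "$lstmf_dir"/*.lstmf; do
        # When nothing matches, the glob stays unexpanded; skip it.
        [ -e "$f" ] || continue
        printf '%s\n' "$f" >> "$list_file"
    done

    # An empty list would give the trainer nothing to read, so report it.
    if [ ! -s "$list_file" ]; then
        echo "no .lstmf files found in $lstmf_dir" >&2
        return 1
    fi
}
```

It could be called as `write_training_list ./train eng.training_files.txt`, where ./train is a placeholder directory. The paths are written exactly as the glob produces them, so a relative directory gives relative paths, and the training command would then have to be run from the same working directory.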