If you have images of the cards with the corresponding text, you could train Tesseract directly on the cropped and cleaned text regions.
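
A minimal sketch of how such image/text pairs might be checked before training. It assumes each cropped card-name image NAME.png sits next to a one-line transcription NAME.gt.txt (the pairing that, as far as I know, the tesstrain Makefile workflow expects). The function name and the directory in the usage line are hypothetical, not something from this thread.

```shell
#!/bin/bash
# Print every line image that has no usable ground-truth transcription.
# Assumption: NAME.png is paired with NAME.gt.txt in the same directory.
missing_ground_truth() {
    local dir="$1" img
    for img in "$dir"/*.png; do
        # If the directory holds no .png files the glob stays literal; skip it.
        [ -e "$img" ] || continue
        # -s is true only when the file exists and is non-empty.
        [ -s "${img%.png}.gt.txt" ] || printf '%s\n' "$img"
    done
}

# Usage (hypothetical directory):
#   missing_ground_truth ~/cards-ground-truth
```

An empty output means every image has a transcription. Any path that is printed needs a .gt.txt file before the pair can be used for training.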

On Wed, 30 Jan 2019 at 15:41, Daniel Ferenc <voo...@gmail.com> wrote:

> So, I have figured out what I was doing wrong:
>
> - I am using the tesseract packages I got from apt on Ubuntu 18.04 LTS and
>   they were obviously missing some langdata, which I downloaded from the
>   repository
> - There was also a need to get the Latin.unicharset file
> - And finally I didn't notice an error in one of the late steps that
>   said radical-stroke.txt is missing, and that resulted in traineddata not
>   getting generated for my tesstrain.sh script run
> - And since the last step required the traineddata and I didn't have one,
>   I used the eng.traineddata which came with the package,
>   and it all resulted in very poor recognition performance
>
> At this moment I'm running the training with a wordlist of ~13600 possible
> words that can appear, with ~100 fonts that can be used... Waiting for
> 175000 iterations to finish because at 150k I still had an error rate of ~2.4
>
> (I'm creating a piece of software that should recognize Magic: the
> Gathering card names. I have a database of all currently existing cards
> (English ones) and created a word list of unique words that can appear in
> their names, and am training tesseract with these words with all the possible
> fonts that were ever used for these cards. I will let you know how this
> worked out once the training is done.)
>
> Thank you for your support.
>
> On Tuesday, January 29, 2019 at 6:40:14 PM UTC+1, shree wrote:
>>
>> Finetune with your specific font - see e.g. below, which uses the Impact font.
>>
>> #!/bin/bash
>>
>> # Generate training data (Impact font, standard training text).
>> time ~/tesseract/src/training/tesstrain.sh \
>>   --fonts_dir /usr/share/fonts \
>>   --lang eng --linedata_only \
>>   --noextract_font_properties \
>>   --langdata_dir ~/langdata \
>>   --tessdata_dir ~/tessdata \
>>   --fontlist "Impact Condensed" \
>>   --training_text ~/langdata/eng/eng.training_text \
>>   --workspace_dir ~/tmp/ \
>>   --save_box_tiff \
>>   --output_dir ~/tesstutorial/engtrainfont
>>
>> # Generate evaluation data (same font, custom word list).
>> time ~/tesseract/src/training/tesstrain.sh \
>>   --fonts_dir /usr/share/fonts \
>>   --lang eng --linedata_only \
>>   --noextract_font_properties \
>>   --langdata_dir ~/langdata \
>>   --tessdata_dir ~/tessdata \
>>   --fontlist "Impact Condensed" \
>>   --training_text ~/langdata/eng/eng.mywordlist.training_text \
>>   --workspace_dir ~/tmp/ \
>>   --save_box_tiff \
>>   --output_dir ~/tesstutorial/engevalwordlist
>>
>> # https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact
>>
>> echo -e "\n ****** Finetune one of the fully-trained existing models: ***********"
>>
>> mkdir -p ~/tesstutorial/impact_from_full
>>
>> # Extract the LSTM model from the existing traineddata.
>> combine_tessdata -e ~/tessdata_best/eng.traineddata \
>>   ~/tesstutorial/impact_from_full/eng.lstm
>>
>> time ~/tesseract/src/training/lstmtraining \
>>   --model_output ~/tesstutorial/impact_from_full/impact \
>>   --continue_from ~/tesstutorial/impact_from_full/eng.lstm \
>>   --traineddata ~/tessdata_best/eng.traineddata \
>>   --train_listfile ~/tesstutorial/engtrainfont/eng.training_files.txt \
>>   --debug_interval -1 \
>>   --max_iterations 400
>>
>> echo -e "\n*********** eval on training data ******\n"
>>
>> time ~/tesseract/src/training/lstmeval \
>>   --model ~/tesstutorial/impact_from_full/impact_checkpoint \
>>   --traineddata ~/tessdata_best/eng.traineddata \
>>   --eval_listfile ~/tesstutorial/engtrainfont/eng.training_files.txt
>>
>> echo -e "\n*********** eval on eval data ******\n"
>>
>> time ~/tesseract/src/training/lstmeval \
>>   --model ~/tesstutorial/impact_from_full/impact_checkpoint \
>>   --traineddata ~/tessdata_best/eng.traineddata \
>>   --eval_listfile ~/tesstutorial/engevalwordlist/eng.training_files.txt
>>
>> echo -e "\n*********** convert to traineddata ******\n"
>>
>> time ~/tesseract/src/training/lstmtraining \
>>   --stop_training \
>>   --continue_from ~/tesstutorial/impact_from_full/impact_checkpoint \
>>   --traineddata ~/tessdata_best/eng.traineddata \
>>   --model_output ~/tesstutorial/engtrainfont/eng.traineddata
>>
>>
>> On Mon, Jan 28, 2019 at 9:37 PM Daniel Ferenc <voo...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I need to train Tesseract for only a specific wordlist (about 13600
>>> words) and one specific font. I tried following the training tutorial on
>>> the Wiki but I'm not sure if I'm doing anything wrong - the traineddata
>>> file is about 7 megabytes and I combined it with the eng.traineddata to get
>>> any traineddata file, because after finishing the training I had no
>>> traineddata file at all. Can anyone please help me?
>>>
>>
>> --
>>
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAMgOLLwkJMsDpXJcYp33qSFHajqP5hz8LOm3h0xCyE1OpvhY7Q%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.