Re: [tesseract-ocr] Training for a specific wordlist and font

Daniel Ferenc Wed, 30 Jan 2019 07:10:35 -0800

Oh, and one more thing - the same card with the same name can appear in 
different editions of Magic, so recognizing the name alone is not enough. I'm 
also training my software to recognize the edition of the card by other 
means, so the two in combination should be sufficient.


On Wednesday, January 30, 2019 at 4:08:37 PM UTC+1, Daniel Ferenc wrote:
>
> I'm not sure exactly how I would set that up (regarding tesseract training), 
> but there are about 44000 (English) cards at this time and a high-resolution 
> image of each is about 2 MB (at least from the source I can 
> get them from). Also, not every card has the same format, so a generic crop 
> function would not work. It would be fine for over 90% of the cards, but 
> the rest would cause issues. It's easier for me to teach tesseract 
> this way and then have the software try different rotations/crops if the 
> default one doesn't return anything meaningful from OCR. Just 
> preparing the images would be a massive task, while retrieving the word 
> list from the database took about 20 seconds, downloading the fonts took a 
> minute, and training takes ~4 hours, for a result that will hopefully be good enough.
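
A minimal sketch of the "try different rotations" fallback described above. It assumes ImageMagick's `convert` and the `tesseract` command-line tool are installed, that the input image is already cropped to the card title, and that "meaningful" can be simplified to "non-blank output". The function name `ocr_with_rotations` is made up for illustration.

```shell
#!/bin/bash
# Try OCR at several rotations and print the first non-blank result.
# Assumptions: `convert` (ImageMagick) and `tesseract` are on PATH, and
# "meaningful" is simplified here to "the OCR output is not blank".
ocr_with_rotations() {
  local img="$1" angle tmp text
  tmp="$(mktemp --suffix=.png)" || return 1
  for angle in 0 90 180 270; do
    # Write a rotated copy; skip this angle if the conversion fails.
    convert "$img" -rotate "$angle" "$tmp" || continue
    # --psm 7 treats the image as a single text line.
    # Strip the trailing form feed and blank lines from the output.
    text="$(tesseract "$tmp" stdout --psm 7 -l eng 2>/dev/null \
      | tr -d '\f' | sed '/^[[:space:]]*$/d')"
    if [ -n "$text" ]; then
      rm -f "$tmp"
      printf '%s\n' "$text"
      return 0
    fi
  done
  rm -f "$tmp"
  return 1
}
```

Calling `ocr_with_rotations title.png` (the file name is hypothetical) prints the first recognized text, or returns a non-zero status if no rotation produced any. A stricter check, such as matching the output against the list of known card names, could replace the non-blank test.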
>
> On Wednesday, January 30, 2019 at 3:53:43 PM UTC+1, Lorenzo Blz wrote:
>>
>>
>> If you have images of the cards with the corresponding text you could 
>> train it on the cropped/cleaned text directly.
>>
>> On Wed, 30 Jan 2019 at 15:41, Daniel Ferenc <voo...@gmail.com> 
>> wrote:
>>
>>> So, I have figured out what I was doing wrong:
>>>
>>> - I am using the tesseract packages from apt on Ubuntu 18.04 LTS, and 
>>> they were missing some langdata, which I downloaded from the 
>>> repository.
>>> - I also needed to get the Latin.unicharset file.
>>> - I didn't notice an error in one of the late steps saying that 
>>> radical-stroke.txt was missing, so my tesstrain.sh run did not 
>>> generate a traineddata file.
>>> - Since the last step requires a traineddata file and I didn't have 
>>> one, I used the eng.traineddata that came with the 
>>> package, which resulted in very poor recognition performance.
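
A small pre-flight check along these lines can catch the missing files before a long tesstrain.sh run. This is only a sketch: it assumes that Latin.unicharset and radical-stroke.txt sit directly under the langdata directory, and the function name `check_langdata` is made up for illustration.

```shell
#!/bin/bash
# Report the support files that were missing in this thread.
# Assumption: both files live directly under the langdata directory.
check_langdata() {
  local langdata_dir="$1" missing=0 f
  for f in Latin.unicharset radical-stroke.txt; do
    if [ ! -f "$langdata_dir/$f" ]; then
      echo "missing: $langdata_dir/$f" >&2
      missing=1
    fi
  done
  # Exit status 0 means both files were found, 1 means at least one is absent.
  return "$missing"
}
```

For example, `check_langdata ~/langdata && ./run_training.sh` (the script name is hypothetical) only starts the training when both files are present.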
>>>
>>> At the moment I'm running the training with a word list of 
>>> ~13600 possible words and the ~100 fonts that can be used. I'm waiting 
>>> for 175000 iterations to finish, because at 150k I still had an error rate of 
>>> ~2.4.
>>>
>>> (I'm creating a piece of software that should recognize Magic: the 
>>> Gathering card names. I have a database of all currently existing 
>>> English cards and created a list of the unique words that can appear in 
>>> their names. I am training tesseract on these words with all the 
>>> fonts that were ever used for these cards. I will let you know how this 
>>> works out once the training is done.)
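
For reference, a unique word list like the one described above can be produced with standard tools. This is a sketch, not the exact procedure used here: it assumes the card names have been exported from the database as plain text, one name per line, and the function name `build_wordlist` is made up for illustration. Punctuation stays attached to the words.

```shell
#!/bin/bash
# Build a unique word list from card names read on stdin (one name per line).
# Assumption: the names were exported from the card database as plain text.
build_wordlist() {
  # Put each whitespace-separated word on its own line, drop empty lines,
  # then sort and de-duplicate (byte order, so the result is locale-independent).
  tr -s '[:space:]' '\n' | sed '/^$/d' | LC_ALL=C sort -u
}
```

Usage would look like `build_wordlist < card_names.txt > ~/langdata/eng/eng.mywordlist.training_text`, where card_names.txt is a hypothetical export file. Whether one word per line is the best layout for a training text is not something this thread settles.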
>>>
>>> Thank you for your support.
>>>
>>> On Tuesday, January 29, 2019 at 6:40:14 PM UTC+1, shree wrote:
>>>>
>>>> Fine-tune with your specific font; see the example below, which uses the Impact font.
>>>>
>>>> #!/bin/bash
>>>>
>>>> time ~/tesseract/src/training/tesstrain.sh \
>>>>   --fonts_dir /usr/share/fonts \
>>>>   --lang eng --linedata_only \
>>>>   --noextract_font_properties \
>>>>   --langdata_dir ~/langdata \
>>>>   --tessdata_dir ~/tessdata \
>>>>   --fontlist "Impact Condensed" \
>>>>   --training_text ~/langdata/eng/eng.training_text \
>>>>   --workspace_dir ~/tmp/ \
>>>>   --save_box_tiff \
>>>>   --output_dir ~/tesstutorial/engtrainfont
>>>>   
>>>> time ~/tesseract/src/training/tesstrain.sh \
>>>>   --fonts_dir /usr/share/fonts \
>>>>   --lang eng --linedata_only \
>>>>   --noextract_font_properties \
>>>>   --langdata_dir ~/langdata \
>>>>   --tessdata_dir ~/tessdata \
>>>>   --fontlist "Impact Condensed" \
>>>>   --training_text ~/langdata/eng/eng.mywordlist.training_text \
>>>>   --workspace_dir ~/tmp/ \
>>>>   --save_box_tiff \
>>>>   --output_dir ~/tesstutorial/engevalwordlist
>>>>   
>>>> # https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact
>>>>
>>>> echo -e "\n****** Finetune one of the fully-trained existing models: ***********\n"
>>>>
>>>> mkdir -p ~/tesstutorial/impact_from_full
>>>>
>>>> combine_tessdata -e ~/tessdata_best/eng.traineddata \
>>>>   ~/tesstutorial/impact_from_full/eng.lstm
>>>>   
>>>> time ~/tesseract/src/training/lstmtraining \
>>>>   --model_output ~/tesstutorial/impact_from_full/impact \
>>>>   --continue_from ~/tesstutorial/impact_from_full/eng.lstm \
>>>>   --traineddata ~/tessdata_best/eng.traineddata \
>>>>   --train_listfile ~/tesstutorial/engtrainfont/eng.training_files.txt \
>>>>   --debug_interval -1 \
>>>>   --max_iterations 400
>>>>   
>>>> echo -e "\n*********** eval on training data ******\n"
>>>>
>>>> time ~/tesseract/src/training/lstmeval \
>>>>   --model ~/tesstutorial/impact_from_full/impact_checkpoint \
>>>>   --traineddata ~/tessdata_best/eng.traineddata \
>>>>   --eval_listfile ~/tesstutorial/engtrainfont/eng.training_files.txt
>>>>   
>>>> echo -e "\n***********eval on eval data ******\n"
>>>>   
>>>> time ~/tesseract/src/training/lstmeval \
>>>>   --model ~/tesstutorial/impact_from_full/impact_checkpoint \
>>>>   --traineddata ~/tessdata_best/eng.traineddata \
>>>>   --eval_listfile ~/tesstutorial/engevalwordlist/eng.training_files.txt
>>>>   
>>>> echo -e "\n*********** convert to traineddata  ******\n"
>>>>
>>>> time ~/tesseract/src/training/lstmtraining \
>>>>   --stop_training \
>>>>   --continue_from ~/tesstutorial/impact_from_full/impact_checkpoint \
>>>>   --traineddata ~/tessdata_best/eng.traineddata \
>>>>   --model_output ~/tesstutorial/engtrainfont/eng.traineddata
>>>>
>>>>
>>>> On Mon, Jan 28, 2019 at 9:37 PM Daniel Ferenc <voo...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I need to train Tesseract for only a specific word list (about 13600 
>>>>> words) and one specific font. I tried following the training tutorial on 
>>>>> the Wiki, but I'm not sure whether I'm doing something wrong: the traineddata 
>>>>> file is about 7 megabytes, and I combined it with eng.traineddata to 
>>>>> get any traineddata file, because after finishing the training I had no 
>>>>> traineddata file at all. Can anyone please help me?
>>>>>
>>>>> -- 
>>>>> You received this message because you are subscribed to the Google 
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to tesseract-oc...@googlegroups.com.
>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit 
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/1909bad8-d28d-4660-812d-47d0310e67c2%40googlegroups.com
>>>>>  
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/1909bad8-d28d-4660-812d-47d0310e67c2%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>>
>>>> -- 
>>>>
>>>> ____________________________________________________________
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>>
>>
