Re: [tesseract-ocr] Training for a specific wordlist and font

Daniel Ferenc Thu, 31 Jan 2019 02:35:27 -0800

Is there a guide somewhere on how to set up training like this? How do I pair 
the images and text, etc.? And thank you for the insight, it really is helpful.


On Thursday, January 31, 2019 at 11:18:35 AM UTC+1, Lorenzo Blz wrote:
>
> Yes, generating text is faster and easier.
>
> But the real extracted and cleaned text you are going to eventually 
> recognize is going to be different from this, more or less depending on a 
> lot of factors:
> - how similar your training font actually is
> - how good your cleanup will be (test this in advance)
> - difference in text size, border, rotations, shearing from the generated 
> text (for example you train with 0px border and later provide text with 4px 
> border).
>
> Using the real data, in general, should be better, unless you have very 
> little data.
>
> If the real images differ from the generated ones you may try to add some 
> corruption mimicking the real one before the training: noise, perspective 
> deformations, small rotations, etc.
>
> And/or you can try to mix real and generated samples in the training.
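
A minimal sketch of the mixing idea above, assuming the real and generated samples have each already been converted to .lstmf files and listed one path per line, in the same form as the eng.training_files.txt that tesstrain.sh writes. The names real.txt, generated.txt, mixed.txt and the function mix_lists are hypothetical, and the paths in the demo are placeholders. The shuffle is only there so that neither source ends up clustered at one end of the list; this thread does not establish whether order matters to lstmtraining.

```shell
#!/bin/bash
# Sketch: merge two lstmtraining list files into one shuffled list.
# All file names here are hypothetical; each list holds one .lstmf path per line.
set -euo pipefail

mix_lists() {
  # $1 = list of real samples, $2 = list of generated samples, $3 = output list
  # sort -u drops duplicate paths; shuf (GNU coreutils) randomizes the order.
  cat "$1" "$2" | sort -u | shuf > "$3"
}

# Demo with placeholder paths in a temporary directory.
demo_dir=$(mktemp -d)
printf '%s\n' real/card_0001.lstmf real/card_0002.lstmf > "$demo_dir/real.txt"
printf '%s\n' gen/eng.Impact_Condensed.exp0.lstmf gen/eng.Impact_Condensed.exp1.lstmf \
  > "$demo_dir/generated.txt"

mix_lists "$demo_dir/real.txt" "$demo_dir/generated.txt" "$demo_dir/mixed.txt"

# Show how many samples ended up in the combined list.
wc -l < "$demo_dir/mixed.txt"
```

The combined list could then be given to lstmtraining through --train_listfile in place of a single-source list.
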
>
> You say 90% of the samples are easy to process: these can be enough if you 
> can isolate these easily. Consider that real life samples will not be much 
> better than these (I suppose).
>
> About the rotations: you can do perspective correction with OpenCV's 
> findHomography or with Hough lines.
>
> I realize this is A LOT of work as I'm doing this right now.
>
> If you have time, try different ways and see what works best.
>
>
>
> Bye
>
> Lorenzo
>
>
> On Wed, 30 Jan 2019 at 16:08, Daniel Ferenc <voo...@gmail.com> wrote:
>
>> I'm not sure exactly how I would set that up (regarding tesseract 
>> training), but there are about 44000 (English) cards at this time and a 
>> high-resolution image of each is about 2 MB (at least from the source I can 
>> get them from). Also, not every card has the same format, so a generic crop 
>> function would not work. Over 90% of the cards would be OK like this but 
>> the rest would cause issues. It's easier for me to try to teach tesseract 
>> this way and then have the software try different rotations/crops if the 
>> default one doesn't return anything meaningful in terms of OCR. Just 
>> preparing the images for this is a massive task, while retrieving the word 
>> list from the database took about 20 seconds, downloading the fonts took a 
>> minute, and training took ~4 hours, for a result that will hopefully be 
>> good enough.
>>
>> On Wednesday, January 30, 2019 at 3:53:43 PM UTC+1, Lorenzo Blz wrote:
>>>
>>>
>>> If you have images of the cards with the corresponding text you could 
>>> train it on the cropped/cleaned text directly.
>>>
>>> On Wed, 30 Jan 2019 at 15:41, Daniel Ferenc <voo...@gmail.com> wrote:
>>>
>>>> So, I have figured out what I was doing wrong:
>>>>
>>>> - I am using the tesseract packages I got from apt on Ubuntu 18.04 LTS, 
>>>> and they were evidently missing some langdata, which I downloaded from 
>>>> the repository
>>>> - I also needed to get the Latin.unicharset file
>>>> - Finally, I didn't notice an error in one of the late steps saying that 
>>>> radical-stroke.txt was missing, which resulted in no traineddata being 
>>>> generated by my tesstrain.sh run
>>>> - Since the last step required a traineddata file and I didn't have one, 
>>>> I used the eng.traineddata that came with the package, and it all 
>>>> resulted in very poor recognition performance
>>>>
>>>> At this moment I'm running the training with a wordlist of ~13600 
>>>> possible words that can appear, with ~100 fonts that can be used... 
>>>> Waiting for 175000 iterations to finish because at 150k I still had an 
>>>> error rate of ~2.4
>>>>
>>>> (I'm creating a piece of software that should recognize Magic: the 
>>>> Gathering card names. I have a database of all currently existing 
>>>> (English) cards, created a word list of the unique words that can appear 
>>>> in their names, and am training tesseract on these words with all the 
>>>> fonts that were ever used for these cards. I will let you know how this 
>>>> worked out once the training is done.)
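
A rough sketch of how such a word list and a training text might be derived from a plain list of card names. Everything named here is an assumption for illustration: the input file names.txt (one card name per line), the helper functions unique_words and training_lines, and the choice of three words per line. How the actual training text was laid out is not stated in the thread; the output file name simply mirrors the eng.mywordlist.training_text used in the script further down.

```shell
#!/bin/bash
# Sketch: derive a unique word list and a training text from card names.
# File names, function names, and sample data are hypothetical.
set -euo pipefail

# $1 = file with one card name per line; prints one unique word per line.
# Punctuation stays attached to words, since it is printed that way on a card.
unique_words() {
  tr -s '[:space:]' '\n' < "$1" | sed '/^$/d' | LC_ALL=C sort -u
}

# $1 = word list, $2 = words per line; prints the words in random order,
# grouped into space-separated lines of at most $2 words.
training_lines() {
  shuf "$1" | awk -v n="$2" '
    { line = (line == "" ? $0 : line " " $0) }
    NR % n == 0 { print line; line = "" }
    END { if (line != "") print line }'
}

# Demo on a few sample names in a temporary directory.
demo_dir=$(mktemp -d)
printf '%s\n' "Lightning Bolt" "Lightning Helix" "Urza's Tower" "Bolt of Keranos" \
  > "$demo_dir/names.txt"

unique_words "$demo_dir/names.txt" > "$demo_dir/wordlist.txt"
training_lines "$demo_dir/wordlist.txt" 3 > "$demo_dir/eng.mywordlist.training_text"

cat "$demo_dir/wordlist.txt"
```

The resulting text file is the kind of input that tesstrain.sh accepts through --training_text.
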
>>>>
>>>> Thank you for your support.
>>>>
>>>> On Tuesday, January 29, 2019 at 6:40:14 PM UTC+1, shree wrote:
>>>>>
>>>>> Finetune with your specific font - see e.g. the script below, which uses 
>>>>> the Impact font.
>>>>>
>>>>> #!/bin/bash
>>>>>
>>>>> time ~/tesseract/src/training/tesstrain.sh \
>>>>>   --fonts_dir /usr/share/fonts \
>>>>>   --lang eng --linedata_only \
>>>>>   --noextract_font_properties \
>>>>>   --langdata_dir ~/langdata \
>>>>>   --tessdata_dir ~/tessdata \
>>>>>   --fontlist "Impact Condensed" \
>>>>>   --training_text ~/langdata/eng/eng.training_text \
>>>>>   --workspace_dir ~/tmp/ \
>>>>>   --save_box_tiff \
>>>>>   --output_dir ~/tesstutorial/engtrainfont
>>>>>   
>>>>> time ~/tesseract/src/training/tesstrain.sh \
>>>>>   --fonts_dir /usr/share/fonts \
>>>>>   --lang eng --linedata_only \
>>>>>   --noextract_font_properties \
>>>>>   --langdata_dir ~/langdata \
>>>>>   --tessdata_dir ~/tessdata \
>>>>>   --fontlist "Impact Condensed" \
>>>>>   --training_text ~/langdata/eng/eng.mywordlist.training_text \
>>>>>   --workspace_dir ~/tmp/ \
>>>>>   --save_box_tiff \
>>>>>   --output_dir ~/tesstutorial/engevalwordlist
>>>>>   
>>>>> # https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for-impact
>>>>>
>>>>> echo -e "\n****** Finetune one of the fully-trained existing models: ******\n"
>>>>>
>>>>> mkdir -p ~/tesstutorial/impact_from_full
>>>>>
>>>>> combine_tessdata -e ~/tessdata_best/eng.traineddata \
>>>>>   ~/tesstutorial/impact_from_full/eng.lstm
>>>>>   
>>>>> time ~/tesseract/src/training/lstmtraining \
>>>>>   --model_output ~/tesstutorial/impact_from_full/impact \
>>>>>   --continue_from ~/tesstutorial/impact_from_full/eng.lstm \
>>>>>   --traineddata ~/tessdata_best/eng.traineddata \
>>>>>   --train_listfile ~/tesstutorial/engtrainfont/eng.training_files.txt \
>>>>>   --debug_interval -1 \
>>>>>   --max_iterations 400
>>>>>   
>>>>> echo -e "\n*********** eval on training data ******\n"
>>>>>
>>>>> time ~/tesseract/src/training/lstmeval \
>>>>>   --model ~/tesstutorial/impact_from_full/impact_checkpoint \
>>>>>   --traineddata ~/tessdata_best/eng.traineddata \
>>>>>   --eval_listfile ~/tesstutorial/engtrainfont/eng.training_files.txt
>>>>>   
>>>>> echo -e "\n***********eval on eval data ******\n"
>>>>>   
>>>>> time ~/tesseract/src/training/lstmeval \
>>>>>   --model ~/tesstutorial/impact_from_full/impact_checkpoint \
>>>>>   --traineddata ~/tessdata_best/eng.traineddata \
>>>>>   --eval_listfile ~/tesstutorial/engevalwordlist/eng.training_files.txt
>>>>>   
>>>>> echo -e "\n*********** convert to traineddata  ******\n"
>>>>>
>>>>> time ~/tesseract/src/training/lstmtraining \
>>>>>   --stop_training \
>>>>>   --continue_from ~/tesstutorial/impact_from_full/impact_checkpoint \
>>>>>   --traineddata ~/tessdata_best/eng.traineddata \
>>>>>   --model_output ~/tesstutorial/engtrainfont/eng.traineddata
>>>>>
>>>>>
>>>>> On Mon, Jan 28, 2019 at 9:37 PM Daniel Ferenc <voo...@gmail.com> 
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I need to train Tesseract for only a specific wordlist (about 13600 
>>>>>> words) and one specific font. I tried following the training tutorial on 
>>>>>> the Wiki but I'm not sure whether I'm doing something wrong: the 
>>>>>> traineddata file is about 7 megabytes, and I combined it with 
>>>>>> eng.traineddata just to get a traineddata file at all, because after 
>>>>>> finishing the training I had no traineddata file. Can anyone please 
>>>>>> help me?
>>>>>>
>>>>>> -- 
>>>>>> You received this message because you are subscribed to the Google 
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>> To view this discussion on the web visit 
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/1909bad8-d28d-4660-812d-47d0310e67c2%40googlegroups.com
>>>>>>  
>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/1909bad8-d28d-4660-812d-47d0310e67c2%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>> .
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>
>>>>>
>>>>>
>>>>> -- 
>>>>>
>>>>> ____________________________________________________________
>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>
>

