Re: [tesseract-ocr] Fine-tuning via tesstrain repo gives me poorer results than built-in eng model

Shree Devi Kumar Sat, 19 Sep 2020 02:08:58 -0700

Please share your training data so that we can test. Thanks.



On Sat, Sep 19, 2020 at 11:01 AM Gradalajage <kes...@gmail.com> wrote:

> I have 395 PNG files depicting numbers with commas. The images are 130x54
> pixels and are black text on white background. Here is an example of an
> image showing the number 638,997:
> [image: 638,997.png]
> I would like to use Tesseract to perform reliable OCR on these images and
> others like them. Out-of-the-box, Tesseract correctly extracts text for 344
> of these images, and fails in some manner on 51 of them. I am using the
> following command line for each image:
>
> > tesseract --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' {filename}.png out
>
> I run that command on each image, substituting {filename} as needed. Each
> invocation of that command produces the following output:
>
> Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
> Warning: Invalid resolution 0 dpi. Using 70 instead.
>
> 344/395 is an 87% success rate, but I want to try for better. So, I am
> attempting to "fine-tune" Tesseract by running through the instructions for
> tesstrain at https://github.com/tesseract-ocr/tesstrain. Each of my PNG
> files has a file name that indicates its ground truth, and I have a little
> script that generates ground-truth TXT files from the PNG file names. I
> have chosen "swtor" as the model name. I can then run this command from the
> tesstrain root directory:
>
> $ make training MODEL_NAME=swtor START_MODEL=eng PSM=7
>
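For reference, the ground-truth generation step described above might look roughly like the sketch below. It is not the poster's actual script: it assumes the stem of each file name (for example `638,997` in `638,997.png`) is exactly the text shown in the image, and it writes a `.gt.txt` file next to each PNG, which is the pairing tesstrain expects. The function name `write_ground_truth` is made up for illustration.

```python
from pathlib import Path


def write_ground_truth(image_dir: Path) -> int:
    """Write <stem>.gt.txt next to every PNG; return how many files were written."""
    count = 0
    for png in sorted(image_dir.glob("*.png")):
        # Assumption: the file stem is exactly the text shown in the image.
        text = png.stem
        # tesstrain pairs foo.png with foo.gt.txt, which holds one line of text.
        png.with_suffix(".gt.txt").write_text(text + "\n", encoding="utf-8")
        count += 1
    return count
```

With tesstrain's default layout and the model name from the post, this would be called as `write_ground_truth(Path("data/swtor-ground-truth"))`. It is worth opening a few of the generated files to confirm that each contains only the expected digits and commas.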
> This command runs, prints lots of info, and eventually produces the
> following output, just before it ends:
>
> Finished! Error rate = 2.739
> lstmtraining \
> --stop_training \
> --continue_from data/swtor/checkpoints/swtor_checkpoint \
> --traineddata data/swtor/swtor.traineddata \
> --model_output data/swtor.traineddata
> Loaded file data/swtor/checkpoints/swtor_checkpoint, unpacking...
>
> I can then take the resulting swtor.traineddata file, copy it to my
> tessdata directory, and then re-run my experiment from earlier, with a
> command line that looks like this:
>
> > tesseract -l swtor --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' {filename}.png out
>
> With the new swtor model, Tesseract correctly extracts text for 64 of
> these images, and fails in some manner on 331 of them.
> 64/395 is a 16% success rate, down from 87% for the eng model.
> So, the swtor model I trained does far worse, which I find surprising and
> unexpected. I think I might be doing something wrong but do not really know
> what next steps to take to continue troubleshooting this. I'm hoping to
> post here and get help from someone knowledgeable about the training
> process.
>
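The eng versus swtor comparison described above can be reproduced with a small harness along these lines. This is a sketch under assumptions, not the poster's method: it assumes `tesseract` is on the PATH, that the file stem is the expected text, and that an exact string match counts as success. The helper names `run_tesseract` and `score` are hypothetical. The options passed to Tesseract mirror the quoted command lines, with `stdout` used as the output base so the text can be captured directly.

```python
import subprocess
from pathlib import Path
from typing import Callable


def run_tesseract(image: Path, lang: str) -> str:
    """OCR one image with the same options as the quoted command lines."""
    result = subprocess.run(
        ["tesseract", str(image), "stdout", "-l", lang,
         "--psm", "7", "--oem", "1",
         "-c", "tessedit_char_whitelist=,0123456789"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def score(image_dir: Path, ocr: Callable[[Path], str]) -> tuple[int, int]:
    """Return (correct, total), treating the file stem as the expected text."""
    correct = total = 0
    for png in sorted(image_dir.glob("*.png")):
        total += 1
        got = ocr(png)
        if got == png.stem:
            correct += 1
        else:
            # Listing the mismatches shows how each failure differs from the truth.
            print(f"{png.name}: expected {png.stem!r}, got {got!r}")
    return correct, total
```

For example, `score(Path("images"), lambda p: run_tesseract(p, "swtor"))` would score the fine-tuned model, and swapping in `"eng"` gives the baseline (`images` is a placeholder directory). Looking at the printed mismatches, rather than only the totals, usually makes it clearer whether the model drops commas, merges digits, or returns empty output.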
> I can post the contents of the "data" directory in my tesstrain repo root
> directory if that is helpful for anyone (I'd have to remove the
> checkpoints).
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/a1ed5d91-6b2a-40c4-8eca-88cf6e7ebdd0n%40googlegroups.com
>


-- 

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


