Re: [tesseract-ocr] Fine-tuning via tesstrain repo gives me poorer results than built-in eng model

Shree Devi Kumar Sat, 10 Oct 2020 00:35:31 -0700

What command did you use?

Difficult to help without seeing what training data you used.


On Sat, Oct 10, 2020, 09:31 Fazle Rabbi <akafazlera...@gmail.com> wrote:

> Hi. I have a similar goal: fine-tuning the 'ben' traineddata with the
> pictures I am working with. The pictures will be of IDs, so people's names
> have to be recognized correctly. I tried the (line image, ground truth)
> way of fine-tuning the traineddata with a very small number of images.
> The result was not good. I was surprised, as I expected at least the
> performance of the default model. My question: if I collect a substantial
> number of images and produce line images and ground truth from them, will
> that help improve recognition?
>
> On Sunday, September 27, 2020 at 9:21:17 PM UTC+6 Grad wrote:
>
>> @shree thank you for the advice, it was helpful. I managed to get
>> everything working satisfactorily: after adding additional training images,
>> I now get perfect results (446 pass, 0 fail)! Furthermore, these results
>> come with using the built-in "eng" model. I ended up not needing to
>> re-train or fine-tune Tesseract. The ticket was finding the magic sequence
>> of image processing steps to perform on my source images to prepare them
>> for input to Tesseract OCR
>>
>> I have battled with this problem since your response and have come close
>> to giving up more than once, thinking that perhaps Tesseract simply isn't
>> up to the task. But the limited character set and the uniformity of the
>> character appearances kept me going -- there just had to be a way to make
>> this work. I'd love to document all the things I tried, and what results
>> they gave, but there is just too much. A quick summary will have to suffice.
>>
>> *What got me close but ultimately didn't work*
>>
>>    - Resized my images so the text was 36px in height. I did this in
>>    Python using OpenCV and (wrongly I think) chose the cv2.INTER_AREA
>>    interpolation method.
>>    - Tried different values for MAX_ITERATIONS in tesstrain's Makefile,
>>    and got varied results but nothing perfect.
>>    - Downloaded
>>    https://github.com/Shreeshrii/tessdata_shreetest/blob/master/digits_comma.traineddata
>>    and used it for the START_MODEL of tesstrain's Makefile (also had to set
>>    TESSDATA for the Makefile)
>>    - Between these things, the best result I ever got was something like
>>    this (input on left, OCR output on right):
>>    21,485,000 -> 21,483,000
>>    21,875,000 -> 21,873,000
>>    24,995 -> 24,999
>>    5,450,000 -> 9,450,000
>>    591,958 -> 9591,958
>>    851 -> 8571
>>    851 -> 8571
>>    Pass: 428
>>    Fail: 7
>>    - So you can see, close, but still some pretty unforgivable errors
>>    (unforgivable to me due to the nature of my application -- these numbers
>>    need to be perfect)
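The pass/fail tally above can be reproduced with a small scoring loop. This is a minimal sketch, not the poster's actual script: `score_ocr` and the stand-in `fake_results` lookup are hypothetical names, and the two failing samples are copied from the list above. In real use the `ocr` callable would run Tesseract on the image file.

```python
from typing import Callable, Iterable, List, Tuple


def score_ocr(
    samples: Iterable[Tuple[str, str]],
    ocr: Callable[[str], str],
) -> Tuple[int, int, List[Tuple[str, str]]]:
    """Count exact matches between expected text and OCR output.

    samples yields (image_name, expected_text) pairs; ocr maps an image
    name to the recognized text.
    """
    passed = 0
    failures: List[Tuple[str, str]] = []
    for image_name, expected in samples:
        actual = ocr(image_name).strip()  # Tesseract output ends with a newline
        if actual == expected:
            passed += 1
        else:
            failures.append((expected, actual))
    return passed, len(failures), failures


# Stand-in for a real OCR call, so the sketch runs without Tesseract.
fake_results = {
    "24,995.png": "24,999",
    "851.png": "8571",
    "638,997.png": "638,997\n",
}
# The expected text is the file name without its extension.
samples = [(name, name.rsplit(".", 1)[0]) for name in fake_results]

passed, failed, failures = score_ocr(samples, fake_results.__getitem__)
for expected, actual in failures:
    print(f"{expected} -> {actual}")
print(f"Pass: {passed}")
print(f"Fail: {failed}")
```

Comparing whole strings rather than characters matches the requirement stated above that every number be exactly right.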
>>
>> *What ultimately did work*
>>
>>    - In an act of desperation, and following a bit of a hunch, I
>>    abandoned trying to train/re-train/fine-tune, and just focused on getting
>>    perfect OCR on one of the images where it failed using "eng" model
>>       - I chose this file, 1,000,000.png, which produced an empty string
>>       when run through Tesseract
>>       - I used GIMP on Windows and opened 1,000,000.png and began
>>       adjusting/tweaking/filtering the image in various ways, each time
>>       re-trying the OCR to see if the result changed. Using GIMP was crucial
>>       because it allowed me to iterate through different image processing
>>       techniques using a GUI, which was much quicker than doing the same
>>       thing in Python using OpenCV.
>>    - Once I found what worked, I implemented it in Python. The magic
>>    steps ended up being:
>>       1. Read the source image as color:
>>       image_to_ocr = cv2.imread(raw_image_file_name, cv2.IMREAD_COLOR)
>>       2. Use only the green channel of the source image. The numbers in
>>       my source images are mostly green tinted and I thought maybe this would
>>       help. This results in a grayscale image with a dark background and
>>       white text:
>>       b, image_to_ocr, r = cv2.split(image_to_ocr)
>>       3. Enlarge the image by 2x. This resulted in text that is ~20px in
>>       height, and I found this to be both necessary and sufficient. I also
>>       found the use of cv2.INTER_CUBIC instead of cv2.INTER_AREA to be
>>       crucial here. I think the resizing (enlarging in my case) of the images
>>       was an absolute must-have. I'm really thankful I posted here and really
>>       thankful to @shree for that little nugget of insight.
>>       image_to_ocr = cv2.resize(image_to_ocr, (image_to_ocr.shape[1] * 2,
>>       image_to_ocr.shape[0] * 2), interpolation = cv2.INTER_CUBIC)
>>       4. Invert the image so that the background is white and the text
>>       is black. I am not sure if this step was necessary.
>>       image_to_ocr = cv2.bitwise_not(image_to_ocr)
>>       - With these steps, 1,000,000.png OCR'd perfectly
>>    - I then re-ran my script to check accuracy on all 400+ source
>>    images, and got the perfect result. I was so nervous while the script was
>>    running; it prints out errors as it goes, and so many times before I'd run
>>    the script with eager anticipation that I'd finally gotten everything
>>    right, only to have an error appear. This time...it ran...seconds go
>>    by...more seconds go by...no errors...I can't look OMG...check back in 30
>>    seconds, 446 pass, 0 fail, I literally stood up and whooped and hollered
>>    with arms raised.
>>
>>
>> On Sunday, September 20, 2020 at 11:09:02 AM UTC-5 shree wrote:
>>
>>> Resize your images so that text is 36 pixels high. That's what is used
>>> for eng models.
>>>
>>> Since you are fine tuning, limit number of iterations to 400 or so (not
>>> 10000 which is default).
>>>
>>> Use a debug_interval of -1 during training so that you can see the details
>>> per iteration.
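The resize advice comes down to one ratio: target text height divided by measured text height. A small sketch of that arithmetic follows. The helper names are hypothetical, and the 10px figure is only an estimate inferred from the later report in this thread that a 2x enlargement gave text about 20px high.

```python
from typing import Tuple


def scale_for_text_height(measured_px: float, target_px: float = 36.0) -> float:
    """Factor that brings text of measured_px height to target_px height."""
    if measured_px <= 0:
        raise ValueError("measured text height must be positive")
    return target_px / measured_px


def scaled_size(width: int, height: int, factor: float) -> Tuple[int, int]:
    """New (width, height) in pixels after scaling, never smaller than 1x1."""
    return max(1, round(width * factor)), max(1, round(height * factor))


factor = scale_for_text_height(10)  # assumed ~10px glyphs in the source images
print(factor, scaled_size(40, 12, factor))
```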
>>>
>>>
>>>
>>> On Sun, Sep 20, 2020, 00:24 Grad <kes...@gmail.com> wrote:
>>>
>>>> I have fixed my ground-truth file creator script to eliminate the
>>>> badly-formed numbers and have re-run my experiment. Unfortunately, I am
>>>> still seeing really poor results (12 pass, 383 fail), even though the
>>>> training error rates appear to be much smaller this time around:
>>>>
>>>> At iteration 509/10000/10000, Mean rms=0.184%, delta=0.055%, char
>>>> train=0.344%, word train=2.5%, skip ratio=0%,  New worst char error = 0.344
>>>> wrote checkpoint.
>>>>
>>>> Finished! Error rate = 0.308
>>>> lstmtraining \
>>>> --stop_training \
>>>> --continue_from data/swtor/checkpoints/swtor_checkpoint \
>>>> --traineddata data/swtor/swtor.traineddata \
>>>> --model_output data/swtor.traineddata
>>>> Loaded file data/swtor/checkpoints/swtor_checkpoint, unpacking...
>>>>
>>>> Full log of "make training" is attached.
>>>>
>>>> When I run Tesseract using the "eng" and "swtor" models on the training
>>>> images, I'm seeing the following types of results:
>>>>
>>>> "eng" model results for 638,997.png:
>>>>
>>>> > tesseract --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' 638,997.png out
>>>> Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
>>>> Warning: Invalid resolution 0 dpi. Using 70 instead.
>>>> > cat .\out.txt
>>>> 638,997
>>>>
>>>> "swtor" model results for 638,997.png:
>>>>
>>>> > tesseract --tessdata-dir -l swtor --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' 638,997.png out
>>>> Failed to load any lstm-specific dictionaries for lang swtor!!
>>>> Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
>>>> Warning: Invalid resolution 0 dpi. Using 70 instead.
>>>> > cat .\out.txt
>>>> 3,9,997
>>>>
>>>> In general, digits are more erroneous, and there is a proliferation of
>>>> commas.
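To compare the two models on many images, the command lines above can be built and run from Python. This is a sketch under assumptions: `tesseract_command` and `ocr_line` are hypothetical helper names, `stdout` is used as the output base so the text is printed instead of written to out.txt, and `data` is a placeholder because the argument to `--tessdata-dir` does not appear in the command as posted.

```python
import subprocess
from typing import List, Optional


def tesseract_command(
    image: str,
    lang: str = "eng",
    tessdata_dir: Optional[str] = None,
    whitelist: str = ",0123456789",
) -> List[str]:
    """Build a single-line (--psm 7), LSTM-only (--oem 1) Tesseract call."""
    cmd = ["tesseract", image, "stdout"]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", tessdata_dir]
    cmd += ["-l", lang, "--psm", "7", "--oem", "1"]
    # Passed as one list element, so no shell quoting is needed.
    cmd += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return cmd


def ocr_line(image: str, **kwargs) -> str:
    """Run Tesseract on one image (requires the tesseract binary on PATH)."""
    completed = subprocess.run(
        tesseract_command(image, **kwargs),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()


# Only prints the command; nothing is executed here.
print(" ".join(tesseract_command("638,997.png", lang="swtor", tessdata_dir="data")))
```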
>>>>
>>>> Do any other ideas come to mind? I appreciate your help Shree!
>>>>
>>>> On Saturday, September 19, 2020 at 12:12:19 PM UTC-5 Grad wrote:
>>>>
>>>>> If it turns out to be that simple, I will feel really relieved and
>>>>> really stupid at the same time. I cannot believe I didn't catch this
>>>>> before posting. Thank you for taking a look, I'll fix my ground-truth
>>>>> file creator script and try again.
>>>>>
>>>>> On Saturday, September 19, 2020 at 12:01:50 PM UTC-5 shree wrote:
>>>>>
>>>>>> You will get better results when you fix your training data (I
>>>>>> deleted all file names ending in -2 and -3).
>>>>>>
>>>>>> Mean rms=0.145%, delta=0.046%, train=0.214%(1.01%), skip ratio=0%
>>>>>> Iteration 396: GROUND  TRUTH : 5,500,000
>>>>>> File data/swtor-ground-truth/5,500,000.lstmf line 0 (Perfect):
>>>>>> Mean rms=0.145%, delta=0.046%, train=0.214%(1.008%), skip ratio=0%
>>>>>> Iteration 397: GROUND  TRUTH : 2,000,000
>>>>>> File data/swtor-ground-truth/2,000,000.lstmf line 0 (Perfect):
>>>>>> Mean rms=0.145%, delta=0.045%, train=0.213%(1.005%), skip ratio=0%
>>>>>> Iteration 398: GROUND  TRUTH : 6,435
>>>>>> File data/swtor-ground-truth/6,435.lstmf line 0 (Perfect):
>>>>>> Mean rms=0.145%, delta=0.045%, train=0.213%(1.003%), skip ratio=0%
>>>>>> Iteration 399: GROUND  TRUTH : 3,750,000
>>>>>> File data/swtor-ground-truth/3,750,000.lstmf line 0 (Perfect):
>>>>>> Mean rms=0.144%, delta=0.045%, train=0.212%(1%), skip ratio=0%
>>>>>> 2 Percent improvement time=4, best error was 100 @ 0
>>>>>> At iteration 4/400/400, Mean rms=0.144%, delta=0.045%, char
>>>>>> train=0.212%, word train=1%, skip ratio=0%,  New best char error = 0.212
>>>>>> wrote best model:data/swtor/checkpoints/swtor_0.212_4_400.checkpoint
>>>>>> wrote checkpoint.
>>>>>>
>>>>>> Iteration 400: GROUND  TRUTH : 5,222,100
>>>>>> File data/swtor-ground-truth/5,222,100.lstmf line 0 (Perfect):
>>>>>> Mean rms=0.144%, delta=0.045%, train=0.212%(0.998%), skip ratio=0%
>>>>>> Iteration 401: GROUND  TRUTH : 696,969
>>>>>> File data/swtor-ground-truth/696,969.lstmf line 0 (Perfect):
>>>>>> Mean rms=0.144%, delta=0.045%, train=0.211%(0.995%), skip ratio=0%
>>>>>> Iteration 402: GROUND  TRUTH : 71,000,000
>>>>>> File data/swtor-ground-truth/71,000,000.lstmf line 0 (Perfect):
>>>>>> Mean rms=0.144%, delta=0.045%, train=0.211%(0.993%), skip ratio=0%
>>>>>> Iteration 403: GROUND  TRUTH : 64,500
>>>>>> File data/swtor-ground-truth/64,500.lstmf line 0 (Perfect):
>>>>>> Mean rms=0.144%, delta=0.045%, train=0.21%(0.99%), skip ratio=0%
>>>>>> Iteration 404: GROUND  TRUTH : 39,500,000
>>>>>> File data/swtor-ground-truth/39,500,000.lstmf line 0 (Perfect):
>>>>>> Mean rms=0.144%, delta=0.045%, train=0.21%(0.988%), skip ratio=0%
>>>>>> Iteration 405: GROUND  TRUTH : 4,500,000
>>>>>> File data/swtor-ground-truth/4,500,000.lstmf line 0 (Perfect):
>>>>>> Mean rms=0.143%, delta=0.045%, train=0.209%(0.985%), skip ratio=0%
>>>>>> Iteration 406: GROUND  TRUTH : 1,450,000
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Sep 19, 2020 at 10:15 PM Shree Devi Kumar <shree...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> > Each of my PNG files has a file name that indicates ground truth,
>>>>>>> > and I have a little script that generates ground-truth TXT files from
>>>>>>> > the PNG file names.
>>>>>>>
>>>>>>> Please review your script. I notice a number of file names ending
>>>>>>> with -2. The gt.txt files for those images also contain -2, while the
>>>>>>> image shows only the number.
>>>>>>>
>>>>>>> Example files attached.
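One way to avoid the stray suffix is to strip it when deriving the ground truth from the file name. This is a sketch, not the poster's script: it assumes a trailing "-2" or "-3" only marks a duplicate capture of the same number, and that each `name.png` needs a matching `name.gt.txt` in the same directory. The function names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: a trailing "-<digits>" marks a duplicate image, not part of the text.
_DUPLICATE_SUFFIX = re.compile(r"-\d+$")


def ground_truth_for(image_path: Path) -> str:
    """Text shown in the image, derived from the file name."""
    return _DUPLICATE_SUFFIX.sub("", image_path.stem)


def write_ground_truth(directory: Path) -> int:
    """Write one .gt.txt per PNG in directory; return how many were written."""
    count = 0
    for image_path in sorted(directory.glob("*.png")):
        gt_path = image_path.with_name(image_path.stem + ".gt.txt")
        gt_path.write_text(ground_truth_for(image_path) + "\n", encoding="utf-8")
        count += 1
    return count


print(ground_truth_for(Path("5,500,000-2.png")))
print(ground_truth_for(Path("6,435.png")))
```

Deleting the duplicate files works as well. Stripping the suffix instead keeps the extra images available as training samples.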
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> ____________________________________________________________
>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>
>>>>> --
>>>>
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to tesseract-oc...@googlegroups.com.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/70e5fed6-3035-4885-965c-0552560ef0f6n%40googlegroups.com
>>>>
>>> --
>

