Thank you for sharing the results of your fine-tuning trial, and for
reporting that pre-processing the images gave you better results with the
official traineddata.

Hope your notes will help other users with similar questions.

On Sun, Sep 27, 2020, 20:51 Grad <kes...@gmail.com> wrote:

> @shree thank you for the advice, it was helpful. I managed to get
> everything working satisfactorily: after adding additional training images,
> I now get perfect results (446 pass, 0 fail)! Furthermore, these results
> come with using the built-in "eng" model. I ended up not needing to
> re-train or fine-tune Tesseract. The ticket was finding the magic sequence
> of image processing steps to perform on my source images to prepare them
> for input to Tesseract OCR.
>
> I have battled with this problem since your response and have come close
> to giving up more than once, thinking that perhaps Tesseract simply isn't
> up to the task. But the limited character set and the uniformity of the
> character appearances kept me going -- there just had to be a way to make
> this work. I'd love to document all the things I tried, and what results
> they gave, but there is just too much. A quick summary will have to suffice.
>
> *What got me close but ultimately didn't work*
>
>    - Resized my images so the text was 36px in height. I did this in
>    Python using OpenCV and (wrongly I think) chose the cv2.INTER_AREA
>    interpolation method.
>    - Tried different values for MAX_ITERATIONS in tesstrain's Makefile,
>    and got varied results but nothing perfect.
>    - Downloaded
>    https://github.com/Shreeshrii/tessdata_shreetest/blob/master/digits_comma.traineddata
>    and used it as the START_MODEL in tesstrain's Makefile (also had to set
>    TESSDATA for the Makefile)
>    - Between these things, the best result I ever got was something like
>    this (input on left, OCR output on right):
>    21,485,000 -> 21,483,000
>    21,875,000 -> 21,873,000
>    24,995 -> 24,999
>    5,450,000 -> 9,450,000
>    591,958 -> 9591,958
>    851 -> 8571
>    851 -> 8571
>    Pass: 428
>    Fail: 7
>    - So you can see, close, but still some pretty unforgivable errors
>    (unforgivable to me due to the nature of my application -- these numbers
>    need to be perfect)
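
A pass/fail tally like the one quoted above can be produced with a small
loop. This is only a sketch, not the script used in this thread: the names
run_tesseract, evaluate and format_report are made up for illustration, and
it assumes the tesseract executable is on PATH and is called with the same
--psm 7 / --oem 1 / whitelist options that appear elsewhere in this
discussion. The OCR function is passed in as a parameter so the tally logic
can be checked without Tesseract installed.

```python
import subprocess
from typing import Callable, Iterable, List, Tuple


def run_tesseract(image_path: str) -> str:
    """OCR one single-line image via the tesseract CLI (assumed to be on PATH)."""
    completed = subprocess.run(
        [
            "tesseract", image_path, "stdout",
            "--psm", "7", "--oem", "1",
            "-c", "tessedit_char_whitelist=,0123456789",
        ],
        capture_output=True, text=True, check=True,
    )
    return completed.stdout.strip()


def evaluate(
    cases: Iterable[Tuple[str, str]],
    ocr: Callable[[str], str] = run_tesseract,
) -> Tuple[int, List[Tuple[str, str, str]]]:
    """Return (pass_count, failures); each failure is (path, expected, actual)."""
    passed = 0
    failures: List[Tuple[str, str, str]] = []
    for image_path, expected in cases:
        actual = ocr(image_path)
        if actual == expected:
            passed += 1
        else:
            failures.append((image_path, expected, actual))
    return passed, failures


def format_report(passed: int, failures: List[Tuple[str, str, str]]) -> str:
    """Render failures as 'expected -> actual' lines followed by the totals."""
    lines = [f"{expected} -> {actual}" for _, expected, actual in failures]
    lines.append(f"Pass: {passed}")
    lines.append(f"Fail: {len(failures)}")
    return "\n".join(lines)
```

Calling evaluate() with a list of (image path, expected text) pairs and
printing format_report() gives output in the same shape as the list above.
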
>
> *What ultimately did work*
>
>    - In an act of desperation, and following a bit of a hunch, I
>    abandoned trying to train/re-train/fine-tune, and just focused on getting
>    perfect OCR on one of the images where it failed using "eng" model
>       - I chose this file 1,000,000.png, which produced an empty string
>       when run through Tesseract
>       - I used GIMP on Windows and opened 1,000,000.png and began
>       adjusting/tweaking/filtering the image in various ways, each time
>       re-trying the OCR to see if the result changed. Using GIMP was
>       crucial because it allowed me to iterate through different image
>       processing techniques using a GUI, which was much quicker than doing
>       the same thing in Python using OpenCV.
>    - Once I found what worked, I implemented it in Python. The magic
>    steps ended up being:
>       1. Read the source image as color:
>       image_to_ocr = cv2.imread(raw_image_file_name, cv2.IMREAD_COLOR)
>       2. Use only the green channel of the source image. The numbers in
>       my source images are mostly green tinted and I thought maybe this would
>       help. This results in a grayscale image with a dark background and white
>       text:
>       b, image_to_ocr, r = cv2.split(image_to_ocr)
>       3. Enlarge the image by 2x. This resulted in text that is ~20px in
>       height, which I found to be both necessary and sufficient. I also
>       found the use of cv2.INTER_CUBIC instead of cv2.INTER_AREA to be
>       crucial here. I think the resizing (enlarging in my case) of the
>       images was an absolute must-have. I'm really thankful I posted here
>       and really thankful to @shree for that little nugget of insight.
>       image_to_ocr = cv2.resize(image_to_ocr, (image_to_ocr.shape[1] * 2,
>       image_to_ocr.shape[0] * 2), interpolation = cv2.INTER_CUBIC)
>       4. Invert the image so that the background is white and the text is
>       black. I am not sure if this step was necessary.
>       image_to_ocr = cv2.bitwise_not(image_to_ocr)
>       - With these steps, 1,000,000.png OCR'd perfectly
>    - I then re-ran my script to check accuracy on all 400+ source images,
>    and got the perfect result. I was so nervous while the script was running;
>    it prints out errors as it goes, and so many times before I'd run the
>    script with eager anticipation that I'd finally gotten everything right,
>    only to have an error appear. This time...it ran...seconds go by...more
>    seconds go by...no errors...I can't look OMG...check back in 30 seconds,
>    446 pass, 0 fail, I literally stood up and whooped and hollered with arms
>    raised.
>
>
> On Sunday, September 20, 2020 at 11:09:02 AM UTC-5 shree wrote:
>
>> Resize your images so that text is 36 pixels high. That's what is used
>> for eng models.
>>
>> Since you are fine tuning, limit number of iterations to 400 or so (not
>> 10000 which is default).
>>
>> Use a debug_interval of -1 during training so that you can see the
>> details per iteration.
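
The resize factor needed to reach a given text height is simple arithmetic.
A small sketch, with made-up helper names (scale_for_text_height,
scaled_size) and example numbers that are only illustrative:

```python
from typing import Tuple


def scale_for_text_height(current_px: float, target_px: float = 36.0) -> float:
    """Factor by which to resize so text of current_px becomes target_px tall."""
    if current_px <= 0:
        raise ValueError("current_px must be positive")
    return target_px / current_px


def scaled_size(width: int, height: int, factor: float) -> Tuple[int, int]:
    """New (width, height) in whole pixels after scaling by factor."""
    return max(1, round(width * factor)), max(1, round(height * factor))
```

For example, text measured at 10 px would need a factor of 3.6 to reach
36 px, and a 100 x 14 image would become 360 x 50.
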
>>
>>
>>
>> On Sun, Sep 20, 2020, 00:24 Grad <kes...@gmail.com> wrote:
>>
>>> I have fixed my ground-truth file creator script to eliminate the
>>> badly-formed numbers and have re-run my experiment. Unfortunately, I am
>>> still seeing really poor results (12 pass, 383 fail), even though the
>>> training error rates appear to be much smaller this time around:
>>>
>>> At iteration 509/10000/10000, Mean rms=0.184%, delta=0.055%, char
>>> train=0.344%, word train=2.5%, skip ratio=0%,  New worst char error = 0.344
>>> wrote checkpoint.
>>>
>>> Finished! Error rate = 0.308
>>> lstmtraining \
>>> --stop_training \
>>> --continue_from data/swtor/checkpoints/swtor_checkpoint \
>>> --traineddata data/swtor/swtor.traineddata \
>>> --model_output data/swtor.traineddata
>>> Loaded file data/swtor/checkpoints/swtor_checkpoint, unpacking...
>>>
>>> Full log of "make training" is attached.
>>>
>>> When I run Tesseract using the "eng" and "swtor" models on the training
>>> images, I'm seeing the following types of results:
>>>
>>> "eng" model results for 638,997.png:
>>>
>>> > tesseract --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' 638,997.png out
>>> Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
>>> Warning: Invalid resolution 0 dpi. Using 70 instead.
>>> > cat .\out.txt
>>> 638,997
>>>
>>> "swtor" model results for 638,997.png:
>>>
>>> > tesseract --tessdata-dir -l swtor --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' 638,997.png out
>>> Failed to load any lstm-specific dictionaries for lang swtor!!
>>> Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
>>> Warning: Invalid resolution 0 dpi. Using 70 instead.
>>> > cat .\out.txt
>>> 3,9,997
>>>
>>> In general, digits are more erroneous, and there is a proliferation of
>>> commas.
>>>
>>> Do any other ideas come to mind? I appreciate your help Shree!
>>>
>>> On Saturday, September 19, 2020 at 12:12:19 PM UTC-5 Grad wrote:
>>>
>>>> If it turns out to be that simple, I will feel really relieved and
>>>> really stupid at the same time. I cannot believe I didn't catch this before
>>>> posting. Thank you for taking a look, I'll fix my ground-truth file creator
>>>> script and try again.
>>>>
>>>> On Saturday, September 19, 2020 at 12:01:50 PM UTC-5 shree wrote:
>>>>
>>>>> You will get better results when you fix your training data (I deleted
>>>>> all file names ending in -2 and -3).
>>>>>
>>>>> Mean rms=0.145%, delta=0.046%, train=0.214%(1.01%), skip ratio=0%
>>>>> Iteration 396: GROUND  TRUTH : 5,500,000
>>>>> File data/swtor-ground-truth/5,500,000.lstmf line 0 (Perfect):
>>>>> Mean rms=0.145%, delta=0.046%, train=0.214%(1.008%), skip ratio=0%
>>>>> Iteration 397: GROUND  TRUTH : 2,000,000
>>>>> File data/swtor-ground-truth/2,000,000.lstmf line 0 (Perfect):
>>>>> Mean rms=0.145%, delta=0.045%, train=0.213%(1.005%), skip ratio=0%
>>>>> Iteration 398: GROUND  TRUTH : 6,435
>>>>> File data/swtor-ground-truth/6,435.lstmf line 0 (Perfect):
>>>>> Mean rms=0.145%, delta=0.045%, train=0.213%(1.003%), skip ratio=0%
>>>>> Iteration 399: GROUND  TRUTH : 3,750,000
>>>>> File data/swtor-ground-truth/3,750,000.lstmf line 0 (Perfect):
>>>>> Mean rms=0.144%, delta=0.045%, train=0.212%(1%), skip ratio=0%
>>>>> 2 Percent improvement time=4, best error was 100 @ 0
>>>>> At iteration 4/400/400, Mean rms=0.144%, delta=0.045%, char
>>>>> train=0.212%, word train=1%, skip ratio=0%,  New best char error = 0.212
>>>>> wrote best model:data/swtor/checkpoints/swtor_0.212_4_400.checkpoint wrote
>>>>> checkpoint.
>>>>>
>>>>> Iteration 400: GROUND  TRUTH : 5,222,100
>>>>> File data/swtor-ground-truth/5,222,100.lstmf line 0 (Perfect):
>>>>> Mean rms=0.144%, delta=0.045%, train=0.212%(0.998%), skip ratio=0%
>>>>> Iteration 401: GROUND  TRUTH : 696,969
>>>>> File data/swtor-ground-truth/696,969.lstmf line 0 (Perfect):
>>>>> Mean rms=0.144%, delta=0.045%, train=0.211%(0.995%), skip ratio=0%
>>>>> Iteration 402: GROUND  TRUTH : 71,000,000
>>>>> File data/swtor-ground-truth/71,000,000.lstmf line 0 (Perfect):
>>>>> Mean rms=0.144%, delta=0.045%, train=0.211%(0.993%), skip ratio=0%
>>>>> Iteration 403: GROUND  TRUTH : 64,500
>>>>> File data/swtor-ground-truth/64,500.lstmf line 0 (Perfect):
>>>>> Mean rms=0.144%, delta=0.045%, train=0.21%(0.99%), skip ratio=0%
>>>>> Iteration 404: GROUND  TRUTH : 39,500,000
>>>>> File data/swtor-ground-truth/39,500,000.lstmf line 0 (Perfect):
>>>>> Mean rms=0.144%, delta=0.045%, train=0.21%(0.988%), skip ratio=0%
>>>>> Iteration 405: GROUND  TRUTH : 4,500,000
>>>>> File data/swtor-ground-truth/4,500,000.lstmf line 0 (Perfect):
>>>>> Mean rms=0.143%, delta=0.045%, train=0.209%(0.985%), skip ratio=0%
>>>>> Iteration 406: GROUND  TRUTH : 1,450,000
>>>>>
>>>>>
>>>>> On Sat, Sep 19, 2020 at 10:15 PM Shree Devi Kumar <shree...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> > Each of my PNG files has a file name that indicates ground truth,
>>>>>> and I have a little script that generates ground-truth TXT files from the
>>>>>> PNG file names.
>>>>>>
>>>>>> Please review your script. I notice a number of file names ending
>>>>>> with -2. The gt.txt files for the same also contain -2 while the
>>>>>> image only has the number.
>>>>>>
>>>>>> Example files attached.
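
One way to avoid the problem described above is to strip a trailing -2, -3
and so on from the file name before writing the ground truth. This is a
sketch only, not the script discussed in this thread: the function names
are invented, it assumes the suffix is always a hyphen followed by digits
at the end of the name, and it assumes each image.png should get an
image.gt.txt file next to it.

```python
import re
from pathlib import Path

# A trailing "-<digits>" marks a duplicate image of the same number.
_DUPLICATE_SUFFIX = re.compile(r"-\d+$")


def ground_truth_from_name(png_path: str) -> str:
    """'851-2.png' -> '851'; '5,500,000.png' -> '5,500,000'."""
    stem = Path(png_path).stem
    return _DUPLICATE_SUFFIX.sub("", stem)


def write_ground_truth_files(image_dir: str) -> int:
    """Write <stem>.gt.txt beside every PNG in image_dir; return the file count."""
    count = 0
    for png in sorted(Path(image_dir).glob("*.png")):
        gt_path = png.with_name(png.stem + ".gt.txt")
        gt_path.write_text(ground_truth_from_name(png.name) + "\n", encoding="utf-8")
        count += 1
    return count
```

The .gt.txt file keeps the full stem (including -2) so that it still pairs
with its image; only the text written inside it has the suffix removed.
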
>>>>>>
>>>>>>
>> You received this message because you are subscribed to the Google Groups
>>> "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>>
>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/70e5fed6-3035-4885-965c-0552560ef0f6n%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/70e5fed6-3035-4885-965c-0552560ef0f6n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
