Resize your images so that text is 36 pixels high. That's what is used for eng models.
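To make the resize concrete, here is a minimal sketch of the arithmetic, assuming you have already measured the current text height in pixels yourself (for example in an image editor). The function name `scaled_size` and its parameters are only illustrative; they are not part of Tesseract or tesstrain.

```python
def scaled_size(width, height, text_height, target_text_height=36):
    """Return (new_width, new_height) so the text ends up target_text_height px tall.

    text_height is the measured height of the text in the source image
    (an assumption: you have to measure it yourself).
    """
    if text_height <= 0:
        raise ValueError("text_height must be positive")
    scale = target_text_height / text_height
    # Keep the aspect ratio; never return a zero-sized dimension.
    return max(1, round(width * scale)), max(1, round(height * scale))


# Example: a 120x24 image whose digits are 18 px tall needs to be doubled.
print(scaled_size(120, 24, 18))  # (240, 48)
```

The resulting size can be passed to whatever image tool you already use for the actual resampling.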
Since you are fine-tuning, limit the number of iterations to 400 or so (not 10000, which is the default). Use a debug_interval of -1 during training so that you can see the details for each iteration.

On Sun, Sep 20, 2020, 00:24 Grad <kes...@gmail.com> wrote:

> I have fixed my ground-truth file creator script to eliminate the badly-formed numbers and have re-run my experiment. Unfortunately, I am still seeing really poor results (12 pass, 383 fail), even though the training error rates appear to be much smaller this time around:
>
> At iteration 509/10000/10000, Mean rms=0.184%, delta=0.055%, char train=0.344%, word train=2.5%, skip ratio=0%, New worst char error = 0.344 wrote checkpoint.
>
> Finished! Error rate = 0.308
> lstmtraining \
> --stop_training \
> --continue_from data/swtor/checkpoints/swtor_checkpoint \
> --traineddata data/swtor/swtor.traineddata \
> --model_output data/swtor.traineddata
> Loaded file data/swtor/checkpoints/swtor_checkpoint, unpacking...
>
> Full log of "make training" is attached.
>
> When I run Tesseract using the "eng" and "swtor" models on the training images, I'm seeing the following types of results:
>
> "eng" model results for 638,997.png:
>
> > tesseract --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' 638,997.png out
> Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
> Warning: Invalid resolution 0 dpi. Using 70 instead.
> > cat .\out.txt
> 638,997
>
> "swtor" model results for 638,997.png:
>
> > tesseract --tessdata-dir -l swtor --psm 7 --oem 1 -c tessedit_char_whitelist=',0123456789' 638,997.png out
> Failed to load any lstm-specific dictionaries for lang swtor!!
> Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
> Warning: Invalid resolution 0 dpi. Using 70 instead.
> > cat .\out.txt
> 3,9,997
>
> In general, digits are more erroneous, and there is a proliferation of commas.
>
> Do any other ideas come to mind? I appreciate your help, Shree!
>
> On Saturday, September 19, 2020 at 12:12:19 PM UTC-5 Grad wrote:
>
>> If it turns out to be that simple, I will feel really relieved and really stupid at the same time. I cannot believe I didn't catch this before posting. Thank you for taking a look, I'll fix my ground-truth file creator script and try again.
>>
>> On Saturday, September 19, 2020 at 12:01:50 PM UTC-5 shree wrote:
>>
>>> You will get better results when you fix your training data (I deleted all file names ending in -2 and -3).
>>>
>>> Mean rms=0.145%, delta=0.046%, train=0.214%(1.01%), skip ratio=0%
>>> Iteration 396: GROUND TRUTH : 5,500,000
>>> File data/swtor-ground-truth/5,500,000.lstmf line 0 (Perfect):
>>> Mean rms=0.145%, delta=0.046%, train=0.214%(1.008%), skip ratio=0%
>>> Iteration 397: GROUND TRUTH : 2,000,000
>>> File data/swtor-ground-truth/2,000,000.lstmf line 0 (Perfect):
>>> Mean rms=0.145%, delta=0.045%, train=0.213%(1.005%), skip ratio=0%
>>> Iteration 398: GROUND TRUTH : 6,435
>>> File data/swtor-ground-truth/6,435.lstmf line 0 (Perfect):
>>> Mean rms=0.145%, delta=0.045%, train=0.213%(1.003%), skip ratio=0%
>>> Iteration 399: GROUND TRUTH : 3,750,000
>>> File data/swtor-ground-truth/3,750,000.lstmf line 0 (Perfect):
>>> Mean rms=0.144%, delta=0.045%, train=0.212%(1%), skip ratio=0%
>>> 2 Percent improvement time=4, best error was 100 @ 0
>>> At iteration 4/400/400, Mean rms=0.144%, delta=0.045%, char train=0.212%, word train=1%, skip ratio=0%, New best char error = 0.212 wrote best model:data/swtor/checkpoints/swtor_0.212_4_400.checkpoint wrote checkpoint.
>>>
>>> Iteration 400: GROUND TRUTH : 5,222,100
>>> File data/swtor-ground-truth/5,222,100.lstmf line 0 (Perfect):
>>> Mean rms=0.144%, delta=0.045%, train=0.212%(0.998%), skip ratio=0%
>>> Iteration 401: GROUND TRUTH : 696,969
>>> File data/swtor-ground-truth/696,969.lstmf line 0 (Perfect):
>>> Mean rms=0.144%, delta=0.045%, train=0.211%(0.995%), skip ratio=0%
>>> Iteration 402: GROUND TRUTH : 71,000,000
>>> File data/swtor-ground-truth/71,000,000.lstmf line 0 (Perfect):
>>> Mean rms=0.144%, delta=0.045%, train=0.211%(0.993%), skip ratio=0%
>>> Iteration 403: GROUND TRUTH : 64,500
>>> File data/swtor-ground-truth/64,500.lstmf line 0 (Perfect):
>>> Mean rms=0.144%, delta=0.045%, train=0.21%(0.99%), skip ratio=0%
>>> Iteration 404: GROUND TRUTH : 39,500,000
>>> File data/swtor-ground-truth/39,500,000.lstmf line 0 (Perfect):
>>> Mean rms=0.144%, delta=0.045%, train=0.21%(0.988%), skip ratio=0%
>>> Iteration 405: GROUND TRUTH : 4,500,000
>>> File data/swtor-ground-truth/4,500,000.lstmf line 0 (Perfect):
>>> Mean rms=0.143%, delta=0.045%, train=0.209%(0.985%), skip ratio=0%
>>> Iteration 406: GROUND TRUTH : 1,450,000
>>>
>>> On Sat, Sep 19, 2020 at 10:15 PM Shree Devi Kumar <shree...@gmail.com> wrote:
>>>
>>>> > Each of my PNG files have file names that indicate ground truth, and I have a little script that generates ground-truth TXT files from the PNG file names.
>>>>
>>>> Please review your script. I notice a number of file names ending with -2. The gt.txt files for the same also contain -2 while the image only has the number.
>>>>
>>>> Example files attached.
>>> --
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> --
> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/70e5fed6-3035-4885-965c-0552560ef0f6n%40googlegroups.com.
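
For the ground-truth problem quoted earlier in this thread (file names ending in -2 or -3 leaking into the gt.txt files), a ground-truth generator could strip the duplicate suffix before writing the text. This is only a sketch, not the original poster's script: the function names are hypothetical, and it assumes duplicates are always named like `5,500,000-2.png` and that each `name.png` gets a matching `name.gt.txt` next to it.

```python
import re
from pathlib import Path

# Assumption: a trailing "-<digits>" only marks a duplicate image of the same
# number and is not part of the text shown in the image.
_DUPLICATE_SUFFIX = re.compile(r"-\d+$")


def ground_truth_from_name(png_name):
    """Derive the ground-truth text from a PNG file name."""
    stem = Path(png_name).stem              # "5,500,000-2.png" -> "5,500,000-2"
    return _DUPLICATE_SUFFIX.sub("", stem)  # -> "5,500,000"


def write_gt_files(image_dir):
    """Write a <name>.gt.txt file next to every PNG in image_dir."""
    for png in sorted(Path(image_dir).glob("*.png")):
        text = ground_truth_from_name(png.name)
        png.with_suffix(".gt.txt").write_text(text + "\n", encoding="utf-8")


print(ground_truth_from_name("5,500,000-2.png"))  # 5,500,000
print(ground_truth_from_name("638,997.png"))      # 638,997
```

Keeping the suffix in the file name while dropping it from the text lets duplicate images of the same number coexist without corrupting the ground truth.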