Re: [tesseract-ocr] understanding lstmeval and use it on pretrained models for comparison

ElGato ElMago Sun, 21 Jul 2019 18:44:24 -0700

Yes.  This is a very good write-up and helpful to trainers.

On Saturday, July 20, 2019 at 0:43:56 UTC+9, shree wrote:
>
> Very well written. You may want to update the wiki pages with the info too.
>
> On Fri, Jul 19, 2019 at 7:45 PM Arno Loo <arno....@gmail.com> wrote:
>
>> I tried to understand the source code as well as I could, and although I 
>> did not find all the answers, I did find some (for tesseract 
>> 4.0.0-beta.3).
>> At iteration 14615/695400/698614, Mean rms=0.158%, delta=0.295%, char
>>  train=1.882%, word train=2.285%, skip ratio=0.4%,  wrote checkpoint.
>>
>> In the above example,
>> 14615 : learning_iteration
>> 695400 : training_iteration
>> 698614 : sample_iteration
>>
>> *sample_iteration* : "Index into training sample set. (sample_iteration 
>> >= training_iteration)." It counts how many times a training file has been 
>> passed into the learning process, whether or not it could be used.
>> *training_iteration* : "Number of actual backward training steps used." 
>> It counts how many times a training file has been SUCCESSFULLY passed into 
>> the learning process.
>>
>> So every time you get an error such as "Image too large to learn!!", 
>> "Encoding of string failed!" or "Deserialize header failed", the 
>> sample_iteration increments but the training_iteration does not.
>> In this example, 1 - (695400 / 698614) ≈ 0.46%, which corresponds to the 
>> logged *skip ratio* of 0.4%: the proportion of files that have been 
>> skipped because of an error.
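
The arithmetic above can be reproduced with a small script. This is only a sketch: the regular expression is derived from the single log line quoted in this thread, so other Tesseract versions may print a different layout, and `parse_progress` is a hypothetical helper name, not part of Tesseract.

```python
import re

# Field layout assumed from the one progress line quoted in this thread.
LOG_RE = re.compile(
    r"At iteration (\d+)/(\d+)/(\d+), Mean rms=([\d.]+)%, delta=([\d.]+)%, "
    r"char train=([\d.]+)%, word train=([\d.]+)%, skip ratio=([\d.]+)%"
)


def parse_progress(line):
    """Extract the three iteration counters and derive the skipped share."""
    # Mail clients wrap the log line, so collapse all whitespace runs first.
    text = " ".join(line.split())
    match = LOG_RE.search(text)
    if match is None:
        raise ValueError("not a recognised progress line")
    learning, training, sample = (int(match.group(i)) for i in (1, 2, 3))
    return {
        "learning_iteration": learning,
        "training_iteration": training,
        "sample_iteration": sample,
        "char_train": float(match.group(6)),
        "logged_skip_ratio": float(match.group(8)),
        # Share of samples that never reached a backward step, in percent.
        "skipped_percent": 100.0 * (1 - training / sample),
    }


line = (
    "At iteration 14615/695400/698614, Mean rms=0.158%, delta=0.295%, char\n"
    " train=1.882%, word train=2.285%, skip ratio=0.4%,  wrote checkpoint."
)
info = parse_progress(line)
print(info["learning_iteration"], round(info["skipped_percent"], 2))
```

The script also makes it easy to check the ordering learning_iteration <= training_iteration <= sample_iteration on every logged line.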
>>
>> *learning_iteration* : "Number of iterations that yielded a non-zero 
>> delta error and thus provided significant learning. (learning_iteration <= 
>> training_iteration). learning_iteration_ is used to measure rate of 
>> learning progress."
>> So it uses the *delta* value to assess whether the iteration has been useful.
>>
>> It is good to know that when you specify a maximum number of iterations 
>> for the training process, it uses the middle iteration number 
>> (training_iteration) to decide when to stop. But when it writes a 
>> checkpoint, the checkpoint name uses the smallest iteration number 
>> (learning_iteration), along with the *char train* rate. So a checkpoint 
>> name is the concatenation of model_name, char_train and learning_iteration.
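
If checkpoint names really carry the char train rate and learning_iteration, they can be sorted without opening the files. The sketch below ASSUMES the layout `<model_name>_<char_train>_<learning_iteration>.checkpoint`; the underscore separators and the absence of extra fields are guesses, not taken from the source code, so check the names your version actually writes. `parse_checkpoint_name` and `best_checkpoint` are hypothetical helpers.

```python
import re

# ASSUMED filename layout: <model_name>_<char_train>_<learning_iteration>.checkpoint
CKPT_RE = re.compile(
    r"^(?P<model>.+)_(?P<char_train>\d+(?:\.\d+)?)_(?P<learning_iteration>\d+)"
    r"\.checkpoint$"
)


def parse_checkpoint_name(name):
    """Return (model, char_train, learning_iteration), or None if no match."""
    match = CKPT_RE.match(name)
    if match is None:
        return None
    return (
        match.group("model"),
        float(match.group("char_train")),
        int(match.group("learning_iteration")),
    )


def best_checkpoint(names):
    """Pick the name with the lowest char train error among parseable names."""
    parsed = [(parse_checkpoint_name(n), n) for n in names]
    candidates = [(p[1], n) for p, n in parsed if p is not None]
    if not candidates:
        return None
    return min(candidates)[1]


# Example names are made up for illustration.
names = [
    "mymodel_9.379_61235.checkpoint",
    "mymodel_1.882_14615.checkpoint",
    "mymodel_checkpoint",
]
print(best_checkpoint(names))
```

Keep in mind that the rate in the name is a training error, so the lowest value is not necessarily the best model on unseen data.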
>>
>> ------
>>
>> But there are still a lot of things I do not understand, and one of them 
>> is actually causing me an issue: even after a lot of iterations (475k) I 
>> still do not see any log message with the error on the evaluation set.
>> At iteration 61235/475300/475526, Mean rms=0.521%, delta=2.073%, char
>>  train=9.379%, word train=9.669%, skip ratio=0.1%,  New worst char error 
>> = 9.379 wrote checkpoint.
>>
>>
>>
>> On Friday, June 28, 2019 at 17:39:52 UTC+2, shree wrote:
>>>
>>> Your best source for documentation is the source code. See
>>>
>>> https://github.com/tesseract-ocr/tesseract/blob/f522b039a52ae0094fb928ac60a66c4ae0f6c5b9/src/training/lstmtrainer.cpp#L371
>>>
>>> https://github.com/tesseract-ocr/tesseract/blob/f522b039a52ae0094fb928ac60a66c4ae0f6c5b9/src/training/lstmtrainer.cpp#L382
>>>
>>> On Fri, Jun 28, 2019 at 8:47 PM Arno Loo <arno....@gmail.com> wrote:
>>>
>>>> I am continuing to experiment and trying to understand what seems 
>>>> important, and I have a few questions after searching Tesseract's wiki.
>>>>
>>>> During the training we can see this kind of information :
>>>> At iteration 100/100/100, Mean rms=4.514%, delta=19.089%, char train=
>>>> 96.314%, word train=100%, skip ratio=0%,  New best char error = 96.314 
>>>> wrote checkpoint.
>>>>
>>>> - *100/100/100 :* What do these 3 numbers at the beginning mean when 
>>>> they are different? They often are, unlike in my example.
>>>> - *Mean rms* I know well, it's the Root Mean Square error. But which 
>>>> error metric is used? Usually it is some kind of distance; the 
>>>> Levenshtein distance is often appropriate for OCR tasks, but the "%" 
>>>> wouldn't be there if it were.
>>>> - *delta* I don't know.
>>>> - *char train* must be the percentage of wrong character predictions 
>>>> during *training*.
>>>> - *word train* must be the percentage of wrong word predictions during 
>>>> *training*.
>>>> - *skip ratio* is, I think, the percentage of samples skipped for any 
>>>> reason (invalid data or something).
>>>>
>>>> Can anyone help me understand them, please?
>>>>
>>>> Also, I do not see any evaluation error during the training, which 
>>>> would be really helpful to avoid overfitting. The only way I know to 
>>>> follow the *evaluation* error during the training is to run lstmeval 
>>>> on each checkpoint, but I think there must be a better way. Otherwise 
>>>> the *--eval_listfile* argument would be useless in lstmtraining, but I 
>>>> can't find out how it is used.
>>>>
>>>> Thank you :)
>>>>
>>>> On Thursday, June 27, 2019 at 19:17:46 UTC+2, shree wrote:
>>>>>
>>>>> See 
>>>>> https://github.com/tesseract-ocr/tesseract/blob/master/doc/lstmeval.1.asc
>>>>>
>>>>> When using a checkpoint, you also need to pass the starter traineddata 
>>>>> file that was used for training.
>>>>>
>>>>> Or give the final traineddata file as the model.
>>>>>
>>>>> So, if after training you have converted the checkpoint to a 
>>>>> traineddata, you can use that as the model. Similarly for the original 
>>>>> traineddata.
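
The two ways of calling lstmeval described here can be sketched as a small command builder. It uses the `--model`, `--traineddata` and `--eval_listfile` options from the lstmeval man page linked above; the function names and all file paths are hypothetical, and the script only prints commands rather than running them.

```python
import shlex
from pathlib import Path


def lstmeval_command(model, eval_listfile, traineddata=None):
    """Build an lstmeval argument list.

    Pass `traineddata` (the starter traineddata) when `model` is a checkpoint;
    leave it out when `model` is itself a converted traineddata file.
    """
    cmd = ["lstmeval", "--model", str(model), "--eval_listfile", str(eval_listfile)]
    if traineddata is not None:
        cmd += ["--traineddata", str(traineddata)]
    return cmd


def commands_for_checkpoints(checkpoint_dir, eval_listfile, starter_traineddata):
    """One lstmeval command per *.checkpoint file found in checkpoint_dir."""
    checkpoints = sorted(Path(checkpoint_dir).glob("*.checkpoint"))
    return [
        lstmeval_command(path, eval_listfile, starter_traineddata)
        for path in checkpoints
    ]


# Hypothetical paths, for illustration only.
print(shlex.join(lstmeval_command(
    "out/mymodel_checkpoint", "eval/list.txt", "starter/eng.traineddata")))
print(shlex.join(lstmeval_command("out/mymodel.traineddata", "eval/list.txt")))
```

Each returned list can be handed to `subprocess.run` to evaluate a series of checkpoints against the same evaluation list, which is the manual workaround discussed in this thread for following the evaluation error.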
>>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to tesser...@googlegroups.com.
>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/19f392d5-6d77-4830-93ff-c446d06df6fa%40googlegroups.com
>>>>  
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/19f392d5-6d77-4830-93ff-c446d06df6fa%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>>
>>> -- 
>>>
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>

