Re: [tesseract-ocr] understanding lstmeval and use it on pretrained models for comparison

Arno Loo Fri, 19 Jul 2019 07:06:41 -0700

I read through the source code as well as I could (Tesseract 
4.0.0-beta.3). I did not find all the answers, but I did find some.
At iteration 14615/695400/698614, Mean rms=0.158%, delta=0.295%, char train=
1.882%, word train=2.285%, skip ratio=0.4%,  wrote checkpoint.


In the above example,
14615 : learning_iteration
695400 : training_iteration
698614 : sample_iteration
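
For anyone who wants to pull these three counters out of a training log, here is a minimal sketch. The regular expression and the function name parse_progress_line are my own assumptions, modelled only on the sample line above, not taken from the Tesseract source.

```python
import re

# Pattern for the progress line quoted above. Field names follow the
# description in this message; the regex is an assumption based on that
# single sample line, not on the trainer's source code.
PROGRESS_RE = re.compile(
    r"At iteration (\d+)/(\d+)/(\d+), "
    r"Mean rms=([\d.]+)%, delta=([\d.]+)%, "
    r"char train=\s*([\d.]+)%, word train=\s*([\d.]+)%, "
    r"skip ratio=([\d.]+)%"
)


def parse_progress_line(line):
    """Return the fields of one progress line as a dict, or None if no match."""
    flat = " ".join(line.split())  # undo line wrapping introduced by mail clients
    match = PROGRESS_RE.search(flat)
    if match is None:
        return None
    return {
        "learning_iteration": int(match.group(1)),
        "training_iteration": int(match.group(2)),
        "sample_iteration": int(match.group(3)),
        "mean_rms": float(match.group(4)),
        "delta": float(match.group(5)),
        "char_train": float(match.group(6)),
        "word_train": float(match.group(7)),
        "skip_ratio": float(match.group(8)),
    }


sample = (
    "At iteration 14615/695400/698614, Mean rms=0.158%, delta=0.295%, char train=\n"
    "1.882%, word train=2.285%, skip ratio=0.4%,  wrote checkpoint."
)
fields = parse_progress_line(sample)
print(fields["learning_iteration"], fields["training_iteration"], fields["sample_iteration"])
```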

*sample_iteration* : "Index into training sample set. (sample_iteration >= 
training_iteration)." It is how many times a training file has been passed 
into the learning process
*training_iteration* : "Number of actual backward training steps used." It 
is how many times a training file has been SUCCESSFULLY passed into the 
learning process

So every time you get an error such as "Image too large to learn!!", 
"Encoding of string failed!" or "Deserialize header failed", the 
sample_iteration increments but the training_iteration does not.
Here, 1 - (695400 / 698614) ≈ 0.46%, which is close to the reported 
*skip ratio* of 0.4% : the proportion of files that have been skipped 
because of an error
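
The skip arithmetic can be checked with a few lines of Python. This is only a sketch: it computes the cumulative fraction of samples that did not lead to a training step, as one minus training_iteration divided by sample_iteration. The function name skipped_fraction is hypothetical, and I am assuming the cumulative ratio is a fair approximation; the trainer may compute its reported skip ratio over a more recent window, so the two numbers need not match exactly.

```python
def skipped_fraction(training_iteration, sample_iteration):
    """Cumulative fraction of samples that did not produce a training step.

    Assumption: every sample counted in sample_iteration but not in
    training_iteration was skipped because of an error.
    """
    return 1.0 - training_iteration / sample_iteration


# Counters taken from the log line "At iteration 14615/695400/698614".
frac = skipped_fraction(695400, 698614)
print(f"{frac:.2%}")  # prints 0.46%
```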

*learning_iteration* : "Number of iterations that yielded a non-zero delta 
error and thus provided significant learning. (learning_iteration <= 
training_iteration). learning_iteration_ is used to measure rate of 
learning progress."
So it uses the *delta* value to assess whether the iteration has been useful.

It is good to know that when you specify a maximum number of iterations 
for the training process, it uses the highest iteration number 
(sample_iteration) to decide when to stop. But when it writes a checkpoint, 
the checkpoint name uses the smallest iteration number 
(learning_iteration), along with the *char train* rate. So a checkpoint 
name is the concatenation of model_name, char_train and learning_iteration

------

But there are still a lot of things I do not understand. And one of them is 
actually causing me an issue : even with a lot of iterations (475k) I still 
do not see any log message with the error on the evaluation set.
At iteration 61235/475300/475526, Mean rms=0.521%, delta=2.073%, char train=
9.379%, word train=9.669%, skip ratio=0.1%,  New worst char error = 9.379 
wrote checkpoint.
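
In the meantime, the workaround from the quoted message below (running lstmeval on each checkpoint) can be scripted. A minimal sketch, assuming lstmeval is on the PATH and that checkpoints end in ".checkpoint". The --model, --traineddata and --eval_listfile flags are the ones described in the lstmeval man page linked below; the function names and all paths are hypothetical.

```python
import glob
import subprocess


def build_lstmeval_cmd(checkpoint, traineddata, eval_listfile):
    """Build the lstmeval command line for one checkpoint.

    A checkpoint needs the starter traineddata that was used for training,
    hence the --traineddata argument.
    """
    return [
        "lstmeval",
        "--model", checkpoint,
        "--traineddata", traineddata,
        "--eval_listfile", eval_listfile,
    ]


def evaluate_checkpoints(checkpoint_dir, traineddata, eval_listfile):
    """Run lstmeval on every checkpoint in a directory and print its output."""
    for checkpoint in sorted(glob.glob(f"{checkpoint_dir}/*.checkpoint")):
        cmd = build_lstmeval_cmd(checkpoint, traineddata, eval_listfile)
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(checkpoint)
        # Print both streams, since the report may go to either one.
        print(result.stdout + result.stderr)
```

It would be called as, for example, evaluate_checkpoints("output", "eng/eng.traineddata", "eng.eval_files.txt"), where all three paths are placeholders for your own files.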



On Friday, 28 June 2019 17:17:30 UTC+2, Arno Loo wrote:
>
> I am continuing to experiment and trying to understand what seems 
> important, and I have a few questions after searching Tesseract's wiki
>
> During the training we can see this kind of information :
> At iteration 100/100/100, Mean rms=4.514%, delta=19.089%, char train=
> 96.314%, word train=100%, skip ratio=0%,  New best char error = 96.314 
> wrote checkpoint.
>
> - *100/100/100 :* What do these 3 numbers at the beginning mean when they 
> are different ? Which they often are, unlike in my example.
> - *Mean rms* I know well, it's the Root Mean Square error. But what error 
> metric is used ? Usually it is some kind of distance, the Levenshtein 
> distance is often appropriate for OCR tasks but the "%" wouldn't be there 
> if it was.
> - *delta* I don't know
> - *char train *must be the percentage of wrong character predictions 
> during the *training*
> - *word train *must be the percentage of wrong word predictions during 
> the *training*
> - *skip ratio* is, I think, the percentage of samples skipped for any 
> reason (invalid data or something)
>
> Can anyone help me understand them, please ?
>
> Also, I do not see any error on evaluation during the training. Which 
> would be really helpful to avoid overfitting. The only way I would know how 
> to follow the *evaluation* error during the training would be to try a 
> lstmeval on each checkpoint, but I think there must be a better way ? 
> Otherwise the *--eval_listfile *argument would be useless in 
> lstmtraining, but I can't find out how it is used.
>
> Thank you :)
>
> On Thursday, 27 June 2019 19:17:46 UTC+2, shree wrote:
>>
>> See 
>> https://github.com/tesseract-ocr/tesseract/blob/master/doc/lstmeval.1.asc
>>
>> When using checkpoint you need to also use the starter traineddata file 
>> used for training.
>>
>> Or give final traineddata file as model.
>>
>> So, if after training you have converted the checkpoint to a traineddata, 
>> you can use that as model. Similarly for the original traineddata.
>>
>
