Re: [tesseract-ocr] understading lstmeval and use it on pretrained models for comparison

Shree Devi Kumar Fri, 19 Jul 2019 08:44:11 -0700

Very well written. You may want to update the wiki pages with the info too.


On Fri, Jul 19, 2019 at 7:45 PM Arno Loo <arno.laf...@gmail.com> wrote:

> I went through the source code as well as I could and, although I did
> not find all the answers, I did find some (for Tesseract
> 4.0.0-beta.3).
> At iteration 14615/695400/698614, Mean rms=0.158%, delta=0.295%, char
>  train=1.882%, word train=2.285%, skip ratio=0.4%,  wrote checkpoint.
>
> In the above example,
> 14615 : learning_iteration
> 695400 : training_iteration
> 698614 : sample_iteration
>
> *sample_iteration* : "Index into training sample set. (sample_iteration
> >= training_iteration)." It is how many times a training file has been
> passed into the learning process
> *training_iteration* : "Number of actual backward training steps used."
> It is how many times a training file has been SUCCESSFULLY passed into the
> learning process
>
> So every time you get an error such as "Image too large to learn!!",
> "Encoding of string failed!" or "Deserialize header failed", the
> sample_iteration increments but the training_iteration does not.
> In this example, 1 - (695400 / 698614) ≈ 0.46%, shown as 0.4% in the log.
> This is the *skip ratio*: the proportion of files that have been skipped
> because of an error.
>
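
The arithmetic above can be checked with a short script. This is only a sketch: the regular expression and the function names parse_iterations and skip_ratio are my own illustration, not part of Tesseract, and the log line is the one quoted above.

```python
import re

# Matches the "At iteration a/b/c" prefix of an lstmtraining log line.
ITERATION_RE = re.compile(r"At iteration (\d+)/(\d+)/(\d+)")


def parse_iterations(log_line):
    """Return (learning, training, sample) iteration counts from a log line."""
    match = ITERATION_RE.search(log_line)
    if match is None:
        raise ValueError("no 'At iteration a/b/c' prefix found")
    learning, training, sample = (int(group) for group in match.groups())
    return learning, training, sample


def skip_ratio(training, sample):
    """Fraction of samples that never reached a backward training step."""
    if sample == 0:
        return 0.0
    return 1.0 - training / sample


line = ("At iteration 14615/695400/698614, Mean rms=0.158%, delta=0.295%, "
        "char train=1.882%, word train=2.285%, skip ratio=0.4%,  "
        "wrote checkpoint.")
learning, training, sample = parse_iterations(line)
print(f"skip ratio = {100 * skip_ratio(training, sample):.2f}%")  # 0.46%
```
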
> *learning_iteration* : "Number of iterations that yielded a non-zero
> delta error and thus provided significant learning. (learning_iteration <=
> training_iteration). learning_iteration_ is used to measure rate of
> learning progress."
> So it uses the *delta* value to assess whether the iteration has been useful.
>
> It is good to know that when you specify a maximum number of
> iterations for the training process, it uses the middle iteration number
> (training_iteration) to decide when to stop. But when it writes a checkpoint,
> the checkpoint name uses the smallest iteration number
> (learning_iteration), along with the *char train* rate. So a checkpoint
> name is the concatenation of model_name, char_train and learning_iteration.
>
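
To make the relationship between the three counters concrete, here is a toy model of the description above. It is not Tesseract's implementation: run_counters and its inputs are hypothetical, and each sample is reduced to a pair saying whether it loaded without error and what delta it produced.

```python
def run_counters(samples, max_iterations):
    """Simulate the three counters for a sequence of training samples.

    samples: iterable of (loaded_ok, delta) pairs, one per sample.
    max_iterations: stop once this many backward steps have been taken.
    """
    learning = training = sample = 0
    for loaded_ok, delta in samples:
        if training >= max_iterations:
            break  # the stop criterion looks at training_iteration
        sample += 1  # every sample pulled from the set is counted
        if not loaded_ok:
            continue  # skipped, e.g. "Image too large to learn!!"
        training += 1  # an actual backward training step
        if delta > 0:
            learning += 1  # non-zero delta: the step changed something
    return learning, training, sample


samples = [(True, 0.3), (False, 0.0), (True, 0.0), (True, 0.1), (True, 0.2)]
print(run_counters(samples, max_iterations=3))  # (2, 3, 4)
```

The result always satisfies learning <= training <= sample, which matches the ordering of the three numbers in the log line.
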
> ------
>
> But there are still a lot of things I do not understand, and one of them
> is actually causing me an issue: even with a lot of iterations (475k) I
> still do not see any log message with the error on the evaluation set.
> At iteration 61235/475300/475526, Mean rms=0.521%, delta=2.073%, char
>  train=9.379%, word train=9.669%, skip ratio=0.1%,  New worst char error =
>  9.379 wrote checkpoint.
>
>
>
> On Friday, 28 June 2019 at 17:39:52 UTC+2, shree wrote:
>>
>> Your best source for documentation is the source code. See
>>
>>
>> https://github.com/tesseract-ocr/tesseract/blob/f522b039a52ae0094fb928ac60a66c4ae0f6c5b9/src/training/lstmtrainer.cpp#L371
>>
>>
>>
>> https://github.com/tesseract-ocr/tesseract/blob/f522b039a52ae0094fb928ac60a66c4ae0f6c5b9/src/training/lstmtrainer.cpp#L382
>>
>>
>> On Fri, Jun 28, 2019 at 8:47 PM Arno Loo <arno....@gmail.com> wrote:
>>
>>> I am continuing to experiment and trying to understand what seems
>>> important, and I have a few questions after searching Tesseract's wiki.
>>>
>>> During training we see this kind of information:
>>> At iteration 100/100/100, Mean rms=4.514%, delta=19.089%, char train=
>>> 96.314%, word train=100%, skip ratio=0%,  New best char error = 96.314
>>> wrote checkpoint.
>>>
>>> - *100/100/100:* What do these 3 numbers at the beginning mean when they
>>> are different? They often are, unlike in my example.
>>> - *Mean rms:* I know this is the Root Mean Square error. But which
>>> error metric is used? Usually it is some kind of distance; the Levenshtein
>>> distance is often appropriate for OCR tasks, but the "%" wouldn't be there
>>> if it were.
>>> - *delta:* I don't know.
>>> - *char train* must be the percentage of wrong character predictions
>>> during *training*.
>>> - *word train* must be the percentage of wrong word predictions during
>>> *training*.
>>> - *skip ratio* is, I think, the percentage of samples skipped for any
>>> reason (invalid data or something).
>>>
>>> Can anyone help me understand them, please?
>>>
>>> Also, I do not see any evaluation error during training, which
>>> would be really helpful to avoid overfitting. The only way I know
>>> to follow the *evaluation* error during training would be to run
>>> lstmeval on each checkpoint, but I think there must be a better way.
>>> Otherwise the *--eval_listfile* argument would be useless in
>>> lstmtraining, but I can't find out how it is used.
>>>
>>> Thank you :)
>>>
>>> On Thursday, 27 June 2019 at 19:17:46 UTC+2, shree wrote:
>>>>
>>>> See
>>>> https://github.com/tesseract-ocr/tesseract/blob/master/doc/lstmeval.1.asc
>>>>
>>>> When using a checkpoint, you also need to use the starter traineddata file
>>>> that was used for training.
>>>>
>>>> Or give the final traineddata file as the model.
>>>>
>>>> So, if after training you have converted the checkpoint to a traineddata,
>>>> you can use that as the model. Similarly for the original traineddata.
>>>>
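
The two cases above (a checkpoint plus the starter traineddata, or a traineddata file used directly as the model) could be scripted roughly as follows. This is a sketch only: the helper lstmeval_command and all file paths are hypothetical, and it relies on the --model, --traineddata and --eval_listfile options described in the lstmeval man page linked above.

```python
import shlex


def lstmeval_command(model, eval_listfile, traineddata=None):
    """Build an lstmeval argument list.

    model: a training checkpoint or a traineddata file.
    traineddata: the starter traineddata, needed when model is a checkpoint.
    """
    cmd = ["lstmeval", "--model", model, "--eval_listfile", eval_listfile]
    if traineddata is not None:
        cmd += ["--traineddata", traineddata]
    return cmd


# Hypothetical paths, for illustration only.
checkpoint_cmd = lstmeval_command(
    "output/base_checkpoint", "eval/eng.training_files.txt",
    traineddata="train/eng/eng.traineddata")
final_cmd = lstmeval_command(
    "output/eng_final.traineddata", "eval/eng.training_files.txt")

for cmd in (checkpoint_cmd, final_cmd):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) to run the evaluation. Building the same command for every checkpoint and for a pretrained traineddata file gives comparable error rates on one evaluation list.
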
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesser...@googlegroups.com.
>>> To post to this group, send email to tesser...@googlegroups.com.
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/19f392d5-6d77-4830-93ff-c446d06df6fa%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/19f392d5-6d77-4830-93ff-c446d06df6fa%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>
>> --
>>
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>

