I went and tried to understand the source code as well as I could (for tesseract 4.0.0-beta.3), and although I did not find all the answers, I did find some. Take this log line as an example:

At iteration 14615/695400/698614, Mean rms=0.158%, delta=0.295%, char train=1.882%, word train=2.285%, skip ratio=0.4%, wrote checkpoint.
In the above example:

    14615  : learning_iteration
    695400 : training_iteration
    698614 : sample_iteration

*sample_iteration* : "Index into training sample set. (sample_iteration >= training_iteration)." It is how many times a training file has been passed into the learning process.

*training_iteration* : "Number of actual backward training steps used." It is how many times a training file has been SUCCESSFULLY passed into the learning process.

So every time you get an error ("Image too large to learn!!", "Encoding of string failed!", "Deserialize header failed"), the sample_iteration increments but the training_iteration does not. In fact 1 - (695400 / 698614) ≈ 0.46%, which corresponds to the *skip ratio* : the proportion of files that have been skipped because of an error.

*learning_iteration* : "Number of iterations that yielded a non-zero delta error and thus provided significant learning. (learning_iteration <= training_iteration). learning_iteration_ is used to measure rate of learning progress." So it uses the *delta* value to assess whether an iteration has been useful.

What is good to know is that when you specify a maximum number of iterations for the training process, it uses the middle number (training_iteration) to decide when to stop. But when it writes a checkpoint, the checkpoint name uses the smallest number (learning_iteration), along with the *char train* rate. So a checkpoint name is the concatenation of model_name, char_train and learning_iteration (see the sketch just below).
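To make these relationships concrete, here is a tiny Python sketch. It is only my own illustration, not code from Tesseract; the exact checkpoint file name format (and its separators) is my reading of the source, so treat it as an assumption, and "mymodel" is just a placeholder for whatever you pass as the model output base name:

    # Counters from the log line quoted at the top (tesseract 4.0.0-beta.3).
    learning_iteration = 14615   # steps with a non-zero delta ("useful" steps)
    training_iteration = 695400  # successful backward training steps
    sample_iteration = 698614    # samples pulled from the training set

    # Samples that were read but never trained on were skipped due to errors.
    skip_ratio = 1.0 - training_iteration / sample_iteration
    print(f"skip ratio = {skip_ratio:.2%}")  # ~0.46%, shown as 0.4% in the log

    # Checkpoint file name as I understand it: model_name + char_train error
    # + "_" + learning_iteration (exact separators are my assumption).
    model_name = "mymodel"  # placeholder for the model output base name
    char_train = 1.882
    print(f"{model_name}{char_train}_{learning_iteration}.checkpoint")
    # -> mymodel1.882_14615.checkpoint

If I read this correctly, that is also why the number in the checkpoint file names never matches the maximum iteration count you asked for: stopping is decided on training_iteration, while the file name carries learning_iteration.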
------

But there are still a lot of things I do not understand, and one of them is actually causing me an issue: even with a lot of iterations (475k) I still do not see any log message with the error on the evaluation set.

At iteration 61235/475300/475526, Mean rms=0.521%, delta=2.073%, char train=9.379%, word train=9.669%, skip ratio=0.1%, New worst char error = 9.379 wrote checkpoint.

On Friday, June 28, 2019 at 17:39:52 UTC+2, shree wrote:
>
> Your best source for documentation is the source code. See
>
> https://github.com/tesseract-ocr/tesseract/blob/f522b039a52ae0094fb928ac60a66c4ae0f6c5b9/src/training/lstmtrainer.cpp#L371
>
> https://github.com/tesseract-ocr/tesseract/blob/f522b039a52ae0094fb928ac60a66c4ae0f6c5b9/src/training/lstmtrainer.cpp#L382
>
> On Fri, Jun 28, 2019 at 8:47 PM Arno Loo <arno....@gmail.com> wrote:
>
>> I continue to make experiments, trying to understand what seems
>> important, and I have a few questions after researching Tesseract's wiki.
>>
>> During the training we can see this kind of information:
>>
>> At iteration 100/100/100, Mean rms=4.514%, delta=19.089%, char train=96.314%, word train=100%, skip ratio=0%, New best char error = 96.314 wrote checkpoint.
>>
>> - *100/100/100:* What do these 3 numbers at the beginning mean when they
>> are different? Which they often are, unlike in my example.
>> - *Mean rms* I know well, it's the Root Mean Square error. But what
>> error metric is used? Usually it is some kind of distance; the Levenshtein
>> distance is often appropriate for OCR tasks, but the "%" wouldn't be there
>> if it were.
>> - *delta* I don't know.
>> - *char train* must be the percentage of wrong character predictions
>> during the *training*.
>> - *word train* must be the percentage of wrong word predictions during
>> the *training*.
>> - *skip ratio* is, I think, the percentage of samples skipped for any
>> reason (invalid data or something).
>>
>> Can anyone help me understand them, please?
>>
>> Also, I do not see any error on evaluation during the training, which
>> would be really helpful to avoid overfitting. The only way I would know how
>> to follow the *evaluation* error during the training would be to run
>> lstmeval on each checkpoint, but I think there must be a better way?
>> Otherwise the *--eval_listfile* argument would be useless in
>> lstmtraining, but I can't find out how it is used.
>>
>> Thank you :)
>>
>> On Thursday, June 27, 2019 at 19:17:46 UTC+2, shree wrote:
>>>
>>> See
>>> https://github.com/tesseract-ocr/tesseract/blob/master/doc/lstmeval.1.asc
>>>
>>> When using a checkpoint you need to also use the starter traineddata file
>>> used for training.
>>>
>>> Or give the final traineddata file as the model.
>>>
>>> So, if after training you have converted the checkpoint to a traineddata,
>>> you can use that as the model. Similarly for the original traineddata.
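P.S. In case it helps someone, this is roughly how I script lstmeval over the checkpoints in the meantime, following shree's advice above. The paths and model name are placeholders from my setup, and I am only using the flags documented in lstmeval.1.asc (--model, --traineddata, --eval_listfile):

    import glob
    import subprocess

    # Placeholders for my setup; adjust to yours.
    starter_traineddata = "data/mymodel/mymodel.traineddata"  # starter traineddata used for training
    eval_listfile = "data/list.eval"                          # same file as --eval_listfile in lstmtraining

    # Run lstmeval on every "best" checkpoint written so far.
    for checkpoint in sorted(glob.glob("output/mymodel*.checkpoint")):
        print("evaluating", checkpoint)
        subprocess.run(
            ["lstmeval",
             "--model", checkpoint,
             "--traineddata", starter_traineddata,
             "--eval_listfile", eval_listfile],
            check=True,
        )

It works, but it is a poor substitute for seeing the evaluation error in the training log itself, which is the part I still do not understand.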