Very well written. You may want to update the wiki pages with the info too.
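
For anyone turning the explanation quoted below into a wiki entry, here is a minimal Python sketch of how the three counters in the "At iteration a/b/c" prefix relate to each other. The names `IterationCounts`, `parse_iterations` and `cumulative_skip_fraction` are made up for illustration and are not part of Tesseract. The skip computation simply follows the formula given in the quoted message, and it is an assumption that the logged skip ratio is derived the same way.

```python
import re
from typing import NamedTuple


class IterationCounts(NamedTuple):
    learning: int  # iterations that yielded a non-zero delta error
    training: int  # backward training steps actually performed
    sample: int    # samples drawn so far, including skipped ones


# Matches the "At iteration a/b/c" prefix of the log lines quoted in this thread.
_ITER_RE = re.compile(r"At iteration (\d+)/(\d+)/(\d+)")


def parse_iterations(log_line: str) -> IterationCounts:
    """Extract learning/training/sample iteration counts from one log line."""
    match = _ITER_RE.search(log_line)
    if match is None:
        raise ValueError("no 'At iteration a/b/c' prefix found")
    return IterationCounts(*(int(group) for group in match.groups()))


def cumulative_skip_fraction(counts: IterationCounts) -> float:
    """Fraction of drawn samples that did not result in a training step.

    This is 1 - training/sample over the whole run, as in the quoted message.
    """
    if counts.sample == 0:
        return 0.0
    return 1.0 - counts.training / counts.sample


line = (
    "At iteration 14615/695400/698614, Mean rms=0.158%, delta=0.295%, "
    "char train=1.882%, word train=2.285%, skip ratio=0.4%, wrote checkpoint."
)
counts = parse_iterations(line)
print(counts)
print(f"cumulative skip fraction: {cumulative_skip_fraction(counts):.2%}")
```

For the quoted log line this gives roughly 0.46%, close to the logged 0.4%. The small difference may come from rounding or from the log using a different averaging window; I have not checked which.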
On Fri, Jul 19, 2019 at 7:45 PM Arno Loo <arno.laf...@gmail.com> wrote:

> I went and tried to understand the source code as well as I could, and
> although I did not find all the answers, I did find some (for tesseract
> 4.0.0-beta.3).
>
> At iteration 14615/695400/698614, Mean rms=0.158%, delta=0.295%, char train=1.882%, word train=2.285%, skip ratio=0.4%, wrote checkpoint.
>
> In the above example:
> 14615 : learning_iteration
> 695400 : training_iteration
> 698614 : sample_iteration
>
> *sample_iteration* : "Index into training sample set. (sample_iteration >= training_iteration)."
> It is how many times a training file has been passed into the learning
> process.
>
> *training_iteration* : "Number of actual backward training steps used."
> It is how many times a training file has been SUCCESSFULLY passed into
> the learning process.
>
> So every time you get an error such as "Image too large to learn!!",
> "Encoding of string failed!" or "Deserialize header failed", the
> sample_iteration increments but the training_iteration does not.
> Here, 1 - (695400 / 698614) is about 0.46%, which is roughly the
> *skip ratio* (logged as 0.4%): the proportion of files that have been
> skipped because of an error.
>
> *learning_iteration* : "Number of iterations that yielded a non-zero
> delta error and thus provided significant learning. (learning_iteration <=
> training_iteration). learning_iteration_ is used to measure rate of
> learning progress."
> So it uses the *delta* value to assess whether the iteration has been
> useful.
>
> It is good to know that when you specify a maximum number of iterations
> for the training process, it uses the middle iteration number
> (training_iteration) to know when to stop. But when it writes a
> checkpoint, the checkpoint name uses the smallest iteration number
> (learning_iteration), along with the *char train* rate. So a checkpoint
> name is the concatenation of model_name & char_train & learning_iteration.
>
> ------
>
> But there are still a lot of things I do not understand.
> And one of them is actually causing me an issue: even with a lot of
> iterations (475k) I still do not see any log message with the error on
> the evaluation set.
>
> At iteration 61235/475300/475526, Mean rms=0.521%, delta=2.073%, char train=9.379%, word train=9.669%, skip ratio=0.1%, New worst char error = 9.379 wrote checkpoint.
>
> On Friday, 28 June 2019 at 17:39:52 UTC+2, shree wrote:
>>
>> Your best source for documentation is the source code. See
>>
>> https://github.com/tesseract-ocr/tesseract/blob/f522b039a52ae0094fb928ac60a66c4ae0f6c5b9/src/training/lstmtrainer.cpp#L371
>>
>> https://github.com/tesseract-ocr/tesseract/blob/f522b039a52ae0094fb928ac60a66c4ae0f6c5b9/src/training/lstmtrainer.cpp#L382
>>
>> On Fri, Jun 28, 2019 at 8:47 PM Arno Loo <arno....@gmail.com> wrote:
>>
>>> I am continuing to experiment and trying to understand what seems
>>> important, and I have a few questions after searching Tesseract's wiki.
>>>
>>> During the training we can see this kind of information:
>>> At iteration 100/100/100, Mean rms=4.514%, delta=19.089%, char train=96.314%, word train=100%, skip ratio=0%, New best char error = 96.314 wrote checkpoint.
>>>
>>> - *100/100/100:* What do these 3 numbers at the beginning mean when
>>>   they are different? They often are, unlike in my example.
>>> - *Mean rms* I know well, it's the Root Mean Square error. But which
>>>   error metric is used? Usually it is some kind of distance. The
>>>   Levenshtein distance is often appropriate for OCR tasks, but the "%"
>>>   wouldn't be there if it were.
>>> - *delta* I don't know.
>>> - *char train* must be the percentage of wrong character predictions
>>>   during the *training*.
>>> - *word train* must be the percentage of wrong word predictions during
>>>   the *training*.
>>> - *skip ratio* is, I think, the percentage of samples skipped for any
>>>   reason (invalid data or something).
>>>
>>> Can anyone help me understand them, please?
>>>
>>> Also, I do not see any error on evaluation during the training, which
>>> would be really helpful to avoid overfitting. The only way I know to
>>> follow the *evaluation* error during the training would be to run
>>> lstmeval on each checkpoint, but I think there must be a better way?
>>> Otherwise the *--eval_listfile* argument would be useless in
>>> lstmtraining, but I can't find out how it is used.
>>>
>>> Thank you :)
>>>
>>> On Thursday, 27 June 2019 at 19:17:46 UTC+2, shree wrote:
>>>>
>>>> See
>>>> https://github.com/tesseract-ocr/tesseract/blob/master/doc/lstmeval.1.asc
>>>>
>>>> When using a checkpoint you also need to use the starter traineddata
>>>> file used for training.
>>>>
>>>> Or give the final traineddata file as the model.
>>>>
>>>> So, if after training you have converted the checkpoint to a
>>>> traineddata, you can use that as the model. Similarly for the original
>>>> traineddata.
>>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesser...@googlegroups.com.
>>> To post to this group, send email to tesser...@googlegroups.com.
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/19f392d5-6d77-4830-93ff-c446d06df6fa%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/19f392d5-6d77-4830-93ff-c446d06df6fa%40googlegroups.com?utm_medium=email&utm_source=footer>.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>> --
>>
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
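
The workaround mentioned in the oldest message above (running lstmeval on each checkpoint), combined with shree's note that a checkpoint needs the starter traineddata file, could look roughly like the sketch below. It only builds and prints the command lines and does not run anything. The flags `--model`, `--traineddata` and `--eval_listfile` are the ones listed in the lstmeval man page linked above. The helper name `lstmeval_command` and all file names are hypothetical placeholders.

```python
import shlex
from typing import List


def lstmeval_command(checkpoint: str, starter_traineddata: str,
                     eval_listfile: str) -> List[str]:
    """Build an lstmeval invocation for one training checkpoint.

    A checkpoint is not self-contained, so the starter traineddata that was
    given to the trainer is passed alongside it.
    """
    return [
        "lstmeval",
        "--model", checkpoint,
        "--traineddata", starter_traineddata,
        "--eval_listfile", eval_listfile,
    ]


# Hypothetical file names, for illustration only.
checkpoints = ["mymodel_a.checkpoint", "mymodel_b.checkpoint"]
for checkpoint in checkpoints:
    command = lstmeval_command(checkpoint, "starter.traineddata", "eval.list.txt")
    # Print a shell-safe command line; pass `command` to subprocess.run to execute it.
    print(" ".join(shlex.quote(part) for part in command))
```

Comparing the error reported for each checkpoint on the evaluation list is one way to spot overfitting by hand until the role of --eval_listfile in lstmtraining is clarified.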