Hi Kristof, good work, I have thought about it a few times myself. I only had a quick look, so here are just a couple of quick notes; I'll try to read it more carefully when I get time.
This thread about the font size is where I got the 30/40px indication:
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/tesseract-ocr/Wdh_JJwnw94/xk2ErJnFBQAJ

For my trainings (fine tuning) I used 48px (with 2px of white border, so the text was about 44px). Maybe the size does not matter much if you do fine tuning, but I never did a precise comparison. Maybe 48 is even better. The white border probably was not important.

One thing to keep in mind is that IMO there is no single correct way to train, because different fonts or different types of images (contrast, noise, etc.) may work best with different parameters. So you need to experiment a little with these if you want optimal results.

This leads to the most important part: "Am I done training?" Without an answer to this you are just wasting time. What I describe in this post is not completely correct due to the way ocrd works (I should discuss this on github to see if it should be fixed or not):
https://groups.google.com/forum/#!msg/tesseract-ocr/COJ4IjcrL6s/C1OeE9bWBgAJ

The basic idea of any machine learning training is this: split the data in two parts, use one for training and use the other to check the result. If you train too much on only a few things you get exceptionally good at those, but you overspecialize and get worse at everything else (this is called overfitting). So you might get 99.999% accuracy on the training set and 74% on the eval set, and the eval set and real world data are what really matter (real world is usually a little worse than eval).

The problem I found is that ocrd recreates the files list.train and list.eval every time you run it (it was not designed for incremental training, I think). So, if you follow my instructions, you'll mix the train and eval files, and this is bad. So I modified the ocrd Makefile to create these two files explicitly at the beginning of the training (and only once).
This is the edit (about line 80):

    # Create lists of lstmf filenames for training and eval
    #lists: $(ALL_LSTMF) data/list.train data/list.eval
    lists: $(ALL_LSTMF)

    train-lists: data/list.train data/list.eval

Now you need to call "make train-lists" only once, when you start a new training session with new data (not after each "iteration step").

Ocrd by default does a 90/10 split (RATIO_TRAIN := 0.90). If you have a moderate amount of data (1,000 to 10,000 samples), do an 80/20 split. If you have a ton of data (100k+ samples), 90/10 or even 95/5 may be fine.

About PSM: I did my training with PSM 6, but for one model (the most complex one, out of 8) I found that using PSM 13 when doing the recognition gives better results for punctuation and other special characters. Again, I do not know how much difference the PSM param makes during training. From what I understand, PSM 6 does some custom cleanup/preprocessing on the images, while PSM 13 leaves them untouched (completely?).

About the parameters you listed in your post: I know the meaning of a few of them, but I think that in general they are not very useful (or you need to understand more before you mess with them). What I mostly look at is the output from lstmeval. Of the values in the training log, "char train" and "word train" are the recognition error rates; these are probably the only ones to look at as a reference (but they refer to the training data, not the eval data). "best char error" is the best value so far; the training is noisy and the error goes up and down. "delta" is probably the variation from the previous output, and "rms" is the root mean square of something. In other words, you do not really need to understand all of them to do the training.

One iteration means one image, so max_iterations should be at least equal to the number of your images. If you have a ton of images you may find that you do not need to process all of them to reach the "saturation" point where extra training is useless, but normally you want to process all of them, even a few times (until the eval score stabilizes or gets worse for a few iterations).
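The two ideas above (create the train/eval lists only once, and keep training until the eval score stops improving) can be sketched in Python as follows. This is only an illustration, not how ocrd does it: the function names, the `seed`, `patience` and `min_delta` parameters, and the assumption that the list files hold one .lstmf path per line are mine.

```python
import random
from pathlib import Path


def write_split_once(lstmf_paths, out_dir, ratio_train=0.90, seed=0):
    """Write list.train and list.eval, but only if they do not exist yet.

    Returns True if a new split was written, False if an existing split
    was left untouched.
    """
    out_dir = Path(out_dir)
    train_file = out_dir / "list.train"
    eval_file = out_dir / "list.eval"
    if train_file.exists() and eval_file.exists():
        # Reuse the existing split so eval files never end up in training.
        return False
    # Sort first so the shuffle depends only on the seed, not on input order.
    paths = sorted(str(p) for p in lstmf_paths)
    random.Random(seed).shuffle(paths)
    n_train = int(len(paths) * ratio_train)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Assumed file format: one path per line.
    train_file.write_text("\n".join(paths[:n_train]) + "\n")
    eval_file.write_text("\n".join(paths[n_train:]) + "\n")
    return True


def should_stop(eval_errors, patience=3, min_delta=0.0):
    """Return True when the last `patience` eval errors did not improve
    on the best error seen before them by more than `min_delta`."""
    if len(eval_errors) <= patience:
        return False  # not enough history yet
    best_before = min(eval_errors[:-patience])
    recent_best = min(eval_errors[-patience:])
    return recent_best >= best_before - min_delta
```

The idea would be to call `write_split_once` once per new data set and to append the eval error reported by lstmeval to a list after each training step, stopping when `should_stop` returns True. The right `patience` depends on how noisy your training is, so treat the default as a starting point.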
One note: if you repeat the whole training multiple times (for example to try different image sizes), you need to keep the list.train/list.eval files aside and reuse them; otherwise you compare against a different set of eval images (and with a small data set this can make a big difference).

Another note: while you fine tune (specialize) on new font(s) you get a little worse on all the others. If you care about other fonts too, you should check them with lstmeval as well.

Bye

Lorenzo

On Thu, 7 Feb 2019 at 09:36, Kristóf Horváth <vazzzeg...@gmail.com> wrote:

> Hi, I set out to make a newbie-friendly guide and I already have some
> stuff that might help people, but it's not complete yet. I would
> like people to read it and, where they can, help out with comments. I left
> places empty or left notes of my own; please feel free to figure out what
> should be there. I really hope I didn't make big mistakes, but in case I did
> write something stupid, please share it in the form of constructive criticism.
> The following things are very unclear to me (in terms of what they
> exactly represent):
>
> - radical-stroke.txt
> - learning_rate
> - noextract_font_properties
> - 2 percent improvement
> - time=
> - best error was 100 @0
> - iteration 31/100/100
> - rms=
> - delta=
> - char train=
> - word train=
> - skip ratio=
> - best char error=
>
> And finally here is the link
> <https://docs.google.com/document/d/1qDqbnlptcCPVIvMOHwfNws-CQat-llZLOTHC6S94Vec/edit?usp=sharing>.
> (The Google doc should be in English. I'm writing a wiki, so formatting is based
> on wiki syntax; with the link you should be able to make comments.)
> In case you are really enthusiastic about it, you can contact me for write
> rights.
>
> Best Regards
> Kristof Horvath