Thank you for your reply, Shreeshrii! Fine-tuning is indeed a much better solution for my problem. Thanks to the logs and data in your repo, I realized that I don't need to generate every single MRZ code separately (I'm sure it was mentioned somewhere). In fact, making the TIFFs with box files and then the lstmf files was oddly slow; loading lines in the form of pages takes much less time. With merged data it is now just a matter of seconds. I don't know whether it affected accuracy, but I now generate all the codes in one .txt file and then process that.
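As a rough illustration of the "all codes in one .txt file" idea (a sketch only, not the script actually used here), something like the following could write many synthetic MRZ-style lines into a single training text file. The file name `mrz.training_text`, the helper names, and the use of purely random characters are assumptions made for the example; real MRZ lines follow a fixed field layout with check digits, which this sketch does not model.

```python
import random
import string

# MRZ alphabet: uppercase letters, digits and the '<' filler character.
MRZ_ALPHABET = string.ascii_uppercase + string.digits + "<"
# Assumption: passport-style (TD3) MRZ lines, which are 44 characters wide.
LINE_LENGTH = 44


def random_mrz_line(rng, length=LINE_LENGTH):
    """Return one random line drawn from the MRZ alphabet (hypothetical helper)."""
    return "".join(rng.choice(MRZ_ALPHABET) for _ in range(length))


def write_training_text(path, n_lines, seed=0):
    """Write n_lines synthetic lines to a single text file and return them."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    lines = [random_mrz_line(rng) for _ in range(n_lines)]
    with open(path, "w", encoding="ascii") as fh:
        fh.write("\n".join(lines) + "\n")
    return lines


if __name__ == "__main__":
    # Hypothetical output name; use whatever your training pipeline expects.
    write_training_text("mrz.training_text", 1000)
```

The resulting file can then be rendered page by page by whatever image-generation step is in use, instead of producing one image per code.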
I've managed to make my own traineddata based on the Polish language, and the results are outstanding. Thank you very much! I had usually avoided the tesstrain.sh script and tried to use my own, to customize and control the process. When it combines the language model, I noticed that it creates some dawg files. Is that because I'm using already existing language data? If so, how can I generate langdata myself for a custom language? The documentation isn't very clear on this. I know that it's created by combine_lang_model based on the wordlist (langdata). I don't need it right now, but I think it's a good idea to clear this up in case I ever need to train from scratch, although I know that's a rare case. Thank you for taking the time to solve my problem! :)