I think fine-tuning will be a better option than training from scratch. Using a small training/test text of 40 lines, I get:
---------------------------------
+ lstmeval --verbosity 0 --model /home/ubuntu/tessdata_best/script/Latin.traineddata --eval_listfile /home/ubuntu/tesstutorial/ocrb/eng.training_files.txt
Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR-B_10_BT.exp0.lstmf
Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR_B_MT.exp0.lstmf
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 0, stage 0, Eval Char error rate=0.73106061, Word error rate=13.75
---------------------------------
+ lstmeval --verbosity 0 --model /home/ubuntu/tessdata_best/eng.traineddata --eval_listfile /home/ubuntu/tesstutorial/ocrb/eng.training_files.txt
Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR-B_10_BT.exp0.lstmf
Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR_B_MT.exp0.lstmf
Warning: LSTMTrainer deserialized an LSTMRecognizer!
At iteration 0, stage 0, Eval Char error rate=47.444889, Word error rate=92.5
---------------------------------
At iteration 16/410/410, Mean rms=0.236%, delta=0.131%, char train=0.448%, word train=3.659%, skip ratio=0%, New best char error = 0.448 wrote checkpoint.
Finished! Error rate = 0.448
---------------------------------
+ lstmeval --model /home/ubuntu/tesstutorial/ocrb_from_full/ocrb_plus_checkpoint --traineddata /home/ubuntu/tesstutorial/ocrb/eng/eng.traineddata --eval_listfile /home/ubuntu/tesstutorial/ocrb/eng.training_files.txt
/home/ubuntu/tesstutorial/ocrb_from_full/ocrb_plus_checkpoint is not a recognition model, trying training checkpoint...
Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR-B_10_BT.exp0.lstmf
Loaded 40/40 pages (1-40) of document /home/ubuntu/tesstutorial/ocrb/eng.OCR_B_MT.exp0.lstmf
At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0
---------------------------------

On Wed, Sep 5, 2018 at 1:55 PM, <kaminski.robert...@gmail.com> wrote:

> Hi,
>
> (I might butcher English grammar- you have been warned!)
>
> For some time I've been trying to teach tesseract to read MRZ codes. Unfortunately it's not going very well. I'm using the latest version of tesseract (4.0), so I'm trying to train it with the lstm method. I've managed to pull it off and got some custom traineddata samples, but the results of using them are... let's say slightly unsatisfying. As a matter of fact, they are not even remotely close to the eng traineddata. I know that there was an mrz traineddata in the previous version of tesseract.
>
> I'm out of ideas how to improve accuracy, so I'll need your help, guys.
>
> At first I thought I could use images, .txt files containing the already-read data, and font data to somehow make box files (basically you have an image and a .txt containing everything read from the image). I was disappointed when I realized that without manual correction of the boxes tesseract won't know how to apply them correctly. Of course I need an automated method to apply boxes (I can't use any GUI or something).
>
> At the moment I'm only using .txt files, and these are the steps I'm doing (it's also good to mention that I'm trying to make it from scratch):
> -Using .txt and font (OcrB) to create .tiff and box files using the text2image method
> -Creating unicharset from all box files
> -(it's optional but for the sake of it) I'm applying unicharset properties
> -Getting traineddata from unicharset, langdata and using custom language as parameter
> -Creating lstmf files with: tesseract some.tiff output lstm.train
> -Creating list of files to train
> -Running lstm training with net spec [1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c111] and learning rate 20e-4
> -At the end I'm using the last checkpoint to create traineddata for usage.
>
> Currently the initial .txt files are randomly generated by my program in the form of MRZ codes (samples included). I also tried generating files with a mixed alphabet to get more character variety. I used about 1000 samples to train it, and the result doesn't differ from using 100 samples.
>
> Also, I disabled the dictionary in the OCR process to prevent tesseract from treating the whole MRZ code as a word.
>
> I might not understand some things despite reading a lot about this topic, but I'm pretty sure that I'm doing the training process correctly. Do you have any tips on how to improve the training process? Consider pointing out even the dumbest things I could have forgotten about.
>
> --
> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/b3b86804-5d86-4fac-a780-88a2ef4f2ba2%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

--
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
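
P.S. To compare lstmeval runs like the three at the top of this message without reading the logs by eye, something along these lines might help. It is only a rough sketch: parse_lstmeval and EVAL_LINE are hypothetical helper names, not part of tesseract, and the regular expression assumes the summary line looks exactly like the "Eval Char error rate=..., Word error rate=..." lines quoted above. The three input strings are copied from those runs.

```python
import re

# Assumed format of the lstmeval summary line, taken from the logs above, e.g.
# "At iteration 0, stage 0, Eval Char error rate=0.73106061, Word error rate=13.75"
EVAL_LINE = re.compile(
    r"Eval Char error rate=(?P<cer>[0-9.]+),\s*Word error rate=(?P<wer>[0-9.]+)"
)


def parse_lstmeval(output):
    """Return (char_error_rate, word_error_rate) as floats, or None if no summary line is found."""
    match = EVAL_LINE.search(output)
    if match is None:
        return None
    return float(match.group("cer")), float(match.group("wer"))


# Summary lines copied from the three lstmeval runs in this message.
runs = {
    "script/Latin (tessdata_best)": "At iteration 0, stage 0, Eval Char error rate=0.73106061, Word error rate=13.75",
    "eng (tessdata_best)": "At iteration 0, stage 0, Eval Char error rate=47.444889, Word error rate=92.5",
    "fine-tuned checkpoint": "At iteration 0, stage 0, Eval Char error rate=0, Word error rate=0",
}

results = {name: parse_lstmeval(line) for name, line in runs.items()}

# Print the models from lowest to highest character error rate.
for name, (cer, wer) in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name:30s} char error {cer:9.4f}  word error {wer:7.2f}")
```

In practice the strings in runs would be replaced by the captured output of each lstmeval call; the values are printed as lstmeval reports them.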