Re: [tesseract-ocr] Making custom traineddata

Shree Devi Kumar Mon, 08 Apr 2019 10:16:08 -0700

If you can provide another 40-50 lines of training data (as a text file),
I will rerun the training.
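
Before sending extra lines, it may help to check that they cover the whole
MRZ character set, since a character that never appears in the training
text (such as the X reported in the quoted message below) cannot be
learned. The following is only a minimal sketch, assuming the MRZ alphabet
is A-Z, 0-9 and the filler '<'. The function names are hypothetical and are
not part of any Tesseract tooling; the sample lines are two of the fake IDs
quoted below.

```python
import string
from collections import Counter

# Assumed MRZ alphabet: upper-case letters, digits, and the filler '<'.
MRZ_ALPHABET = string.ascii_uppercase + string.digits + "<"


def mrz_coverage(lines):
    """Count how often each MRZ character occurs in the given lines."""
    counts = Counter()
    for line in lines:
        # Ignore anything outside the assumed alphabet (spaces, newlines).
        counts.update(ch for ch in line.strip() if ch in MRZ_ALPHABET)
    return {ch: counts[ch] for ch in MRZ_ALPHABET}


def missing_chars(lines, min_count=1):
    """Return MRZ characters that occur fewer than min_count times."""
    coverage = mrz_coverage(lines)
    return [ch for ch in MRZ_ALPHABET if coverage[ch] < min_count]


if __name__ == "__main__":
    # Hypothetical usage: in practice, read the lines from the training text.
    sample = [
        "IDESPANH186495123456789X<<<<<<",
        "I<NLDIS2KX87214<<<<<<<<<<<<<<<",
    ]
    print("missing:", "".join(missing_chars(sample)))
```

Raising min_count (for example to 5) flags characters that are present but
rare, which may matter as much as characters that are absent altogether.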



On Mon, 8 Apr 2019, 22:11 Jankees Korstanje, <seek...@gmail.com> wrote:

> Hi Shree,
>
> We have tried your traineddata file for MRZ and noticed that it does not
> detect the character X.
>
> Looking at
> https://github.com/Shreeshrii/tessdata_ocrb/blob/master/eng.MRZ.training_text
>
> We see that the file contains no X.
>
> In addition, it might be good to add a couple of lines that are specific
> to IDs (starting with I). Note that these are all fake:
>
> IDESPANH186495123456789X<<<<<<
> IXESPE002561410<0233181G<<<<<
> I<NLDIS2KX87214<<<<<<<<<<<<<<<
>
> On Wednesday, 5 September 2018 18:03:41 UTC+2, shree wrote:
>>
>> See https://github.com/Shreeshrii/tessdata_ocrb
>> for the files and traineddata.
>>
>>
>> On Wed, Sep 5, 2018 at 8:51 PM, Shree Devi Kumar <shree...@gmail.com>
>> wrote:
>>
>>> I think fine-tuning will be a better option than training from scratch.
>>>
>>> Using a small training/test text of 40 lines, I get:
>>>
>>> ---------------------------------
>>>
>>> + lstmeval --verbosity 0 --model /home/ubuntu/tessdata_best/script/Latin.traineddata
>>> --eval_listfile /home/ubuntu/tesstutorial/ocrb/eng.training_files.txt
>>> Loaded 40/40 pages (1-40) of document
>>> /home/ubuntu/tesstutorial/ocrb/eng.OCR-B_10_BT.exp0.lstmf
>>> Loaded 40/40 pages (1-40) of document
>>> /home/ubuntu/tesstutorial/ocrb/eng.OCR_B_MT.exp0.lstmf
>>> Warning: LSTMTrainer deserialized an LSTMRecognizer!
>>> At iteration 0, stage 0, *Eval Char error rate=0.73106061*, *Word error
>>> rate=13.75*
>>>
>>> ---------------------------------
>>>
>>> + lstmeval --verbosity 0 --model /home/ubuntu/tessdata_best/eng.traineddata
>>> --eval_listfile /home/ubuntu/tesstutorial/ocrb/eng.training_files.txt
>>> Loaded 40/40 pages (1-40) of document
>>> /home/ubuntu/tesstutorial/ocrb/eng.OCR-B_10_BT.exp0.lstmf
>>> Loaded 40/40 pages (1-40) of document
>>> /home/ubuntu/tesstutorial/ocrb/eng.OCR_B_MT.exp0.lstmf
>>> Warning: LSTMTrainer deserialized an LSTMRecognizer!
>>> At iteration 0, stage 0, *Eval Char error rate=47.444889, Word error
>>> rate=92.5*
>>>
>>>
>>> ---------------------------------
>>>
>>> *At iteration 16/410/410, Mean rms=0.236%, delta=0.131%, char
>>> train=0.448%, word train=3.659%, skip ratio=0%,  New best char error =
>>> 0.448 wrote checkpoint.*
>>>
>>> *Finished! Error rate = 0.448*
>>>
>>>
>>> ---------------------------------
>>>
>>>
>>> + lstmeval --model /home/ubuntu/tesstutorial/ocrb_from_full/ocrb_plus_checkpoint
>>> --traineddata /home/ubuntu/tesstutorial/ocrb/eng/eng.traineddata
>>> --eval_listfile /home/ubuntu/tesstutorial/ocrb/eng.training_files.txt
>>> /home/ubuntu/tesstutorial/ocrb_from_full/ocrb_plus_checkpoint is not a
>>> recognition model, trying training checkpoint...
>>> Loaded 40/40 pages (1-40) of document
>>> /home/ubuntu/tesstutorial/ocrb/eng.OCR-B_10_BT.exp0.lstmf
>>> Loaded 40/40 pages (1-40) of document
>>> /home/ubuntu/tesstutorial/ocrb/eng.OCR_B_MT.exp0.lstmf
>>> At iteration 0, stage 0, *Eval Char error rate=0, Word error rate=0*
>>>
>>> ---------------------------------
>>>
>>> On Wed, Sep 5, 2018 at 1:55 PM, <kaminski...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> (I might butcher English grammar; you have been warned!)
>>>>
>>>>    For some time I have been trying to teach Tesseract to read MRZ
>>>> codes. Unfortunately, it's not going very well. I'm using the latest
>>>> version of Tesseract (4.0), so I'm trying to train it with the LSTM
>>>> method. I've managed to pull it off and got some custom traineddata
>>>> files, but the results of using them are... let's say slightly
>>>> unsatisfying. As a matter of fact, they are not even remotely close to
>>>> the eng traineddata. I know that there was an MRZ traineddata for the
>>>> previous version of Tesseract.
>>>>
>>>> I'm out of ideas on how to improve accuracy, so I need your help, guys.
>>>>
>>>> At first I thought I could use images, .txt files containing the
>>>> already-read data, and font data to somehow make box files (basically
>>>> you have an image and a .txt containing everything read from the
>>>> image). I was disappointed when I realized that, without manual
>>>> correction of the boxes, Tesseract won't know how to apply them
>>>> correctly. Of course, I need an automated method to apply the boxes (I
>>>> can't use a GUI or anything like that).
>>>>
>>>> At the moment I'm only using .txt files, and these are the steps I
>>>> follow (it's also worth mentioning that I'm training from scratch):
>>>> - Using the .txt files and the font (OcrB) to create .tiff and box
>>>> files with text2image
>>>> - Creating a unicharset from all box files
>>>> - (Optional, but for the sake of completeness) applying unicharset
>>>> properties
>>>> - Building the traineddata from the unicharset and langdata, with a
>>>> custom language as a parameter
>>>> - Creating .lstmf files by running tesseract some.tiff output
>>>> lstm.train
>>>> - Creating the list of files to train on
>>>> - Running LSTM training with net spec [1,36,0,1 Ct3,3,16 Mp3,3 Lfys48
>>>> Lfx96 Lrx96 Lfx256 O1c111] and learning rate 20e-4
>>>> - At the end, using the last checkpoint to create the traineddata for
>>>> actual use.
>>>>
>>>> Currently the initial .txt files are randomly generated by my own
>>>> program in the form of MRZ codes (samples included). I also tried
>>>> generating files with a mixed alphabet to get more character variety.
>>>> I used about 1000 samples for training, and the result does not differ
>>>> from using 100 samples.
>>>>
>>>> Also, I disabled the dictionary in the OCR process to prevent
>>>> Tesseract from treating the whole MRZ code as a word.
>>>>
>>>> I might not understand some things despite reading a lot about this
>>>> topic, but I'm pretty sure that I'm doing the training process
>>>> correctly. Do you have any tips on how to improve it? Feel free to
>>>> point out even the dumbest things I could have forgotten.
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to tesser...@googlegroups.com.
>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/b3b86804-5d86-4fac-a780-88a2ef4f2ba2%40googlegroups.com
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/b3b86804-5d86-4fac-a780-88a2ef4f2ba2%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>
>>
>>
