Re: [tesseract-ocr] Making custom traineddata

Shree Devi Kumar Thu, 06 Sep 2018 04:56:44 -0700

> When it's combining the language model I've spotted that it's making some
> dawg files.


Yes, it takes the files from the langdata repo specified in the training
command.

You could change langdata/pol/pol.wordlist to contain only the LAST NAMES and
GIVEN NAMES, pol.punc to contain only <, and the number formats in
pol.numbers to the MRZ number patterns (i.e. make whatever customizations
your use case requires).
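
As a rough illustration of that customization, here is a minimal Python sketch
that rewrites the wordlist and punctuation files. The function name and the
sample names are hypothetical, and it assumes both files are plain UTF-8 text
with one entry per line, which is how the langdata files appear to be laid out.

```python
from pathlib import Path


def write_langdata_lists(lang_dir, lang, words, punctuation):
    """Write <lang>.wordlist and <lang>.punc, one UTF-8 entry per line."""
    lang_dir = Path(lang_dir)
    lang_dir.mkdir(parents=True, exist_ok=True)

    # Strip whitespace, drop blanks, and deduplicate in first-seen order.
    cleaned = (w.strip() for w in words)
    unique_words = list(dict.fromkeys(w for w in cleaned if w))

    wordlist_path = lang_dir / f"{lang}.wordlist"
    punc_path = lang_dir / f"{lang}.punc"
    wordlist_path.write_text("\n".join(unique_words) + "\n", encoding="utf-8")
    punc_path.write_text("\n".join(punctuation) + "\n", encoding="utf-8")
    return wordlist_path, punc_path
```

For example, write_langdata_lists("langdata/pol", "pol", ["KOWALSKI", "ANNA"],
["<"]) would overwrite the two files with just those entries. pol.numbers is
deliberately left out: its pattern syntax is not reproduced here, so it is
safer to edit it by hand following the style of the existing file.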

I am not sure how much the dawgs help with the LSTM engine, but you can try
after customizing to see if you get improved results.
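
To try that, the dawgs have to be rebuilt from the customized files. A minimal
Python sketch of assembling the combine_lang_model call follows. The helper
name and all paths are hypothetical, and the flags are the ones used by the
Tesseract 4 training scripts as far as I recall, so please check them against
combine_lang_model --help for your build.

```python
from pathlib import Path


def combine_lang_model_args(lang, langdata_dir, unicharset, output_dir):
    """Return an argv list for combine_lang_model; nothing is executed here."""
    # Assumed layout: <langdata_dir>/<lang>/<lang>.{wordlist,numbers,punc}
    lang_dir = Path(langdata_dir) / lang
    return [
        "combine_lang_model",
        "--input_unicharset", str(unicharset),
        "--script_dir", str(langdata_dir),
        "--words", str(lang_dir / f"{lang}.wordlist"),
        "--numbers", str(lang_dir / f"{lang}.numbers"),
        "--puncs", str(lang_dir / f"{lang}.punc"),
        "--output_dir", str(output_dir),
        "--lang", lang,
    ]
```

The returned list can be passed to subprocess.run(args, check=True) once the
training tools are on the PATH. Building the arguments separately makes it easy
to print and inspect the command before running it.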

On Thu, Sep 6, 2018 at 4:23 PM, <kaminski.robert...@gmail.com> wrote:

> Thank you for your reply Shreeshrii!
>
> Indeed, the finetune method is a much, much better solution for my problem.
> Thanks to the logs and data provided in your repo I realized that I don't
> need to generate every single MRZ code separately (I'm sure it was mentioned
> somewhere). In fact, the process of making tiffs with boxes and then
> lstmf's was oddly long (also, loading lines in the form of pages takes much
> less time). Using merged data, it is now just a matter of seconds. I don't
> know if it affected accuracy, but now I'm generating every code in one .txt
> file and then processing it.
>
> I've managed to make my own traineddata based on the Polish language and
> the results are outstanding. Thank you very much!
>
> Usually I've avoided the tesstrain.sh script and tried to use my own, just
> to customize the process and control it. When it's combining the language
> model I've spotted that it's making some dawg files. Is it because I'm using
> already existing language data? If so, how can I generate langdata myself
> for a custom language? In this case the documentation isn't so clear. I know
> that they are created by combine_lang_model based on the wordlist (langdata).
> I don't need it at the moment, but I think it's a good idea to clear that up
> in case I need to do some training from scratch, although I know that's a
> rare case.
>
> Thank you for taking your time to solve my problem! :)



-- 

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduUCfV5LrfSqxDZh%3DZV5rsTxPXT0cDtiizBhvjnfkvq%2Bfg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
