It works! I modified your bash script and ran it, and the resulting traineddata finally has a different size.
But can I train it from scratch? That needs a starter traineddata, which I can get from combine_lang_model, right?

On Tuesday, January 9, 2018 at 7:36:08 PM UTC+7, shree wrote:
>
>> My reason for using combine_lang_model is to make my punc, wordlist, and
>> numbers files affect the trained data. Or doesn't it work like that?
>
> If you update the files in the langdata folder and then run tesstrain.sh, it
> will automatically use your files.
>
>> Now I will try your shell script for training, and will share the result
>> when it is done.
>
> You will need to modify it according to the location of your files.
>
> Also, update the fonts list as per your requirements.
>
>> On Tuesday, January 9, 2018 at 6:17:40 PM UTC+7, shree wrote:
>>>
>>> 1. If you use tesstrain.sh, it will create the starter traineddata; you
>>> do NOT need to run combine_lang_model. If you want to change the version
>>> string, look at tesstrain_utils.sh and modify the command in it.
>>>
>>> 2. If you are always getting a file of the same size, you are probably
>>> copying some old file as traineddata as part of your script.
>>> It could be copying from a wrong folder or some such thing.
>>>
>>> I am attaching a bash script; you can modify it for your setup and see
>>> if that helps.
>>>
>>> ShreeDevi
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Tue, Jan 9, 2018 at 9:39 AM, <easyma...@gmail.com> wrote:
>>>
>>>> Yes, I ran the following command in the tesseract/training directory:
>>>>
>>>> lstmtraining --stop_training --continue_from
>>>> ../result/mylangoutput/base_checkpoint --traineddata
>>>> ../result/mylangcombine/mylang/mylang.traineddata --model_output
>>>> ../result/mylangoutput/mylang.traineddata
>>>>
>>>> On Monday, January 8, 2018 at 7:36:50 PM UTC+7, shree wrote:
>>>>>
>>>>> Did you use --stop_training flag at the end?
>>>>>
>>>>> ShreeDevi
>>>>>
>>>>> On Mon, Jan 8, 2018 at 5:51 PM, <easyma...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I am working on a project using Tesseract v4.00, and the traineddata
>>>>>> output always has the same size after training with my own data.
>>>>>> I suppose I did not do the steps correctly.
>>>>>>
>>>>>> The only data that I provided were:
>>>>>> 1. training_text
>>>>>> 2. puncs (I just reduced the general punc file provided in the
>>>>>> tesseract github)
>>>>>> 3. numbers
>>>>>> 4. wordlists (I made various wordlists for several training runs,
>>>>>> ranging between 100,000 and 2,000,000)
>>>>>> 5. font name (I also varied the fonts across several training runs,
>>>>>> ranging between 1 and 20 fonts)
>>>>>>
>>>>>> The steps that I did were:
>>>>>> 1. Generated the tiff files, unicharset, and other supporting data
>>>>>> using tesstrain.sh
>>>>>> 2. Generated the tiff files, unicharset, and other supporting data
>>>>>> using tesstrain.sh for evaluation
>>>>>> 3. Combined unicharset, wordlists, puncs, numbers, and version_str to
>>>>>> create the starter traineddata using combine_lang_model (I am still
>>>>>> not confident about the value of version_str, though)
>>>>>> 4. Trained the data using lstmtraining
>>>>>> 5. Combined all output files using lstmtraining --continue_from ...
>>>>>>
>>>>>> Yet all of my training runs ended with the same size, which is 10.5MB.
>>>>>> Did I do all my steps correctly?
>>>>>>
>>>>>> Once, I also trained after changing WORD_DAWG_FACTOR in
>>>>>> language-specific.sh to 0 and to 1, because I want the recognized text
>>>>>> to match my wordlists 100%. But the result still did not satisfy me:
>>>>>> some output words, such as "USISUSISU", are not in my wordlists.
>>>>>> Do you know what the cause is?
>>>>>>
>>>>>> I would really appreciate any help or suggestions.
>>>>>> Thank you!
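The thread suggests that a file size that never changes may mean an old file is being copied into place as the output. A minimal sketch for checking that directly is below. `compare_traineddata` is a hypothetical helper name, not part of Tesseract, and it relies only on the standard `cmp` and `wc` utilities. It assumes you still have both files on disk.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a Tesseract tool): report how two traineddata
# files relate to each other.
#   identical                    -> byte-for-byte the same file contents
#   same-size-different-content  -> sizes match but the bytes differ
#   different                    -> sizes differ
#   missing                      -> one of the paths is not a regular file
compare_traineddata() {
    old_file="$1"
    new_file="$2"

    # Both paths must exist and be regular files.
    if [ ! -f "$old_file" ] || [ ! -f "$new_file" ]; then
        echo "missing"
        return 2
    fi

    # cmp -s prints nothing and exits 0 only when the contents match exactly.
    if cmp -s "$old_file" "$new_file"; then
        echo "identical"
        return 0
    fi

    # Left unquoted on purpose: some wc implementations pad the count with
    # spaces, and word splitting strips that padding.
    old_size=$(wc -c < "$old_file")
    new_size=$(wc -c < "$new_file")
    if [ $old_size -eq $new_size ]; then
        # Equal size with different bytes: the file did change, so size
        # alone is not a reliable signal.
        echo "same-size-different-content"
    else
        echo "different"
    fi
    return 1
}
```

With the paths from the command quoted above, this could be called as `compare_traineddata ../result/mylangcombine/mylang/mylang.traineddata ../result/mylangoutput/mylang.traineddata`. A result of `identical` would suggest that the final output is just a copy of the starter traineddata.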
-- 
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/55a753fe-8713-4934-93a6-76f1e256c50d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.