Re: [tesseract-ocr] Traineddata always ended in same size and did not match with wordlist

easymavinmind Wed, 10 Jan 2018 02:26:49 -0800

It works!
I modified your bash script and ran it, and I finally get a traineddata file 
of a different size.
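
One way to confirm that a new run really produced a new file, rather than a 
stale copy, is to compare the old and new traineddata byte for byte instead of 
relying on the size alone. This is only a minimal sketch: the function name 
traineddata_changed and the paths in the usage comment are placeholders I made 
up, not part of the Tesseract tools. It uses only the standard cmp and wc 
utilities.

```shell
#!/usr/bin/env bash
# Sketch: report whether two traineddata files actually differ.
# The function name and the example paths are hypothetical.

traineddata_changed() {
  local old_file="$1" new_file="$2"

  # Both files must exist, otherwise the comparison is meaningless.
  if [ ! -f "$old_file" ] || [ ! -f "$new_file" ]; then
    echo "missing"
    return 2
  fi

  # Show the byte sizes on stderr so stdout carries only the verdict.
  echo "old: $(wc -c < "$old_file") bytes, new: $(wc -c < "$new_file") bytes" >&2

  # cmp -s exits with status 0 when the files are byte-for-byte identical.
  if cmp -s "$old_file" "$new_file"; then
    echo "identical"
    return 1
  fi

  echo "changed"
  return 0
}

# Example usage (placeholder paths):
# traineddata_changed old/mylang.traineddata ../result/mylangoutput/mylang.traineddata
```

Two files of the same size can still differ in content, so "identical" here 
is a stronger hint of a stale copy than matching sizes are.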


But can I train it from scratch?
That needs a starter traineddata, which I can get from combine_lang_model, 
doesn't it?
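
For reference, as far as I understand it, training from scratch means giving 
lstmtraining a network specification (--net_spec) together with the starter 
traineddata, instead of --continue_from. The sketch below only prints such a 
command so it can be reviewed before anything is run. The function name, all 
paths and the iteration count are placeholders, and the --net_spec string has 
to be supplied by the caller because its output layer must match the size of 
the unicharset.

```shell
#!/usr/bin/env bash
# Sketch: print (do not run) an lstmtraining command for training from scratch.
# print_scratch_training_cmd is a hypothetical helper; every argument is a placeholder.

print_scratch_training_cmd() {
  local starter="$1"     # starter traineddata (from tesstrain.sh or combine_lang_model)
  local net_spec="$2"    # network specification string, chosen by the caller
  local out_prefix="$3"  # prefix for the checkpoint files that training writes
  local train_list="$4"  # text file listing the training .lstmf files
  local max_iter="$5"    # iteration limit for this run

  # The net spec contains spaces and brackets, so it is printed in single quotes.
  printf "lstmtraining --traineddata %s --net_spec '%s' --model_output %s --train_listfile %s --max_iterations %s\n" \
    "$starter" "$net_spec" "$out_prefix" "$train_list" "$max_iter"
}

# Example usage (placeholder values):
# print_scratch_training_cmd mylang/mylang.traineddata "$MY_NET_SPEC" \
#   ../result/mylangoutput/base mylang.training_files.txt 5000
```

Printing first makes it easy to spot a wrong path before a long run starts; 
once the command looks right, it can be pasted into the shell.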

 
On Tuesday, January 9, 2018 at 7:36:08 PM UTC+7, shree wrote:
>
>
>> My reason for using combine_lang_model is to make my punc, wordlist, and 
>> numbers files affect the trained data. Or does it not work like that?
>>
>
> If you update the files in the langdata folder and then run tesstrain.sh, it 
> will automatically use your files.
> 
>
>>
>> Now I will try your shell script for training and will share the result 
>> once it is done.
>>
>
> You will need to modify it according to the location of your files.
>
> Also, update the fonts list as per your requirements.
> 
>
>>
>>
>> On Tuesday, January 9, 2018 at 6:17:40 PM UTC+7, shree wrote:
>>>
>>> 1. If you use tesstrain.sh, it will create the starter traineddata; you 
>>> do NOT need to run combine_lang_model. If you want to change the version 
>>> string, look at tesstrain_utils.sh and modify the command in it.
>>>
>>> 2. If you are always getting a file of the same size, you are probably 
>>> copying some old file as the traineddata as part of your script. It could 
>>> be copying from the wrong folder or something similar.
>>>
>>> I am attaching a bash script; you can modify it for your setup and see 
>>> if that helps.
>>>
>>> ShreeDevi
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Tue, Jan 9, 2018 at 9:39 AM, <easyma...@gmail.com> wrote:
>>>
>>>> Yes, I ran the following command in the tesseract/training directory:
>>>>
>>>> lstmtraining --stop_training \
>>>>   --continue_from ../result/mylangoutput/base_checkpoint \
>>>>   --traineddata ../result/mylangcombine/mylang/mylang.traineddata \
>>>>   --model_output ../result/mylangoutput/mylang.traineddata
>>>>
>>>> On Monday, January 8, 2018 at 7:36:50 PM UTC+7, shree wrote:
>>>>>
>>>>> Did you use the --stop_training flag at the end?
>>>>>
>>>>> ShreeDevi
>>>>> ____________________________________________________________
>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>
>>>>> On Mon, Jan 8, 2018 at 5:51 PM, <easyma...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I am doing my project using Tesseract v4.00, and the traineddata output 
>>>>>> always comes out the same size after training with my own data.
>>>>>> I suppose that I did not do the steps correctly.
>>>>>>
>>>>>> The only data that I provided were:
>>>>>> 1. training_text
>>>>>> 2. puncs (I just reduced the general punc file provided in the tesseract 
>>>>>> GitHub repository)
>>>>>> 3. numbers
>>>>>> 4. wordlists (I made various wordlists for several training runs, ranging 
>>>>>> from 100,000 to 2,000,000 entries)
>>>>>> 5. font names (I also used various font lists for several training runs, 
>>>>>> ranging from 1 to 20 fonts)
>>>>>>
>>>>>> The steps that I did were:
>>>>>> 1. Made the tiff files, unicharset and other supporting data using 
>>>>>> tesstrain.sh
>>>>>> 2. Made the tiff files, unicharset and other supporting data using 
>>>>>> tesstrain.sh for evaluation
>>>>>> 3. Combined the unicharset, wordlists, puncs, numbers and version_str to 
>>>>>> create the starter traineddata using combine_lang_model (I am still not 
>>>>>> confident about the value of version_str, though)
>>>>>> 4. Trained the data using lstmtraining
>>>>>> 5. Combined all output files using lstmtraining --continue_from ...
>>>>>>
>>>>>> Yet all of my training runs ended with the same size, which is 10.5 MB.
>>>>>> Did I do all my steps correctly?
>>>>>>
>>>>>> Once, I also trained after changing WORD_DAWG_FACTOR in 
>>>>>> language-specific.sh to 0 and to 1, because I want the recognized text to 
>>>>>> match my wordlists 100%. But the result still did not satisfy me: some 
>>>>>> output words, such as "USISUSISU", are not in my wordlists.
>>>>>> Do you know what the cause is?
>>>>>>
>>>>>> I would really appreciate it if anyone can help or suggest a solution.
>>>>>> Thank you!
>>>>>>
>>>>>> -- 
>>>>>> You received this message because you are subscribed to the Google 
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>> To view this discussion on the web visit 
>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/b6ca74b2-1e50-44cb-93f6-586fcd26cec5%40googlegroups.com
>>>>>>  
>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/b6ca74b2-1e50-44cb-93f6-586fcd26cec5%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>> .
>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>
>>>>>
>>>>
>>>
>>
>
>

