Re: [tesseract-ocr] Fine tuning existing model

Tairen Chen Thu, 02 May 2019 14:56:16 -0700

Hi Lorenzo and Shree,

     Thanks for sharing.
     I am trying to reproduce what you have done here.
     I followed your posts and changed the Makefile, but when I run
     $ make training
     I get the following errors:
           mkdir -p data/checkpoints
           lstmtraining \


  --continue_from     extracted/eng.lstm \
  --old_traineddata   extracted/eng.traineddata \
  --traineddata data/eng/eng.traineddata \
  --model_output data/checkpoints/eng \
  --debug_interval -1 \
  --train_listfile data/list.train \
  --eval_listfile data/list.eval \
  --sequential_training \
  --max_iterations 3000

Must provide a --traineddata see training wiki
Makefile:111: recipe for target 'data/checkpoints/eng_checkpoint' failed
make: *** [data/checkpoints/eng_checkpoint] Error 1


      However, I can run the command manually:

      $ lstmtraining \
          --traineddata data/eng/eng.traineddata \
          --continue_from extracted/eng.lstm \
          --old_traineddata extracted/eng.traineddata \
          --model_output data/checkpoints/eng \
          --debug_interval -1 \
          --train_listfile data/list.train \
          --eval_listfile data/list.eval \
          --sequential_training \
          --max_iterations 3000
      
      I don't know what to change; I am new to both Tesseract and 
Makefiles. Please share your wisdom.
      Thank you!
All the best,
                            Tairen
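
A note on the error above, offered as an assumption rather than a confirmed diagnosis: "Must provide a --traineddata" suggests that lstmtraining was started without its arguments. The echoed recipe shows blank lines right after "lstmtraining \", which would be consistent with a broken line continuation in the edited Makefile (a blank line, or stray whitespace after the backslash). The sketch below flags both cases. check_continuations is a hypothetical helper name, not part of Tesseract or ocrd-train.

```shell
# check_continuations FILE
# Hypothetical helper: report lines where a backslash continuation looks broken.
# Exit status is 1 if anything was reported, 0 otherwise.
check_continuations() {
  awk '
    # A backslash followed by spaces, tabs or a carriage return no longer
    # escapes the newline, so the command ends on this line.
    /\\[ \t\r]+$/ {
      printf "%d: whitespace after trailing backslash\n", NR
      bad = 1
    }
    # An empty line directly after a continued line also cuts the command short.
    prev_cont && /^[ \t\r]*$/ {
      printf "%d: blank line after a continuation\n", NR
      bad = 1
    }
    # Remember whether this line ends in a backslash (optionally followed by whitespace).
    { prev_cont = ($0 ~ /\\[ \t\r]*$/) }
    END { exit bad + 0 }
  ' "$1"
}
```

Usage: check_continuations Makefile. If it prints nothing, the continuations are probably not the cause and the recipe needs a closer look elsewhere.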
     
On Friday, June 29, 2018 at 11:17:35 AM UTC-7, Lorenzo Blz wrote:
>
>
> I think I found the problem. Running the new Makefile directly, I got this 
> error:
>
> make: *** No rule to make target 
> 'data/train/alexis_ruhe01_1852_0018_022.box', needed by 'data/all-boxes'.  
> Stop.
>
> The Makefile looks for ground-truth files ending in "-gt.txt", but my train 
> files end in ".gt.txt". With that fixed, I can now run your script directly.
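
The naming mismatch described above can be checked before running make. A minimal sketch, assuming the layout used in this thread (one ground-truth text file next to each .tif in the training directory). missing_gt is a hypothetical helper name, and the suffix is passed in explicitly because the Makefile in this thread expects "-gt.txt" while the training files used ".gt.txt".

```shell
# missing_gt DIR SUFFIX
# Print every DIR/*.tif that has no matching ground-truth file DIR/<name><SUFFIX>.
missing_gt() {
  dir=$1
  suffix=$2
  for tif in "$dir"/*.tif; do
    [ -e "$tif" ] || continue            # the glob matched nothing
    base=${tif%.tif}                     # strip the image extension
    [ -e "$base$suffix" ] || printf '%s\n' "$tif"
  done
}
```

Usage: missing_gt data/train -gt.txt. Every path it prints is an image for which make would find no rule to build the .box file.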
>
> I also replaced the eng.traineddata with the one from here:
>
> https://github.com/tesseract-ocr/tessdata_best
>
> and it's training correctly. (It also works correctly with the previous 
> model, from https://github.com/tesseract-ocr/tessdata.)
>
>
>
> One more question: I wanted to check if the output character set of the 
> new and old model differ. I used:
>
> combine_tessdata -u eng.traineddata orig
>
> on both models and compared the unicharset files. I see that some 
> characters are missing and some others are added. It looks good. Is this 
> the correct way to check this?
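
One way to make the unicharset comparison described above concrete, as a sketch rather than an official procedure. It assumes the unicharset layout in which the first line holds the entry count and every following line starts with the character itself, followed by its properties. unicharset_chars and unicharset_diff are hypothetical helper names, and the file names in the usage line are placeholders for the files written by combine_tessdata -u.

```shell
# unicharset_chars FILE
# One character per line, sorted, without the leading count line.
unicharset_chars() {
  tail -n +2 "$1" | cut -d ' ' -f 1 | LC_ALL=C sort -u
}

# unicharset_diff OLD NEW
# Lines starting with "-" exist only in OLD, lines starting with "+" only in NEW.
unicharset_diff() {
  old=$(mktemp) && new=$(mktemp) || return 1
  unicharset_chars "$1" > "$old"
  unicharset_chars "$2" > "$new"
  LC_ALL=C comm -23 "$old" "$new" | sed 's/^/-/'
  LC_ALL=C comm -13 "$old" "$new" | sed 's/^/+/'
  rm -f "$old" "$new"
}
```

Usage: unicharset_diff old.lstm-unicharset new.lstm-unicharset. An empty result means both models share the same output characters.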
>
> Can I train a model this way that, for example, only recognizes 
> uppercase characters or numbers, simply by providing only uppercase 
> training data? Or is there something else to configure?
>
>
> Thanks, bye
>
> Lorenzo
>
>
> 2018-06-29 18:27 GMT+02:00 Shree Devi Kumar <shree...@gmail.com>:
>
>> You should be able to use the new makefile after you make changes for all 
>> the directory locations to match your setup. 
>>
>> Change the language from frk to eng. However, the sample training text 
>> seems to be non-English, in which case it is better to use the 
>> appropriate language traineddata, e.g. tessdata_best/deu.traineddata for 
>> German.
>>
>> On Fri, Jun 29, 2018 at 9:03 PM Lorenzo Bolzani <l.bo...@gmail.com> wrote:
>>
>>> Hi Shree, thanks for your answer.
>>>
>>> I tried the script setting:
>>>
>>> TESSDATA=extracted           # here I have eng.lstm and eng.traineddata
>>> LANGDATA=langdata-master     # all langdata downloaded by OCR-D
>>>
>>> MODEL_NAME = eng
>>> CONTINUE_FROM = eng
>>>
>>>
>>> First I run the old Makefile to create the boxes.
>>>
>>> $ make training MODEL_NAME=eng
>>>
>>>
>>> I stop it as soon as the training starts:
>>>
>>> At iteration 400/400/400, Mean rms=6.657%, delta=40.765%, char 
>>> train=100.827%, word train=100%, skip ratio=0%,  New worst char error = 
>>> 100.827 wrote checkpoint.
>>>
>>>
>>> At iteration 500/500/500, Mean rms=6.644%, delta=40.423%, char 
>>> train=100.662%, word train=100%, skip ratio=0%,  New worst char error = 
>>> 100.662 wrote checkpoint.
>>>
>>> ^Cmake: *** Deleting file 'data/checkpoints/eng_checkpoint'
>>> Makefile:110: recipe for target 'data/checkpoints/eng_checkpoint' failed
>>> make: *** [data/checkpoints/eng_checkpoint] Interrupt
>>>
>>> Notice that the data/checkpoints/eng_checkpoint file is deleted; I do 
>>> not know whether that is relevant.
>>>
>>>
>>> then I switch to the new one and I get this:
>>>
>>> $ make training
>>>
>>> mkdir -p data/checkpoints
>>> lstmtraining \
>>>   --continue_from   extracted/eng.lstm \
>>>   --old_traineddata extracted/eng.traineddata \
>>>   --traineddata data/eng/eng.traineddata \
>>>   --model_output data/checkpoints/eng \
>>>   --debug_interval -1 \
>>>   --train_listfile data/list.train \
>>>   --eval_listfile data/list.eval \
>>>   --sequential_training \
>>>   --max_iterations 3000
>>> Loaded file extracted/eng.lstm, unpacking...
>>> Warning: LSTMTrainer deserialized an LSTMRecognizer!
>>> Code range changed from 111 to 76!
>>> Num (Extended) outputs,weights in Series:
>>>   1,36,0,1:1, 0
>>> Num (Extended) outputs,weights in Series:
>>>   C3,3:9, 0
>>>   Ft16:16, 160
>>> Total weights = 160
>>>   [C3,3Ft16]:16, 160
>>>   Mp3,3:16, 0
>>>   Lfys64:64, 20736
>>>   Lfx96:96, 61824
>>>   Lrx96:96, 74112
>>>   Lfx512:512, 1247232
>>>   Fc76:76, 0
>>> Total weights = 1404064
>>> Previous null char=110 mapped to 75
>>> Continuing from extracted/eng.lstm
>>> Loaded 1/1 pages (1-1) of document 
>>> data/train/mueller_waldhornist_1821_0130_010.lstmf
>>> Loaded 1/1 pages (1-1) of document 
>>> data/train/bismarck_erinnerungen02_1898_0274_002.lstmf
>>> Loaded 1/1 pages (1-1) of document 
>>> data/train/spyri_heidi_1880_0062_005.lstmf
>>> Loaded 1/1 pages (1-1) of document 
>>> data/train/novalis_ofterdingen_1802_0210_001.lstmf
>>> Iteration 0: ALIGNED TRUTH : Sparoͤfen kauft' ich auch und Sorgenstuͤhle,
>>> Iteration 0: BEST OCR TEXT : l bd o D V fc ds ft hs D t' dsu PM )k ,„cGs 
>>> D t' D„Gs 'A AKG„9„t d tft ü!Vt Eb ht Ac )k uF ' K,cGPFVts
>>> File data/train/mueller_waldhornist_1821_0130_010.lstmf page 0 :
>>> !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
>>> !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
>>> Makefile:113: recipe for target 'data/checkpoints/eng_checkpoint' failed
>>> make: *** [data/checkpoints/eng_checkpoint] Segmentation fault
>>>
>>>
>>> What am I doing wrong?
>>>
>>>
>>>
>>> Lorenzo
>>>
>>> 2018-06-29 14:08 GMT+02:00 Shree Devi Kumar <shree...@gmail.com>:
>>>
>>>> I modified the makefile for ocrd-train to do fine-tuning.  It is pasted 
>>>> below:
>>>>
>>>> export
>>>>
>>>> SHELL := /bin/bash
>>>> LOCAL := $(PWD)/usr
>>>> PATH := $(LOCAL)/bin:$(PATH)
>>>> HOME := /home/ubuntu
>>>> TESSDATA =  $(HOME)/tessdata_best
>>>> LANGDATA = $(HOME)/langdata
>>>>
>>>> # Name of the model to be built
>>>> MODEL_NAME = frk
>>>>
>>>> # Name of the model to continue from
>>>> CONTINUE_FROM = frk
>>>>
>>>> # Normalization Mode - see src/training/language_specific.sh for details
>>>> NORM_MODE = 2
>>>>
>>>> # Tesseract model repo to use. Default: $(TESSDATA_REPO)
>>>> TESSDATA_REPO = _best
>>>>
>>>> # Train directory
>>>> TRAIN := data/train
>>>>
>>>> # BEGIN-EVAL makefile-parser --make-help Makefile
>>>>
>>>> help:
>>>> 	@echo ""
>>>> 	@echo "  Targets"
>>>> 	@echo ""
>>>> 	@echo "    unicharset       Create unicharset"
>>>> 	@echo "    lists            Create lists of lstmf filenames for training and eval"
>>>> 	@echo "    training         Start training"
>>>> 	@echo "    proto-model      Build the proto model"
>>>> 	@echo "    leptonica        Build leptonica"
>>>> 	@echo "    tesseract        Build tesseract"
>>>> 	@echo "    tesseract-langs  Download tesseract-langs"
>>>> 	@echo "    langdata         Download langdata"
>>>> 	@echo "    clean            Clean all generated files"
>>>> 	@echo ""
>>>> 	@echo "  Variables"
>>>> 	@echo ""
>>>> 	@echo "    MODEL_NAME         Name of the model to be built"
>>>> 	@echo "    CORES              No of cores to use for compiling leptonica/tesseract"
>>>> 	@echo "    LEPTONICA_VERSION  Leptonica version. Default: $(LEPTONICA_VERSION)"
>>>> 	@echo "    TESSERACT_VERSION  Tesseract commit. Default: $(TESSERACT_VERSION)"
>>>> 	@echo "    LANGDATA_VERSION   Tesseract langdata version. Default: $(LANGDATA_VERSION)"
>>>> 	@echo "    TESSDATA_REPO      Tesseract model repo to use. Default: $(TESSDATA_REPO)"
>>>> 	@echo "    TRAIN              Train directory"
>>>> 	@echo "    RATIO_TRAIN        Ratio of train / eval training data"
>>>>
>>>> # END-EVAL
>>>>
>>>> # Ratio of train / eval training data
>>>> RATIO_TRAIN := 0.90
>>>>
>>>> ALL_BOXES = data/all-boxes
>>>> ALL_LSTMF = data/all-lstmf
>>>>
>>>> # Create unicharset
>>>> unicharset: data/unicharset
>>>>
>>>> # Create lists of lstmf filenames for training and eval
>>>> lists: $(ALL_LSTMF) data/list.train data/list.eval
>>>>
>>>> data/list.train: $(ALL_LSTMF)
>>>> 	total=`cat $(ALL_LSTMF) | wc -l` \
>>>> 	   no=`echo "$$total * $(RATIO_TRAIN) / 1" | bc`; \
>>>> 	   head -n "$$no" $(ALL_LSTMF) > "$@"
>>>>
>>>> # Take the last "no" lines so that the eval list does not overlap the train list
>>>> data/list.eval: $(ALL_LSTMF)
>>>> 	total=`cat $(ALL_LSTMF) | wc -l` \
>>>> 	   no=`echo "($$total - $$total * $(RATIO_TRAIN)) / 1" | bc`; \
>>>> 	   tail -n "$$no" $(ALL_LSTMF) > "$@"
>>>>
>>>> # Start training
>>>> training: data/$(MODEL_NAME).traineddata
>>>>
>>>> data/unicharset: $(ALL_BOXES)
>>>> 	combine_tessdata -u $(TESSDATA)/$(CONTINUE_FROM).traineddata $(TESSDATA)/$(CONTINUE_FROM).
>>>> 	unicharset_extractor --output_unicharset "$(TRAIN)/my.unicharset" --norm_mode $(NORM_MODE) "$(ALL_BOXES)"
>>>> 	merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset "$@"
>>>>
>>>> $(ALL_BOXES): $(sort $(patsubst %.tif,%.box,$(wildcard $(TRAIN)/*.tif)))
>>>> 	find $(TRAIN) -name '*.box' -exec cat {} \; > "$@"
>>>>
>>>> $(TRAIN)/%.box: $(TRAIN)/%.tif $(TRAIN)/%-gt.txt
>>>> 	python generate_line_box.py -i "$(TRAIN)/$*.tif" -t "$(TRAIN)/$*-gt.txt" > "$@"
>>>>
>>>> $(ALL_LSTMF): $(sort $(patsubst %.tif,%.lstmf,$(wildcard $(TRAIN)/*.tif)))
>>>> 	find $(TRAIN) -name '*.lstmf' -exec echo {} \; | sort -R -o "$@"
>>>>
>>>> $(TRAIN)/%.lstmf: $(TRAIN)/%.box
>>>> 	tesseract $(TRAIN)/$*.tif $(TRAIN)/$*   --psm 6 lstm.train
>>>>
>>>> # Build the proto model
>>>> proto-model: data/$(MODEL_NAME)/$(MODEL_NAME).traineddata
>>>>
>>>> data/$(MODEL_NAME)/$(MODEL_NAME).traineddata: $(LANGDATA) data/unicharset
>>>> 	combine_lang_model \
>>>> 	  --input_unicharset data/unicharset \
>>>> 	  --script_dir $(LANGDATA) \
>>>> 	  --words $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).wordlist \
>>>> 	  --numbers $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).numbers \
>>>> 	  --puncs $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).punc \
>>>> 	  --output_dir data/ \
>>>> 	  --lang $(MODEL_NAME)
>>>>
>>>> data/checkpoints/$(MODEL_NAME)_checkpoint: unicharset lists proto-model
>>>> 	mkdir -p data/checkpoints
>>>> 	lstmtraining \
>>>> 	  --continue_from   $(TESSDATA)/$(CONTINUE_FROM).lstm \
>>>> 	  --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
>>>> 	  --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
>>>> 	  --model_output data/checkpoints/$(MODEL_NAME) \
>>>> 	  --debug_interval -1 \
>>>> 	  --train_listfile data/list.train \
>>>> 	  --eval_listfile data/list.eval \
>>>> 	  --sequential_training \
>>>> 	  --max_iterations 3000
>>>>
>>>> data/$(MODEL_NAME).traineddata: data/checkpoints/$(MODEL_NAME)_checkpoint
>>>> 	lstmtraining \
>>>> 	  --stop_training \
>>>> 	  --continue_from $^ \
>>>> 	  --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
>>>> 	  --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
>>>> 	  --model_output $@
>>>>
>>>> # Clean all generated files
>>>> clean:
>>>> 	find data/train -name '*.box' -delete
>>>> 	find data/train -name '*.lstmf' -delete
>>>> 	rm -rf data/all-*
>>>> 	rm -rf data/list.*
>>>> 	rm -rf data/$(MODEL_NAME)
>>>> 	rm -rf data/unicharset
>>>> 	rm -rf data/checkpoints
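
The data/list.train and data/list.eval rules in the Makefile above split one shuffled file list according to RATIO_TRAIN. The sketch below shows the same idea in plain shell, under two assumptions of mine: the ratio is given as an integer percentage so that bc is not needed, and the evaluation list is simply the remaining lines, so the two lists cannot overlap. split_list is a hypothetical helper name.

```shell
# split_list ALL PERCENT TRAIN_OUT EVAL_OUT
# Write the first PERCENT% of the lines in ALL to TRAIN_OUT and the rest to EVAL_OUT.
split_list() {
  total=$(wc -l < "$1")
  train=$(( total * $2 / 100 ))          # integer division rounds down
  head -n "$train" "$1" > "$3"
  tail -n "+$(( train + 1 ))" "$1" > "$4"  # start right after the last train line
}
```

Usage: split_list data/all-lstmf 90 data/list.train data/list.eval. With 10 input lines and 90 percent, this writes 9 lines to the train list and 1 line to the eval list.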
>>>>
>>>> On Fri, Jun 29, 2018 at 5:31 PM Lorenzo Bolzani <l.bo...@gmail.com> wrote:
>>>>
>>>>> 
>>>>>
>>>>> Hi,
>>>>> I'm trying to do fine tuning of an existing model using line images 
>>>>> and text labels. I'm running this version:
>>>>>
>>>>> tesseract 4.0.0-beta.3-56-g5fda
>>>>>  leptonica-1.76.0
>>>>>   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : 
>>>>> libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
>>>>>  Found AVX2
>>>>>  Found AVX
>>>>>  Found SSE
>>>>>
>>>>>
>>>>>
>>>>> I used OCR-D to generate lstmf files for the demo data.
>>>>>
>>>>> If I run the make command it works fine. 
>>>>>
>>>>> make training MODEL_NAME=prova
>>>>>
>>>>> Now I isolated this command from the build:
>>>>>
>>>>> lstmtraining \
>>>>>   --traineddata data/prova/prova.traineddata \
>>>>>   --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 
>>>>> O1c`head -n1 data/unicharset`]" \
>>>>>   --model_output data/checkpoints/prova \
>>>>>   --learning_rate 20e-4 \
>>>>>   --train_listfile data/list.train \
>>>>>   --eval_listfile data/list.eval \
>>>>>   --max_iterations 10000
>>>>>
>>>>> and it works fine.
>>>>>
>>>>> Now I'm trying to modify it to fine-tune the existing eng model. I 
>>>>> made a few attempts, all ending in different errors (see the attached 
>>>>> file for full output).
>>>>>
>>>>> I used:
>>>>>
>>>>> combine_tessdata -e /usr/local/share/tessdata/eng.traineddata 
>>>>> extracted/eng.lstm
>>>>>
>>>>> to extract the eng.lstm model. 
>>>>>
>>>>> This seems to work, but I'm not sure it is correct.
>>>>>
>>>>> lstmtraining \
>>>>>   --continue_from  extracted/eng.lstm \
>>>>>   --traineddata data/prova/prova.traineddata \
>>>>>   --old_traineddata extracted/eng.traineddata \
>>>>>   --model_output data/checkpoints/prova \
>>>>>   --learning_rate 20e-4 \
>>>>>   --train_listfile data/list.train \
>>>>>   --eval_listfile data/list.eval \
>>>>>   --max_iterations 10000
>>>>>
>>>>> (extracted/eng.traineddata is just a copy of eng.traineddata)
>>>>>
>>>>>
>>>>> The training resumes exactly with the RMS of prova_checkpoint (6%), so 
>>>>> it looks like it is training from that checkpoint, not from eng.lstm.
>>>>>
>>>>> Is this correct? What should I change?
>>>>> 
>>>>> I'm following this guide:
>>>>>
>>>>>
>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters
>>>>>
>>>>> 
>>>>> I think continue_from and traineddata should refer to the eng model 
>>>>> and old_traineddata should point to prova.traineddata, but if I do that I 
>>>>> get a segmentation fault:
>>>>>
>>>>> [...]
>>>>> !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
>>>>> !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
>>>>> Segmentation fault
>>>>>
>>>>> What am I missing?
>>>>>
>>>>>
>>>>> Thanks, bye
>>>>>
>>>>> Lorenzo
>>>>>
>>>>> -- 
>>>>> You received this message because you are subscribed to the Google 
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to tesser...@googlegroups.com.
>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit 
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/CAMgOLLyOJN31PdWQumXPO3JjuAc1Yz2BZYpMd4ftzBHgZkEaxA%40mail.gmail.com
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>>
>>>> -- 
>>>>
>>>> ____________________________________________________________
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>>
>>
>>
>
>

