Hi Lorenzo and Shree,

Thanks for sharing. I am trying to reproduce what you have done here. I followed your posts and changed the Makefile, but when I run

$ make training

I get the following errors:

mkdir -p data/checkpoints
lstmtraining \
--continue_from extracted/eng.lstm \
--old_traineddata extracted/eng.traineddata \
--traineddata data/eng/eng.traineddata \
--model_output data/checkpoints/eng \
--debug_interval -1 \
--train_listfile data/list.train \
--eval_listfile data/list.eval \
--sequential_training \
--max_iterations 3000
Must provide a --traineddata see training wiki
Makefile:111: recipe for target 'data/checkpoints/eng_checkpoint' failed
make: *** [data/checkpoints/eng_checkpoint] Error 1

However, the same command works when I run it manually:

$ lstmtraining --traineddata data/eng/eng.traineddata --continue_from extracted/eng.lstm --old_traineddata extracted/eng.traineddata --model_output data/checkpoints/eng --debug_interval -1 --train_listfile data/list.train --eval_listfile data/list.eval --sequential_training --max_iterations 3000

I don't know what to change; I am new to both Tesseract and Makefiles. Please share your wisdom.

Thank you!

All the best,
Tairen

On Friday, June 29, 2018 at 11:17:35 AM UTC-7, Lorenzo Blz wrote:
>
> I think I found the problem. Running the new Makefile directly, I got this error:
>
> make: *** No rule to make target 'data/train/alexis_ruhe01_1852_0018_022.box', needed by 'data/all-boxes'. Stop.
>
> The problem was that the Makefile expects "-gt.txt" files, while my training files end in ".gt.txt".
> Now I can run your script directly.
>
> I also replaced the eng.traineddata with the one from here:
>
> https://github.com/tesseract-ocr/tessdata_best
>
> and it's training correctly. (It works correctly even with the previous model, from https://github.com/tesseract-ocr/tessdata.)
>
> One more question: I wanted to check whether the output character sets of the new and old models differ. I used:
>
> combine_tessdata -u eng.traineddata orig
>
> on both models and compared the unicharset files. I see that some characters are missing and some others are added. It looks good. Is this the correct way to check this?
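A minimal sketch of that comparison, assuming the unicharset layout in which the first line holds the entry count and every following line begins with the character itself. The function names `unicharset_chars` and `unicharset_diff`, and the `old/` and `new/` directories in the commented usage, are hypothetical and not part of Tesseract; only `combine_tessdata -u` and the `lstm-unicharset` component name come from the commands in this thread.

```shell
# Print the characters listed in a unicharset file, one per line, sorted.
# Assumption: line 1 is the entry count, and each later line starts with
# the character followed by a space and its properties.
unicharset_chars() {
    tail -n +2 "$1" | cut -d ' ' -f 1 | LC_ALL=C sort -u
}

# Print "- X" for characters found only in the old file and
# "+ X" for characters found only in the new file.
unicharset_diff() {
    old_list=$(mktemp) || return 1
    new_list=$(mktemp) || return 1
    unicharset_chars "$1" > "$old_list"
    unicharset_chars "$2" > "$new_list"
    LC_ALL=C comm -23 "$old_list" "$new_list" | sed 's/^/- /'
    LC_ALL=C comm -13 "$old_list" "$new_list" | sed 's/^/+ /'
    rm -f "$old_list" "$new_list"
}

# Hypothetical usage (the old/ and new/ directories must already exist):
#   combine_tessdata -u extracted/eng.traineddata old/eng.
#   combine_tessdata -u data/eng/eng.traineddata new/eng.
#   unicharset_diff old/eng.lstm-unicharset new/eng.lstm-unicharset
```

Lines prefixed with `-` are characters present only in the old model, and lines prefixed with `+` are present only in the new one.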
>
> Can I use this approach to train a model that, for example, recognizes only uppercase characters, or only numbers, simply by providing only uppercase training data? Or is there something else to configure?
>
> Thanks, bye
>
> Lorenzo
>
> 2018-06-29 18:27 GMT+02:00 Shree Devi Kumar <shree...@gmail.com>:
>
>> You should be able to use the new makefile after you change all the directory locations to match your setup.
>>
>> Change the language from frk to eng, though the sample training text seems to be non-English. In that case it is better to use the appropriate language traineddata, e.g. tessdata_best/deu.traineddata for German.
>>
>> On Fri, Jun 29, 2018 at 9:03 PM Lorenzo Bolzani <l.bo...@gmail.com> wrote:
>>
>>> Hi Shree, thanks for your answer.
>>>
>>> I tried the script with these settings:
>>>
>>> TESSDATA=extracted # here I have the eng.lstm and eng.traineddata
>>> LANGDATA=langdata-master # all langdata downloaded by OCR-D
>>>
>>> MODEL_NAME = eng
>>> CONTINUE_FROM = eng
>>>
>>> First I ran the old Makefile to create the boxes.
>>>
>>> $ make training MODEL_NAME=eng
>>>
>>> I stopped it as soon as the training started:
>>>
>>> At iteration 400/400/400, Mean rms=6.657%, delta=40.765%, char train=100.827%, word train=100%, skip ratio=0%, New worst char error = 100.827 wrote checkpoint.
>>>
>>> At iteration 500/500/500, Mean rms=6.644%, delta=40.423%, char train=100.662%, word train=100%, skip ratio=0%, New worst char error = 100.662 wrote checkpoint.
>>>
>>> ^Cmake: *** Deleting file 'data/checkpoints/eng_checkpoint'
>>> Makefile:110: recipe for target 'data/checkpoints/eng_checkpoint' failed
>>> make: *** [data/checkpoints/eng_checkpoint] Interrupt
>>>
>>> Notice that the data/checkpoints/eng_checkpoint file is deleted; I do not know whether that is relevant.
>>>
>>> Then I switch to the new one and I get this:
>>>
>>> $ make training
>>>
>>> mkdir -p data/checkpoints
>>> lstmtraining \
>>> --continue_from extracted/eng.lstm \
>>> --old_traineddata extracted/eng.traineddata \
>>> --traineddata data/eng/eng.traineddata \
>>> --model_output data/checkpoints/eng \
>>> --debug_interval -1 \
>>> --train_listfile data/list.train \
>>> --eval_listfile data/list.eval \
>>> --sequential_training \
>>> --max_iterations 3000
>>> Loaded file extracted/eng.lstm, unpacking...
>>> Warning: LSTMTrainer deserialized an LSTMRecognizer!
>>> Code range changed from 111 to 76!
>>> Num (Extended) outputs,weights in Series:
>>> 1,36,0,1:1, 0
>>> Num (Extended) outputs,weights in Series:
>>> C3,3:9, 0
>>> Ft16:16, 160
>>> Total weights = 160
>>> [C3,3Ft16]:16, 160
>>> Mp3,3:16, 0
>>> Lfys64:64, 20736
>>> Lfx96:96, 61824
>>> Lrx96:96, 74112
>>> Lfx512:512, 1247232
>>> Fc76:76, 0
>>> Total weights = 1404064
>>> Previous null char=110 mapped to 75
>>> Continuing from extracted/eng.lstm
>>> Loaded 1/1 pages (1-1) of document data/train/mueller_waldhornist_1821_0130_010.lstmf
>>> Loaded 1/1 pages (1-1) of document data/train/bismarck_erinnerungen02_1898_0274_002.lstmf
>>> Loaded 1/1 pages (1-1) of document data/train/spyri_heidi_1880_0062_005.lstmf
>>> Loaded 1/1 pages (1-1) of document data/train/novalis_ofterdingen_1802_0210_001.lstmf
>>> Iteration 0: ALIGNED TRUTH : Sparoͤfen kauft' ich auch und Sorgenstuͤhle,
>>> Iteration 0: BEST OCR TEXT : l bd o D V fc ds ft hs D t' dsu PM )k ,„cGs D t' D„Gs 'A AKG„9„t d tft ü!Vt Eb ht Ac )k uF ' K,cGPFVts
>>> File data/train/mueller_waldhornist_1821_0130_010.lstmf page 0 :
>>> !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
>>> !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
>>> Makefile:113: recipe for target 'data/checkpoints/eng_checkpoint' failed
>>> make: *** [data/checkpoints/eng_checkpoint] Segmentation fault
>>>
>>> What am I doing wrong?
>>>
>>> Lorenzo
>>>
>>> 2018-06-29 14:08 GMT+02:00 Shree Devi Kumar <shree...@gmail.com>:
>>>
>>>> I modified the makefile for ocrd-train to do fine-tuning. It is pasted below:
>>>>
>>>> export
>>>>
>>>> SHELL := /bin/bash
>>>> LOCAL := $(PWD)/usr
>>>> PATH := $(LOCAL)/bin:$(PATH)
>>>> HOME := /home/ubuntu
>>>> TESSDATA = $(HOME)/tessdata_best
>>>> LANGDATA = $(HOME)/langdata
>>>>
>>>> # Name of the model to be built
>>>> MODEL_NAME = frk
>>>>
>>>> # Name of the model to continue from
>>>> CONTINUE_FROM = frk
>>>>
>>>> # Normalization Mode - see src/training/language_specific.sh for details
>>>> NORM_MODE = 2
>>>>
>>>> # Tesseract model repo to use. Default: $(TESSDATA_REPO)
>>>> TESSDATA_REPO = _best
>>>>
>>>> # Train directory
>>>> TRAIN := data/train
>>>>
>>>> # BEGIN-EVAL makefile-parser --make-help Makefile
>>>>
>>>> help:
>>>> 	@echo ""
>>>> 	@echo "  Targets"
>>>> 	@echo ""
>>>> 	@echo "    unicharset       Create unicharset"
>>>> 	@echo "    lists            Create lists of lstmf filenames for training and eval"
>>>> 	@echo "    training         Start training"
>>>> 	@echo "    proto-model      Build the proto model"
>>>> 	@echo "    leptonica        Build leptonica"
>>>> 	@echo "    tesseract        Build tesseract"
>>>> 	@echo "    tesseract-langs  Download tesseract-langs"
>>>> 	@echo "    langdata         Download langdata"
>>>> 	@echo "    clean            Clean all generated files"
>>>> 	@echo ""
>>>> 	@echo "  Variables"
>>>> 	@echo ""
>>>> 	@echo "    MODEL_NAME         Name of the model to be built"
>>>> 	@echo "    CORES              No of cores to use for compiling leptonica/tesseract"
>>>> 	@echo "    LEPTONICA_VERSION  Leptonica version. Default: $(LEPTONICA_VERSION)"
>>>> 	@echo "    TESSERACT_VERSION  Tesseract commit. Default: $(TESSERACT_VERSION)"
>>>> 	@echo "    LANGDATA_VERSION   Tesseract langdata version. Default: $(LANGDATA_VERSION)"
>>>> 	@echo "    TESSDATA_REPO      Tesseract model repo to use. Default: $(TESSDATA_REPO)"
>>>> 	@echo "    TRAIN              Train directory"
>>>> 	@echo "    RATIO_TRAIN        Ratio of train / eval training data"
>>>>
>>>> # END-EVAL
>>>>
>>>> # Ratio of train / eval training data
>>>> RATIO_TRAIN := 0.90
>>>>
>>>> ALL_BOXES = data/all-boxes
>>>> ALL_LSTMF = data/all-lstmf
>>>>
>>>> # Create unicharset
>>>> unicharset: data/unicharset
>>>>
>>>> # Create lists of lstmf filenames for training and eval
>>>> lists: $(ALL_LSTMF) data/list.train data/list.eval
>>>>
>>>> # Training list: the first RATIO_TRAIN share of the shuffled lstmf list
>>>> data/list.train: $(ALL_LSTMF)
>>>> 	total=`cat $(ALL_LSTMF) | wc -l` \
>>>> 	   no=`echo "$$total * $(RATIO_TRAIN) / 1" | bc`; \
>>>> 	   head -n "$$no" $(ALL_LSTMF) > "$@"
>>>>
>>>> # Eval list: the last (1 - RATIO_TRAIN) share, disjoint from the training list
>>>> data/list.eval: $(ALL_LSTMF)
>>>> 	total=`cat $(ALL_LSTMF) | wc -l` \
>>>> 	   no=`echo "($$total - $$total * $(RATIO_TRAIN)) / 1" | bc`; \
>>>> 	   tail -n "$$no" $(ALL_LSTMF) > "$@"
>>>>
>>>> # Start training
>>>> training: data/$(MODEL_NAME).traineddata
>>>>
>>>> data/unicharset: $(ALL_BOXES)
>>>> 	combine_tessdata -u $(TESSDATA)/$(CONTINUE_FROM).traineddata $(TESSDATA)/$(CONTINUE_FROM).
>>>> 	unicharset_extractor --output_unicharset "$(TRAIN)/my.unicharset" --norm_mode $(NORM_MODE) "$(ALL_BOXES)"
>>>> 	merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset "$@"
>>>>
>>>> $(ALL_BOXES): $(sort $(patsubst %.tif,%.box,$(wildcard $(TRAIN)/*.tif)))
>>>> 	find $(TRAIN) -name '*.box' -exec cat {} \; > "$@"
>>>>
>>>> $(TRAIN)/%.box: $(TRAIN)/%.tif $(TRAIN)/%-gt.txt
>>>> 	python generate_line_box.py -i "$(TRAIN)/$*.tif" -t "$(TRAIN)/$*-gt.txt" > "$@"
>>>>
>>>> $(ALL_LSTMF): $(sort $(patsubst %.tif,%.lstmf,$(wildcard $(TRAIN)/*.tif)))
>>>> 	find $(TRAIN) -name '*.lstmf' -exec echo {} \; | sort -R -o "$@"
>>>>
>>>> $(TRAIN)/%.lstmf: $(TRAIN)/%.box
>>>> 	tesseract $(TRAIN)/$*.tif $(TRAIN)/$* --psm 6 lstm.train
>>>>
>>>> # Build the proto model
>>>> proto-model: data/$(MODEL_NAME)/$(MODEL_NAME).traineddata
>>>>
>>>> data/$(MODEL_NAME)/$(MODEL_NAME).traineddata: $(LANGDATA) data/unicharset
>>>> 	combine_lang_model \
>>>> 	  --input_unicharset data/unicharset \
>>>> 	  --script_dir $(LANGDATA) \
>>>> 	  --words $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).wordlist \
>>>> 	  --numbers $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).numbers \
>>>> 	  --puncs $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).punc \
>>>> 	  --output_dir data/ \
>>>> 	  --lang $(MODEL_NAME)
>>>>
>>>> data/checkpoints/$(MODEL_NAME)_checkpoint: unicharset lists proto-model
>>>> 	mkdir -p data/checkpoints
>>>> 	lstmtraining \
>>>> 	  --continue_from $(TESSDATA)/$(CONTINUE_FROM).lstm \
>>>> 	  --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
>>>> 	  --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
>>>> 	  --model_output data/checkpoints/$(MODEL_NAME) \
>>>> 	  --debug_interval -1 \
>>>> 	  --train_listfile data/list.train \
>>>> 	  --eval_listfile data/list.eval \
>>>> 	  --sequential_training \
>>>> 	  --max_iterations 3000
>>>>
>>>> data/$(MODEL_NAME).traineddata: data/checkpoints/$(MODEL_NAME)_checkpoint
>>>> 	lstmtraining \
>>>> 	  --stop_training \
>>>> 	  --continue_from $^ \
>>>> 	  --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
>>>> 	  --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
>>>> 	  --model_output $@
>>>>
>>>> # Clean all generated files
>>>> clean:
>>>> 	find data/train -name '*.box' -delete
>>>> 	find data/train -name '*.lstmf' -delete
>>>> 	rm -rf data/all-*
>>>> 	rm -rf data/list.*
>>>> 	rm -rf data/$(MODEL_NAME)
>>>> 	rm -rf data/unicharset
>>>> 	rm -rf data/checkpoints
>>>>
>>>> On Fri, Jun 29, 2018 at 5:31 PM Lorenzo Bolzani <l.bo...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>> I'm trying to fine-tune an existing model using line images and text labels. I'm running this version:
>>>>>
>>>>> tesseract 4.0.0-beta.3-56-g5fda
>>>>> leptonica-1.76.0
>>>>> libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.4 : libopenjp2 2.3.0
>>>>> Found AVX2
>>>>> Found AVX
>>>>> Found SSE
>>>>>
>>>>> I used OCR-D to generate lstmf files for the demo data.
>>>>>
>>>>> If I run the make command it works fine.
>>>>>
>>>>> make training MODEL_NAME=prova
>>>>>
>>>>> Now I isolated this command from the build:
>>>>>
>>>>> lstmtraining \
>>>>> --traineddata data/prova/prova.traineddata \
>>>>> --net_spec "[1,36,0,1 Ct3,3,16 Mp3,3 Lfys48 Lfx96 Lrx96 Lfx256 O1c`head -n1 data/unicharset`]" \
>>>>> --model_output data/checkpoints/prova \
>>>>> --learning_rate 20e-4 \
>>>>> --train_listfile data/list.train \
>>>>> --eval_listfile data/list.eval \
>>>>> --max_iterations 10000
>>>>>
>>>>> and it works fine.
>>>>>
>>>>> Now I'm trying to modify it to fine-tune the existing eng model. I made a few attempts, all ending in different errors (see the attached file for full output).
>>>>>
>>>>> I used:
>>>>>
>>>>> combine_tessdata -e /usr/local/share/tessdata/eng.traineddata extracted/eng.lstm
>>>>>
>>>>> to extract the eng.lstm model.
>>>>>
>>>>> This seems to work, but I'm not sure it is correct.
>>>>>
>>>>> lstmtraining \
>>>>> --continue_from extracted/eng.lstm \
>>>>> --traineddata data/prova/prova.traineddata \
>>>>> --old_traineddata extracted/eng.traineddata \
>>>>> --model_output data/checkpoints/prova \
>>>>> --learning_rate 20e-4 \
>>>>> --train_listfile data/list.train \
>>>>> --eval_listfile data/list.eval \
>>>>> --max_iterations 10000
>>>>>
>>>>> (extracted/eng.traineddata is just a copy of eng.traineddata)
>>>>>
>>>>> The training resumes exactly at the RMS of prova_checkpoint (6%), so it looks like it is training from that checkpoint, not from eng.lstm.
>>>>>
>>>>> Is this correct? What should I change?
>>>>>
>>>>> I'm following this guide:
>>>>>
>>>>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters
>>>>>
>>>>> I think continue_from and traineddata should refer to the eng model and old_traineddata should point to prova.traineddata, but if I do that I get a segmentation fault:
>>>>>
>>>>> [...]
>>>>> !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
>>>>> !int_mode_:Error:Assert failed:in file weightmatrix.cpp, line 244
>>>>> Segmentation fault
>>>>>
>>>>> What am I missing?
>>>>>
>>>>> Thanks, bye
>>>>>
>>>>> Lorenzo
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send an email to tesser...@googlegroups.com.
>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAMgOLLyOJN31PdWQumXPO3JjuAc1Yz2BZYpMd4ftzBHgZkEaxA%40mail.gmail.com.
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>> --
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com