Re: [tesseract-ocr] Can I mix tiff/box files generated by ocrd-train with original training data used to train specific language in tesseract4 (from langdata directory)

Raniem AROUR Tue, 04 Sep 2018 07:12:19 -0700

Thanks, Shree, for your quick reply.
I have already used the altered version of the Makefile for fine-tuning 
that you shared in one of the threads I referenced in my original post. I 
also tried this one, which is the same except that it passes 
--pass_through_recoder to combine_lang_model. I will look into what 
difference that makes, unless you can advise me.
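
To make the unicharset point concrete: the Makefile merges the original unicharset with the one extracted from the new boxes so that all characters are included. Below is a minimal Python sketch of a related check, listing which characters of a ground-truth line are absent from a unicharset. It assumes the usual text layout of a unicharset file (first line is the entry count, then one entry per line with the character as the first space-separated field, and `NULL` standing for the space character). `load_unichars` and `missing_chars` are hypothetical helper names, not Tesseract tools, and the check is per code point only.

```python
def load_unichars(unicharset_text):
    """Return the set of unichar strings listed in a unicharset file's text.

    Assumed layout: line 1 is the number of entries; each later line starts
    with the unichar, followed by space-separated properties.
    """
    unichars = set()
    for line in unicharset_text.splitlines()[1:]:
        if not line.strip():
            continue
        first = line.split(" ", 1)[0]
        # Assumption: the entry named NULL stands for the space character.
        unichars.add(" " if first == "NULL" else first)
    return unichars


def missing_chars(transcription, unichars):
    """Return characters of the transcription not covered, in first-seen order."""
    missing = []
    for ch in transcription:
        if ch not in unichars and ch not in missing:
            missing.append(ch)
    return missing


if __name__ == "__main__":
    # Tiny made-up unicharset, only for illustration (not real dan data).
    sample = "5\nNULL 0 Common 0\ne 3 Latin 1\nr 3 Latin 2\nI 5 Latin 3\nf 3 Latin 4\n"
    print(missing_chars("Ifö er", load_unichars(sample)))  # prints ['ö']
```

Running this over all ground-truth lines against both the original and the merged unicharset would show whether the merge actually added anything.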


I appreciate the support, but my main question is about your suggestion of 
merging training data as in this thread 
<https://github.com/tesseract-ocr/tesseract/issues/1172>:

I copy them to my langdata/language directory and then use a modified 
tesstrain.sh to copy them to the tmp training directory. tesstrain.sh 
changes:
```
mkdir -p ${TRAINING_DIR}
tlog "\n=== Starting training for language '${LANG_CODE}'"
cp ../langdata/${LANG_CODE}/*.box ${TRAINING_DIR}
cp ../langdata/${LANG_CODE}/*.tif ${TRAINING_DIR}
ls -l ${TRAINING_DIR}
source "$(dirname $0)/language-specific.sh"
```
After those steps, tesstrain.sh worked and generated *.lstmf files, 
which I copied to where my training data is, and then I ran "make training" 
again. The process completed and produced a final model, but there were some 
errors like the one I quoted in my original post. The final unicharset 
is identical to the one from the original model, yet accuracy has 
regressed compared to the original model.
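
For what it is worth, the "Failure bytes" in the error from my original post can be turned back into text to see where encoding stopped. Here is a minimal Python sketch, assuming the dump is UTF-8 with bytes above 0x7f printed sign-extended (for example `ffffffc3`); `decode_failure_bytes` is a hypothetical helper name.

```python
def decode_failure_bytes(hex_dump):
    """Decode a 'Failure bytes' dump into text.

    Bytes above 0x7f appear sign-extended (e.g. 'ffffffc3'), so only the
    last two hex digits of each token are kept.
    """
    raw = bytes(int(token[-2:], 16) for token in hex_dump.split())
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    dump = "ffffffc3 ffffffb6 20 65 72 20 31 2e 34 35 24 2e"
    print(decode_failure_bytes(dump))  # prints: ö er 1.45$.
```

The dump starts at the "ö" of "Ifö", which suggests (my reading, not confirmed) that "ö" is the character the model's unicharset could not encode.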

I thought it might be a bad idea to merge data from ocrd-train with the 
original data, since the box formats look different, and wanted to get 
your advice.
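
To illustrate the difference I mean: as far as I understand it (this is my assumption, not something confirmed in this thread), the boxes written by generate_line_box.py give every character the bounding box of the whole line image and close the line with a tab entry, whereas boxes made from rendered text carry per-character coordinates. Below is a minimal Python sketch of the line-level layout; `line_box` is a hypothetical name and not the actual script.

```python
def line_box(text, width, height, page=0):
    """Build box-file lines where every character shares the line's bounding box.

    Each entry is '<char> <left> <bottom> <right> <top> <page>'. A final tab
    entry marks the end of the text line (assumed convention).
    """
    entries = ["%s 0 0 %d %d %d" % (ch, width, height, page) for ch in text]
    entries.append("\t %d %d %d %d %d" % (width, height, width + 1, height + 1, page))
    return entries


if __name__ == "__main__":
    for entry in line_box("og", 100, 20):
        print(entry)
```

For a 100x20 line image containing "og", both characters get the box `0 0 100 20`, so the boxes carry the transcription but no per-character positions.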

Thanks, I appreciate all the time you spend supporting people.


Regards

On Tuesday, September 4, 2018 at 2:30:08 PM UTC+1, shree wrote:

> For fine-tuning, I like to use the original unicharset along with the 
> unicharset from the training set so that all characters are included.
>
> Please see below a modified makefile that can be used for this - please 
> make changes as per your requirements.
>
> export
>
> SHELL := /bin/bash
> LOCAL := $(PWD)/usr
> PATH := $(LOCAL)/bin:$(PATH)
> HOME := /home/ubuntu
> TESSDATA =  $(HOME)/tessdata_best
> LANGDATA = $(HOME)/langdata
>
> # Name of the model to be built
> MODEL_NAME = san
>
> # Name of the model to continue from
> CONTINUE_FROM = san
>
> # Normalization Mode - see src/training/language-specific.sh for details
> NORM_MODE = 2
>
> # Tesseract model repo to use. Default: $(TESSDATA_REPO)
> TESSDATA_REPO = _best
>
> # Train directory
> TRAIN := data/train
>
> # BEGIN-EVAL makefile-parser --make-help Makefile
>
> help:
> @echo ""
> @echo "  Targets"
> @echo ""
> @echo "    unicharset       Create unicharset"
> @echo "    lists            Create lists of lstmf filenames for training and eval"
> @echo "    training         Start training"
> @echo "    proto-model      Build the proto model"
> @echo "    leptonica        Build leptonica"
> @echo "    tesseract        Build tesseract"
> @echo "    tesseract-langs  Download tesseract-langs"
> @echo "    langdata         Download langdata"
> @echo "    clean            Clean all generated files"
> @echo ""
> @echo "  Variables"
> @echo ""
> @echo "    MODEL_NAME         Name of the model to be built"
> @echo "    CORES              No of cores to use for compiling leptonica/tesseract"
> @echo "    LEPTONICA_VERSION  Leptonica version. Default: $(LEPTONICA_VERSION)"
> @echo "    TESSERACT_VERSION  Tesseract commit. Default: $(TESSERACT_VERSION)"
> @echo "    LANGDATA_VERSION   Tesseract langdata version. Default: $(LANGDATA_VERSION)"
> @echo "    TESSDATA_REPO      Tesseract model repo to use. Default: $(TESSDATA_REPO)"
> @echo "    TRAIN              Train directory"
> @echo "    RATIO_TRAIN        Ratio of train / eval training data"
>
> # END-EVAL
>
> # Ratio of train / eval training data
> RATIO_TRAIN := 0.90
>
> ALL_BOXES = data/all-boxes
> ALL_LSTMF = data/all-lstmf
>
> # Create unicharset
> unicharset: data/unicharset
>
> # Create lists of lstmf filenames for training and eval
> lists: $(ALL_LSTMF) data/list.train data/list.eval
>
> data/list.train: $(ALL_LSTMF)
> total=`cat $(ALL_LSTMF) | wc -l` \
>    no=`echo "$$total * $(RATIO_TRAIN) / 1" | bc`; \
>    head -n "$$no" $(ALL_LSTMF) > "$@"
>
> data/list.eval: $(ALL_LSTMF)
> total=`cat $(ALL_LSTMF) | wc -l` \
>    no=`echo "($$total - $$total * $(RATIO_TRAIN)) / 1" | bc`; \
>    tail -n "$$no" $(ALL_LSTMF) > "$@"
>
> # Start training
> training: data/$(MODEL_NAME).traineddata
>
> data/unicharset: $(ALL_BOXES)
> combine_tessdata -u $(TESSDATA)/$(CONTINUE_FROM).traineddata $(TESSDATA)/$(CONTINUE_FROM).
> unicharset_extractor --output_unicharset "$(TRAIN)/my.unicharset" --norm_mode $(NORM_MODE) "$(ALL_BOXES)"
> merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset $(TRAIN)/my.unicharset "$@"
>
> $(ALL_BOXES): $(sort $(patsubst %.tif,%.box,$(wildcard $(TRAIN)/*.tif)))
> find $(TRAIN) -name '*.box' -exec cat {} \; > "$@"
>
> $(TRAIN)/%.box: $(TRAIN)/%.tif $(TRAIN)/%-gt.txt
> python generate_line_box.py -i "$(TRAIN)/$*.tif" -t "$(TRAIN)/$*-gt.txt" > "$@"
>
> $(ALL_LSTMF): $(sort $(patsubst %.tif,%.lstmf,$(wildcard $(TRAIN)/*.tif)))
> find $(TRAIN) -name '*.lstmf' -exec echo {} \; | sort -R -o "$@"
>
> $(TRAIN)/%.lstmf: $(TRAIN)/%.box
> tesseract $(TRAIN)/$*.tif $(TRAIN)/$*   --psm 6 lstm.train
>
> # Build the proto model
> proto-model: data/$(MODEL_NAME)/$(MODEL_NAME).traineddata
>
> data/$(MODEL_NAME)/$(MODEL_NAME).traineddata: $(LANGDATA) data/unicharset
> combine_lang_model \
>   --input_unicharset data/unicharset \
>   --pass_through_recoder \
>   --script_dir $(LANGDATA) \
>   --words $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).wordlist \
>   --numbers $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).numbers \
>   --puncs $(LANGDATA)/$(MODEL_NAME)/$(MODEL_NAME).punc \
>   --output_dir data/ \
>   --lang $(MODEL_NAME)
>
> data/checkpoints/$(MODEL_NAME)_checkpoint: unicharset lists proto-model
> mkdir -p data/checkpoints
> lstmtraining \
>   --continue_from   $(TESSDATA)/$(CONTINUE_FROM).lstm \
>   --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
>   --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
>   --model_output data/checkpoints/$(MODEL_NAME) \
>   --debug_interval -1 \
>   --train_listfile data/list.train \
>   --eval_listfile data/list.eval \
>   --sequential_training \
>   --max_iterations 3000
>
> data/$(MODEL_NAME).traineddata: data/checkpoints/$(MODEL_NAME)_checkpoint
> lstmtraining \
> --stop_training \
> --continue_from $^ \
> --old_traineddata $(TESSDATA)/$(CONTINUE_FROM).traineddata \
> --traineddata data/$(MODEL_NAME)/$(MODEL_NAME).traineddata \
> --model_output $@
>
> # Clean all generated files
> clean:
> find data/train -name '*.box' -delete
> find data/train -name '*.lstmf' -delete
> rm -rf data/all-*
> rm -rf data/list.*
> rm -rf data/$(MODEL_NAME)
> rm -rf data/unicharset
> rm -rf data/checkpoints
>
>
>
> On Tue, Sep 4, 2018 at 4:48 PM, Raniem AROUR <raniem...@gmail.com> wrote:
>
>> Hello,
>>
>> I am trying to fine-tune dan.traineddata for my specific use case. 
>> However, the model is overfitting on my data and seems to be forgetting 
>> the original data it was trained on. I remember reading somewhere that 
>> this can be solved by showing the original training data to the network, 
>> so that performance does not regress relative to the original model.
>>
>> I have images and their corresponding ground truth files, so I 
>> used ocrd-train <https://github.com/OCR-D/ocrd-train> to do the 
>> fine-tuning earlier (using some advice found in this thread 
>> <https://groups.google.com/forum/#!searchin/tesseract-ocr/fine$20tuning$20english$20language%7Csort:date/tesseract-ocr/be4-rjvY2tQ/32evtMHlAQAJ>, 
>> thanks to Shree).
>> I then mixed my training data with the original training data using 
>> the hints provided by Shree in this thread 
>> <https://github.com/tesseract-ocr/tesseract/issues/1172>.
>>
>> The command I used after updating tesstrain.sh as recommended was:
>>
>> ~/tesseract/src/training/tesstrain.sh --fonts_dir /usr/share/fonts --lang dan --linedata_only \
>>   --noextract_font_properties --langdata_dir /home/my_user/ocrd-train/langdata \
>>   --tessdata_dir /home/my_user/tesseract/tessdata \
>>   --output_dir /home/my_user/my_models/danNew/
>>
>>
>>
>> Then I ran "make training" in the ocrd-train directory, as I 
>> usually do for fine-tuning. The fine-tuning started, but I got some 
>> errors that I believe result from the original data:
>> e.g. Encoding of string failed! Failure bytes: ffffffc3 ffffffb6 20 65 72 
>> 20 31 2e 34 35 24 2e 20 74 69 64 6c 69 67 65 72 65 20 31 37 2e 20 68 61 76 
>> 65 20 6d 61 6e 67 65 20 4e 59 20 2d 20 76 ffffffc3 ffffffa6 72 65 20 69 20 
>> 53 ffffffc3 ffffff85 20 43 61 6e 61 6c 2b 20 6f 67
>> Can't encode transcription: 'har Søg butik været blevet Ifö er 1.45$. 
>> tidligere 17. have mange NY - være i SÅ Canal+ og' in language ''
>> Encoding of string failed! Failure bytes: ffffffc3 ffffffb6 20 65 72 20 
>> 31 2e 34 35 24 2e 20 74 69 64 6c 69 67 65 72 65 20 31 37 2e 20 68 61 76 65 
>> 20 6d 61 6e 67 65 20 4e 59 20 2d 20 76 ffffffc3 ffffffa6 72 65 20 69 20 53 
>> ffffffc3 ffffff85 20 43 61 6e 61 6c 2b 20 6f 67
>> Can't encode transcription: 'har Søg butik været blevet Ifö er 1.45$. 
>> tidligere 17. have mange NY - være i SÅ Canal+ og' in language ''
>>
>> P.S. I know the box files produced by ocrd-train look different from the 
>> usual box files used for training tesseract4, but they worked for 
>> fine-tuning other models, and I was wondering whether it is a bad idea 
>> to mix them this way.
>>
>> What could have gone wrong in this process? I appreciate every 
>> suggestion.
>>
>>
>> Kind Regards
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to tesseract-oc...@googlegroups.com.
>> To post to this group, send email to tesser...@googlegroups.com.
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/e9676a7b-7396-4d05-8978-97c9bfbc387f%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
>
> -- 
>
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
