Re: [tesseract-ocr] Re: Retrain tesseract 4 model from real image (not from text file and tesstrain.sh)

Lorenzo Bolzani Sat, 27 Oct 2018 03:50:48 -0700

Check the unicharset file to see if all the characters you want to
recognize are there.


mkdir -p output_dir
combine_tessdata -u trained_model.traineddata output_dir/trained_model.
cat output_dir/*unicharset
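
Reading the whole file by eye is error-prone, so here is a minimal sketch of the same check in shell. It assumes the usual unicharset layout, where the first line is the entry count and each following line starts with the character followed by a space. The function name missing_chars and the example path are hypothetical, not part of Tesseract.

```shell
# Hypothetical helper (not a Tesseract tool): print every wanted character
# that has no entry in the given unicharset file.
# Usage: missing_chars UNICHARSET_FILE CHAR [CHAR...]
missing_chars() {
    unicharset_file=$1
    shift
    for ch in "$@"; do
        # Assumption: line 1 is the entry count, and field 1 of every other
        # line is the character itself. Match it exactly and literally.
        if ! tail -n +2 "$unicharset_file" | cut -d' ' -f1 | grep -Fxq -- "$ch"; then
            printf '%s\n' "$ch"
        fi
    done
}

# Example call (the path is a placeholder for wherever the file was unpacked):
# missing_chars path/to/model.lstm-unicharset ':' '1'
```

No output means every listed character is already present. Note that a space cannot be checked this way, since the file does not store it as a literal space.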


Otherwise you need to merge the old one with the new one before training.

This is how ocrd-train <https://github.com/OCR-D/ocrd-train> does it (you
could try to use it BTW).

combine_tessdata -u $(TESSDATA)/$(CONTINUE_FROM).traineddata \
    $(TESSDATA)/$(CONTINUE_FROM).
unicharset_extractor --output_unicharset "$(TRAIN)/my.unicharset" \
    --norm_mode $(NORM_MODE) "$(ALL_BOXES)"
merge_unicharsets $(TESSDATA)/$(CONTINUE_FROM).lstm-unicharset \
    $(TRAIN)/my.unicharset merged.unicharset

Here my.unicharset is the new one, $(CONTINUE_FROM).lstm-unicharset is the
old one, NORM_MODE = 2, and ALL_BOXES is a file with all the box file names.

And then something like this:

combine_tessdata -o continue_from.traineddata merged.unicharset

It's probably the same thing that Qt-box-editor does. I never tried this; I
use ocrd-train, which does things in a slightly different way.

At the very beginning of training, lstmtraining will report if the set of
characters differs from that of the previous model.
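
The same comparison can be made before training starts. This is a minimal sketch, assuming both unicharset files use the usual layout (entry count on the first line, then one entry per line beginning with the character). The function name unicharset_diff is hypothetical, and it is not how lstmtraining itself performs the check.

```shell
# Hypothetical helper (not a Tesseract tool): print the characters that are
# in the NEW unicharset but missing from the OLD one.
# Usage: unicharset_diff OLD_UNICHARSET NEW_UNICHARSET
unicharset_diff() {
    old_list=$(mktemp) && new_list=$(mktemp) || return 1
    # Drop the count line, keep field 1 (the character), sort bytewise so
    # that comm sees both lists in the same order.
    tail -n +2 "$1" | cut -d' ' -f1 | LC_ALL=C sort -u > "$old_list"
    tail -n +2 "$2" | cut -d' ' -f1 | LC_ALL=C sort -u > "$new_list"
    # -13 suppresses lines only in OLD and lines common to both.
    LC_ALL=C comm -13 "$old_list" "$new_list"
    rm -f "$old_list" "$new_list"
}
```

If this prints anything, the new training data contains characters the old model does not know, which is the case where the unicharsets need to be merged.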



Bye

Lorenzo

On Sat, 27 Oct 2018 at 08:04, tu tonquang <tonquangt...@gmail.com>
wrote:

> It's similar to my problem. It recognizes the special characters (the
> newly trained data) well, but misrecognizes normal characters and words.
>
> On Sat, 27 Oct 2018 at 11:29, Sreehari B S <sreeharib...@gmail.com> wrote:
>
>> Hi,
>>
>> Something similar happened when I fine-tuned for ":". When doing OCR, it
>> recognized some ":" as "1", so I fine-tuned for that.
>>
>> Now when I OCR ":", it works well. But when I OCR some real data, it's now
>> worse than before.
>>
>> * I trained on the best eng.traineddata.
>> * I created the boxes using Tesseract's makebox command and edited them
>> using jTessBoxEditor, but the box dimensions were not perfect.
>> Note: I trained from a real image. (Do I really need to edit the
>> coordinates by hand to adjust the dimensions?)
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to tesseract-ocr+unsubscr...@googlegroups.com.
>> To post to this group, send email to tesseract-ocr@googlegroups.com.
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/f89a3852-3d89-477f-ad58-6cf2cea12aab%40googlegroups.com
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>

