[tesseract-ocr] How can jpn word separation be improved with fine tuning?

Yudai Sano Tue, 30 Nov 2021 23:07:44 -0800

Hi, tesseract-ocr group.

I have a question about the subject.


When I run OCR on Japanese text using best/jpn.traineddata, addresses and 
bank names are split into several words, as in the following examples.

・Ex1
　- Document Text : 東京都渋谷区桜丘町
　- Word output : 東京, 都, 渋谷, 区, 桜丘, 町
・Ex2
　- Document Text : 三菱東京UFJ銀行
　- Word output : 三菱, 東京, UFJ, 銀行
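
To make the word output concrete: a minimal sketch of how word-level results like these can be read, assuming they come from Tesseract's TSV output (for example `tesseract image.png stdout -l jpn tsv`), where rows with level 5 are words. The function name `words_per_line` and the sample rows (coordinates and confidence values) are hypothetical and only mirror Ex1; they are not taken from an actual run.

```python
import csv
import io
from collections import defaultdict

def words_per_line(tsv_text):
    """Group the word rows (level 5) of Tesseract TSV output by text line."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    lines = defaultdict(list)
    for row in reader:
        text = (row.get("text") or "").strip()
        # Skip page/block/paragraph/line rows and empty words.
        if row["level"] != "5" or not text:
            continue
        key = (row["page_num"], row["block_num"], row["par_num"], row["line_num"])
        lines[key].append(text)
    return list(lines.values())

# Hypothetical TSV rows mirroring Ex1 (geometry and conf are made up).
SAMPLE_TSV = "\n".join("\t".join(r) for r in [
    ["level", "page_num", "block_num", "par_num", "line_num", "word_num",
     "left", "top", "width", "height", "conf", "text"],
    ["4", "1", "1", "1", "1", "0", "10", "10", "300", "30", "-1", ""],
    ["5", "1", "1", "1", "1", "1", "10", "10", "60", "30", "90", "東京"],
    ["5", "1", "1", "1", "1", "2", "70", "10", "30", "30", "90", "都"],
    ["5", "1", "1", "1", "1", "3", "100", "10", "60", "30", "90", "渋谷"],
    ["5", "1", "1", "1", "1", "4", "160", "10", "30", "30", "90", "区"],
    ["5", "1", "1", "1", "1", "5", "190", "10", "60", "30", "90", "桜丘"],
    ["5", "1", "1", "1", "1", "6", "250", "10", "30", "30", "90", "町"],
])

for words in words_per_line(SAMPLE_TSV):
    print(words, "->", "".join(words))
```

Joining the words of a line only reassembles the text afterwards; it does not change where Tesseract places the word breaks.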

I want each of these strings to be output as a single word rather than split 
as shown in the examples. To that end I am fine tuning the model, but fine 
tuning only changes the recognized characters; the word breaks do not improve.

Is there a way to improve this situation?

What I have tried so far is described below.

・ Environment
- Tesseract version: 4.1.1
- OS: ubuntu 18.04
- tessdata: https://github.com/tesseract-ocr/tessdata
- langdata: https://github.com/tesseract-ocr/langdata
- jpn.wordlist: a list of the strings I want recognized as a single word
  (ex. 東京都渋谷区桜丘町, 三菱東京UFJ銀行, etc...)
- jpn.training_text: a document generated randomly from jpn.wordlist
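
For illustration, a minimal sketch of one way such a training text could be generated from the word list. The exact procedure is not specified above, so everything here is an assumption: the function name `make_training_text`, the number of lines, the number of entries per line, and the use of a single space between entries are all hypothetical.

```python
import random

def make_training_text(wordlist, n_lines=100, words_per_line=5, seed=0):
    """Build a training text by sampling wordlist entries at random.

    Entries on a line are separated by one space (an assumption).
    """
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    lines = []
    for _ in range(n_lines):
        picked = [rng.choice(wordlist) for _ in range(words_per_line)]
        lines.append(" ".join(picked))
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # The two entries given as examples of jpn.wordlist content.
    wordlist = ["東京都渋谷区桜丘町", "三菱東京UFJ銀行"]
    print(make_training_text(wordlist, n_lines=3, words_per_line=2), end="")
```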

・ Command
# create train data
tesstrain.sh \
    --fonts_dir /usr/share/fonts \
    --lang jpn \
    --linedata_only \
    --noextract_font_properties \
    --langdata_dir ./langdata \
    --tessdata_dir /usr/local/share/tessdata \
    --output_dir ./output/jpn \
    --training_text ./langdata/jpn/jpn.training_text

# train
lstmtraining --model_output ./model_output/ \
    --traineddata /usr/local/share/tessdata/best/jpn.traineddata \
    --old_traineddata /usr/local/share/tessdata/best/jpn.traineddata \
    --continue_from ./output/jpn/jpn.lstm \
    --train_listfile ./output/jpn/jpn.training_files.txt \
    --max_iterations 200 \
    --debug_interval -1 \
    --append_index 5 --net_spec '[Lfx512 O1c1]' \
    --learning_rate 20e-4 &> ./output/jpn/train.log

# convert to trained data
lstmtraining --stop_training \
    --continue_from ./model_output/_checkpoint \
    --traineddata /usr/local/share/tessdata/best/jpn/jpn.traineddata \
    --model_output ./model_output/jpn.traineddata

Please let me know if any information is missing.

Thank you.
