Hi, tesseract-ocr group. I have a question about the subject.
If I perform OCR in Japanese using best/jpn.traineddata, address and bank name text is split into separate words, as follows.

・Ex1
  - Document text: 東京都渋谷区桜丘町
  - Word output: 東京, 都, 渋谷, 区, 桜丘, 町
・Ex2
  - Document text: 三菱東京UFJ銀行
  - Word output: 三菱, 東京, UFJ, 銀行

I want each of these to be output as a single word instead. To that end I am fine-tuning the model, but only the character recognition results change; the word breaks do not improve. Is there a way to improve this situation?

What I have tried so far is described below.

・Environment
  - Tesseract version: 4.1.1
  - OS: Ubuntu 18.04
  - tessdata: https://github.com/tesseract-ocr/tessdata
  - langdata: https://github.com/tesseract-ocr/langdata
  - jpn.wordlist: a list of the character strings I want recognized as one word (e.g. 東京都渋谷区桜丘町, 三菱東京UFJ銀行, etc.)
  - jpn.training_text: a document randomly generated from jpn.wordlist

・Command

  # create training data
  tesstrain.sh \
    --fonts_dir /usr/share/fonts \
    --lang jpn \
    --linedata_only \
    --noextract_font_properties \
    --langdata_dir ./langdata \
    --tessdata_dir /usr/local/share/tessdata \
    --output_dir ./output/jpn \
    --training_text ./langdata/jpn/jpn.training_text

  # train
  lstmtraining --model_output ./model_output/ \
    --traineddata /usr/local/share/tessdata/best/jpn.traineddata \
    --old_traineddata /usr/local/share/tessdata/best/jpn.traineddata \
    --continue_from ./output/jpn/jpn.lstm \
    --train_listfile ./output/jpn/jpn.training_files.txt \
    --max_iterations 200 \
    --debug_interval -1 \
    --append_index 5 --net_spec '[Lfx512 O1c1]' \
    --learning_rate 20e-4 &> ./output/jpn/train.log

  # convert the checkpoint to traineddata
  lstmtraining --stop_training \
    --continue_from ./model_output/_checkpoint \
    --traineddata /usr/local/share/tessdata/best/jpn/jpn.traineddata \
    --model_output model_output/jpn.traineddata

Please let me know if any information is missing.
Thank you.
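To make the desired result concrete, here is a minimal sketch of a post-processing step that joins word-level rows back into one string per text line. It is not part of the training setup above and does not change how Tesseract itself segments words. It assumes the standard TSV output columns (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text), where level 5 marks a word row. The function name `join_words_per_line` and the sample rows (coordinates and confidences) are hypothetical and only for illustration.

```python
import csv
import io
from collections import OrderedDict

TSV_COLUMNS = ["level", "page_num", "block_num", "par_num", "line_num",
               "word_num", "left", "top", "width", "height", "conf", "text"]


def join_words_per_line(tsv_text):
    """Join word rows (level 5) of Tesseract TSV output into one string per line."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    lines = OrderedDict()  # keeps lines in the order they first appear
    for row in reader:
        if row["level"] != "5":  # skip page/block/paragraph/line rows
            continue
        text = (row["text"] or "").strip()
        if not text:
            continue
        key = (row["page_num"], row["block_num"], row["par_num"], row["line_num"])
        lines.setdefault(key, []).append(text)
    # Japanese text has no inter-word spaces, so join without a separator.
    return ["".join(words) for words in lines.values()]


def _make_sample_tsv():
    """Build hypothetical TSV text that mimics the word splits described above."""
    samples = [
        (1, ["東京", "都", "渋谷", "区", "桜丘", "町"]),
        (2, ["三菱", "東京", "UFJ", "銀行"]),
    ]
    rows = ["\t".join(TSV_COLUMNS)]
    for line_num, words in samples:
        for word_num, word in enumerate(words, start=1):
            # Coordinates and confidence are made-up placeholder values.
            fields = [5, 1, 1, 1, line_num, word_num,
                      10 + 40 * word_num, 30 * line_num, 38, 24, 90, word]
            rows.append("\t".join(str(f) for f in fields))
    return "\n".join(rows) + "\n"


SAMPLE_TSV = _make_sample_tsv()

if __name__ == "__main__":
    for joined in join_words_per_line(SAMPLE_TSV):
        print(joined)
```

With real data, the TSV text would come from running Tesseract with the `tsv` output config instead of `_make_sample_tsv()`. Joining per line only works when each address or bank name sits on its own line; it is a workaround for the output format, not a fix for the word segmentation.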