[tesseract-ocr] Fine-tuning tha.traineddata with characters that are not in the original unicharset file

Unnop Paripunnang Mon, 19 Sep 2022 21:52:34 -0700

I would like to fine-tune the Tesseract traineddata for the Thai language 
(tha). Unfortunately, after extracting the original tha.traineddata from the 
official tessdata_best repository, I found that some characters are missing 
from tha.unicharset. For example, the Thai digits ๐ ๑ ๒ ๓ ๕ (0 1 2 3 5) 
appear in tha.unicharset, but ๔ ๖ ๗ ๘ ๙ (4 6 7 8 9) are missing.
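
A check like this can be scripted. The following is a minimal sketch, not 
part of Tesseract: it assumes the unicharset is a text file whose first line 
is the entry count and whose remaining lines each start with the unichar 
followed by a space, with NULL standing for the space character. The helper 
names and the shortened sample data are hypothetical (real entries carry more 
property fields), and the check is per code point only.

```python
def parse_unicharset(text):
    """Return the set of unichars listed in a unicharset file's contents."""
    entries = set()
    # Skip the first line, which is assumed to hold the number of entries.
    for line in text.splitlines()[1:]:
        if not line.strip():
            continue
        token = line.split(" ", 1)[0]
        # Assumption: the NULL entry stands for the space character.
        entries.add(" " if token == "NULL" else token)
    return entries


def missing_characters(training_text, unichars):
    """Return the non-whitespace characters of the text absent from unichars."""
    return sorted(
        {ch for ch in training_text if not ch.isspace() and ch not in unichars}
    )


# Hypothetical, shortened sample; not copied from the real tha.unicharset.
sample_unicharset = "4\nNULL 0 Common 0\n๐ 8 Thai 1\n๑ 8 Thai 2\n๒ 8 Thai 3\n"
sample_text = "๐๑๒ ๔๗"

print(missing_characters(sample_text, parse_unicharset(sample_unicharset)))
# → ['๔', '๗']
```

With real files, the two strings would be read from tha.unicharset and 
tha.training_text (opened with encoding="utf-8") instead of the inline samples.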


I have tried to fine-tune the original tha.traineddata so that it includes 
the missing characters. When I train from the original tha.traineddata with a 
new tha.training_text that contains the missing characters, "Can't encode 
transcription:" is printed throughout the training process for every line 
that contains a missing character.

Example:
Can't encode transcription: 
'๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗๗'
 
in language '' Encoding of string failed! Failure bytes: e0 b9 98 e0 b9 98 
e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 
b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 
98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 
e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 
b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 
98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 
e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 
b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 
98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 
e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 
b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 
98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 
e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 
b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 
98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 
e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 e0 
b9 98 e0 b9 98
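
The "Failure bytes" are UTF-8, so they can be decoded to see which character 
could not be encoded. This is a small sketch using only the Python standard 
library; the function name is hypothetical, and the input is a short excerpt 
of the bytes quoted above.

```python
def decode_failure_bytes(hex_bytes):
    """Decode a space-separated hex dump of UTF-8 bytes into distinct characters."""
    # bytes.fromhex ignores the whitespace between byte pairs.
    text = bytes.fromhex(hex_bytes).decode("utf-8")
    # Report each distinct character with its Unicode code point.
    return sorted({(ch, f"U+{ord(ch):04X}") for ch in text})


print(decode_failure_bytes("e0 b9 98 e0 b9 98 e0 b9 98"))
# → [('๘', 'U+0E58')]
```

For the bytes quoted above this gives ๘ (U+0E58, Thai digit 8).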

My question is: is there any way to fine-tune tha.traineddata so that it 
includes characters that are not in the unicharset file extracted from 
tha.traineddata? Any advice on how to do this would be appreciated.

