I would like to fine-tune the Tesseract traineddata for the Thai language (tha). Unfortunately, after extracting the original tha.traineddata from the official Tesseract tessdata_best, I found that some characters are missing from tha.unicharset. For example, the Thai digits ๐ ๑ ๒ ๓ ๕ (0 1 2 3 5) appear in tha.unicharset, but ๔ ๖ ๗ ๘ ๙ (4 6 7 8 9) are missing.
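To make the check above reproducible, here is a minimal Python sketch of how the training text can be compared against the extracted unicharset. It assumes the unicharset is a text file whose first line is the entry count and whose remaining lines each start with the unichar followed by a space and its properties. The function names `load_unichars` and `missing_characters` are my own and not part of Tesseract, and the comparison is per code point only, so it may over-report if the unicharset stores multi-character entries.

```python
def load_unichars(unicharset_text):
    """Return the set of unichar strings listed in a unicharset file's text."""
    lines = unicharset_text.splitlines()
    # Assumed layout: line 1 is the entry count; every later line begins
    # with the unichar itself, then a space, then its properties.
    return {line.split(" ", 1)[0] for line in lines[1:] if line.strip()}


def missing_characters(unicharset_text, training_text):
    """Return the sorted characters of training_text absent from the unicharset."""
    known = load_unichars(unicharset_text)
    # Whitespace is skipped: it is not listed as a literal character.
    return sorted(
        {ch for ch in training_text if not ch.isspace() and ch not in known}
    )


# Usage (the file names are the ones from this post; adjust the paths):
#   with open("tha.unicharset", encoding="utf-8") as f:
#       unicharset_text = f.read()
#   with open("tha.training_text", encoding="utf-8") as f:
#       training_text = f.read()
#   print(missing_characters(unicharset_text, training_text))
```

Every character this reports is one that the training run should fail to encode with the unmodified unicharset.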
I tried to fine-tune the original tha.traineddata so that it includes the missing characters. However, when I train from the original tha.traineddata with a new tha.training_text that contains the missing characters, "Can't encode transcription:" is printed throughout the whole training process, once for every line that contains a missing character. Example:

Can't encode transcription: '๗๗๗๗๗๗๗๗๗๗...' in language ''
Encoding of string failed! Failure bytes: e0 b9 98 e0 b9 98 e0 b9 98 e0 b9 98 ...

(The transcription is a long run of the same character, and the three bytes e0 b9 98 repeat to the end of the message; both are shortened here.)

My question is: is there any way to fine-tune tha.traineddata on text containing characters that are not in the unicharset file extracted from tha.traineddata? Any advice would be appreciated.
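In case it helps with diagnosis, the "Failure bytes" in the log are UTF-8 and can be decoded to see exactly which character the encoder rejected. A small Python sketch, using only the standard library (`decode_failure_bytes` and `describe` are hypothetical helper names of my own):

```python
import unicodedata


def decode_failure_bytes(hex_dump):
    """Decode a space-separated hex dump of UTF-8 bytes into text."""
    # bytes.fromhex skips the spaces between the byte pairs.
    return bytes.fromhex(hex_dump).decode("utf-8")


def describe(text):
    """List each distinct character with its code point and Unicode name."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in sorted(set(text))
    ]


rejected = decode_failure_bytes("e0 b9 98 e0 b9 98 e0 b9 98")
print(describe(rejected))  # [('๘', 'U+0E58', 'THAI DIGIT EIGHT')]
```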