Hi, I am trying to train Tesseract for the Sinhala language. I was following the training guidelines <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#creating-starter-traineddata> in the GitHub wiki. I get an error at the 4th step, "Creating Starter Traineddata". This is the command I executed:

    training/combine_lang_model \
      --input_unicharset ../training/sin/sin.unicharset \
      --script_dir ../langdata \
      --words ../langdata/sin/sin.wordlist \
      --puncs ../langdata/sin/sin.punc \
      --numbers ../langdata/sin/sin.numbers \
      --output_dir ../training/combined_sin \
      --version_str 1.0 \
      --lang sin

I get the following output:

    Loaded unicharset of size 94 from file ../training/sin/sin.unicharset
    Setting unichar properties
    Setting script properties
    Warning: properties incomplete for index 4 = ී
    Warning: properties incomplete for index 6 = ි
    Warning: properties incomplete for index 11 = ු
    Warning: properties incomplete for index 15 = ්
    Warning: properties incomplete for index 33 = ූ
    Warning: properties incomplete for index 52 = ්ර
    Warning: properties incomplete for index 56 = ්ය
    Warning: properties incomplete for index 87 = ක්
    Warning: properties incomplete for index 93 = ර්
    Config file is optional, continuing...
    Null char=2
    Invalid format in radical table at line 4: 3400 1.4
    Creation of encoded unicharset failed!!
    Error writing recoder!!
    Reducing Trie to SquishedDawg
    Reducing Trie to SquishedDawg
    Reducing Trie to SquishedDawg

For more information, I have attached my sin.unicharset and sin.config files.

I am using the following Tesseract version:

    tesseract -v
    tesseract 4.00.00dev-696-geba0ae3
     leptonica-1.74.4
      libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.6 : zlib 1.2.8
     Found SSE

I am using the following OS:

    uname -a
    Linux shandigutt-laptop-ubuntu 4.4.0-130-generic #156-Ubuntu SMP Thu Jun 14 08:53:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

I would appreciate it if somebody could help me with this.

Thanks
Attachments: sin.config, sin.unicharset

