Hi Stane, why doesn't it look healthy? ;-) There is one easy way to find out whether it is correct or not: test it ;-)
BTW: when I searched for mistakes in the former wiki (the corrections are now included in http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3), I noticed that:

a) unicharset_extractor puts NULL in the script field (maybe I did something wrong, maybe Google has not submitted the relevant code yet)
b) in unicharset.cpp there is code that works with these scripts: Latin, Common, Greek, Cyrillic, Han, NULL
c) if you extract the unicharset files from some languages (e.g. "combine_tessdata -e jpn.traineddata jpn.unicharset"; the Japanese language file is from svn revision 309), you can also find other scripts there: Hiragana and Katakana

I do not know whether the OCR result will be better if you replace NULL with Latin, Common, Han, etc. in the unicharset file. If you have time, please test it and send the results to this forum.

Zd.

On 18.09.2010 13:14, Stane wrote:
> Hi folks,
>
> I am trying to make my own jpn.traineddata for tesseract 3.0, starting
> with just 10 different characters/kanji which repeat a few times and
> are separated by a space to make sure they get boxed.
>
> With tesseract I create the box file and edit it with pytesseracttrainer
> to make everything nice and correct.
> Next I run tesseract in training mode to get a .tr file. So far so
> good, and everything seems to be correct.
> But when I run the unicharset_extractor I get a unicharset which
> looks like this:
> "10
> NULL 0 NULL 0
> 亜 0 NULL 0
> ..."
>
> Well, this doesn't look so healthy to me. I wonder if it is supposed to
> be like this, and what did I do wrong? Do I have to create the unicharset
> for Japanese manually?
>
> Thanks for any help :-)
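
For anyone who wants to try the experiment suggested in the reply (replacing NULL in the script field with a script name), here is a minimal Python sketch. It assumes each entry line has the four space-separated fields seen in the sample above (character, properties, script, other-case id) and that the first line is the entry count. The helper names `guess_script` and `fill_scripts` and the name-prefix mapping are my own illustration, not part of Tesseract, and whether the result improves OCR is exactly what still needs testing.

```python
import unicodedata

# Illustrative mapping (an assumption, not Tesseract's own logic) from the
# start of a Unicode character name to the script names mentioned in the post.
NAME_PREFIX_TO_SCRIPT = [
    ("CJK UNIFIED IDEOGRAPH", "Han"),
    ("HIRAGANA", "Hiragana"),
    ("KATAKANA", "Katakana"),
    ("LATIN", "Latin"),
    ("GREEK", "Greek"),
    ("CYRILLIC", "Cyrillic"),
]


def guess_script(unichar):
    """Guess a script name from the Unicode name of the first character."""
    if not unichar:
        return "Common"
    name = unicodedata.name(unichar[0], "")
    for prefix, script in NAME_PREFIX_TO_SCRIPT:
        if name.startswith(prefix):
            return script
    # Digits, punctuation and anything unrecognised fall back to Common.
    return "Common"


def fill_scripts(text):
    """Return unicharset text with NULL script fields replaced by a guess."""
    lines = text.splitlines()
    out = lines[:1]  # first line is the entry count, keep it unchanged
    for line in lines[1:]:
        fields = line.split(" ")
        # Leave the special "NULL 0 NULL 0" entry and unexpected lines alone.
        if len(fields) == 4 and fields[0] != "NULL" and fields[2] == "NULL":
            fields[2] = guess_script(fields[0])
        out.append(" ".join(fields))
    return "\n".join(out) + "\n"


# Small sample in the same shape as the file quoted in this thread.
SAMPLE = "3\nNULL 0 NULL 0\n亜 0 NULL 0\nあ 0 NULL 0\n"
print(fill_scripts(SAMPLE), end="")
```

To apply it to a real file, read the unicharset as UTF-8, pass the text through `fill_scripts`, and write the result to a new file so the original can be kept for comparison.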