Re: Training japanese for 3.0

Zdenko Podobný Sun, 19 Sep 2010 06:49:22 -0700

 Hi Stane,

why doesn't it look healthy? ;-)
There is one easy way to find out whether it is correct or not: test it ;-)


BTW: when I searched for mistakes in the former wiki (the corrections are
now included in
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3) I
noticed that:
a) unicharset_extractor puts NULL as the script type (maybe I did
something wrong, maybe Google has not submitted the relevant code yet)
b) in unicharset.cpp there is code that works with these scripts: Latin,
Common, Greek, Cyrillic, Han, NULL
c) if you extract the unicharset file from some languages (e.g.
"combine_tessdata -e jpn.traineddata jpn.unicharset" - the Japanese
language file is from svn revision 309) you can also find other
scripts there: Hiragana and Katakana
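
To check point c) on an extracted file, a small script can tally the script column. This is only a rough sketch: it assumes each entry has the four space-separated fields seen in the sample quoted below (character, properties, script, other-case id), and count_scripts is a made-up helper name, not part of tesseract.

```python
from collections import Counter


def count_scripts(unicharset_text):
    """Tally the script field of every entry in a unicharset file.

    Assumed layout (taken from the sample in this thread): the first
    line is the entry count, each following line is
    "<unichar> <properties> <script> <other_case_id>".
    """
    counts = Counter()
    lines = unicharset_text.splitlines()
    for line in lines[1:]:  # skip the leading entry count
        fields = line.split(" ")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


if __name__ == "__main__":
    sample = "3\nNULL 0 NULL 0\n亜 0 Han 0\nあ 0 Hiragana 0\n"
    for script, n in sorted(count_scripts(sample).items()):
        print(script, n)
```

On a real file you would read jpn.unicharset as UTF-8 and pass its contents to count_scripts.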

I do not know whether the OCR result will be better if you replace NULL
with Latin, Common, Han etc. in the unicharset file. If you have time,
please test it and send the results to this forum.
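
If somebody wants to try that experiment, the replacement could be scripted roughly like this. It is a sketch under several assumptions: the four-field line layout from the sample quoted below, a script guessed from the Unicode character name (a heuristic of mine, not what tesseract does internally), and hypothetical helper names guess_script and fill_scripts. The entry whose character field is NULL is left untouched.

```python
import unicodedata

# Heuristic: map Unicode character-name prefixes to the script names
# mentioned in this thread. This is an assumption, not tesseract's logic.
_PREFIX_TO_SCRIPT = [
    ("CJK UNIFIED IDEOGRAPH", "Han"),
    ("HIRAGANA", "Hiragana"),
    ("KATAKANA", "Katakana"),
    ("LATIN", "Latin"),
    ("GREEK", "Greek"),
    ("CYRILLIC", "Cyrillic"),
]


def guess_script(unichar):
    """Guess a script name from the first code point of the entry."""
    name = unicodedata.name(unichar[0], "")
    for prefix, script in _PREFIX_TO_SCRIPT:
        if name.startswith(prefix):
            return script
    return "Common"  # digits, punctuation and anything unrecognised


def fill_scripts(unicharset_text):
    """Replace a NULL script field with a guessed script name."""
    lines = unicharset_text.splitlines()
    if not lines:
        return unicharset_text
    out = [lines[0]]  # the first line is the entry count, keep it as is
    for line in lines[1:]:
        fields = line.split(" ")
        # Assumed layout: unichar, properties, script, other_case id.
        is_entry = len(fields) == 4 and fields[0] not in ("", "NULL")
        if is_entry and fields[2] == "NULL":
            fields[2] = guess_script(fields[0])
        out.append(" ".join(fields))
    return "\n".join(out) + "\n"
```

Keep a copy of the original unicharset so the OCR results with and without the script names can be compared.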

Zd.

On 18.09.2010 13:14, Stane wrote:
> Hi folks,
>
> I am trying to make my own jpn.traineddata for tesseract 3.0, starting
> with just 10 different characters/kanji which are repeated
> a few times and are separated by spaces to make sure they
> get boxed.
>
> With tesseract I create the box file and edit it with pytesseracttrainer
> to make everything nice and correct.
> Next I run tesseract in training mode to get a .tr file. So far so
> good, and everything seems to be correct.
> But when I run unicharset_extractor I get a unicharset which
> looks like this
> "10
> NULL 0 NULL 0
> 亜 0 NULL 0
> ..."
>
> Well, this doesn't look so healthy to me. I wonder if it is supposed to
> be like this, and what did I do wrong? Do I have to create the unicharset
> for Japanese manually?
>
> Thanks for any help :-)
>
