Re: Training japanese for 3.0

Zdenko Podobný Sun, 19 Sep 2010 09:53:03 -0700

On 19.09.2010 16:01, Jimmy O'Regan wrote:
> 2010/9/19 Zdenko Podobný <zde...@gmail.com>:
>> Hi Stane,
>>
>> why doesn't it look healthy? ;-)
>> There is one easy way to find out whether it is correct or not: test it ;-)
>>
>> BTW: when I searched for mistakes in the former wiki (the corrections are
>> now included in http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3)
>> I noticed that:
>> a) unicharset_extractor puts NULL as the script type (maybe I did something
>> wrong, maybe Google has not submitted the relevant code yet)
> Probably the latter. There are, for example, function prototypes for a
> whole other OCR engine (called 'Cube', IIRC), for which there's no
> matching code.
>
Do you have any info on when (or if) they plan to submit new code?
>> b) in unicharset.cpp there is code that works with these scripts: Latin,
>> Common, Greek, Cyrillic, Han, NULL
> There are more than that. For one, Fraktur is considered a script of its own.
>
Thanks for the info. I expected everything related to scripts to be in
unicharset.cpp. Other scripts are in osdetect.cpp (if anybody is
interested).
>> c) if you extract unicharset files from some languages (e.g.
>> "combine_tessdata -e jpn.traineddata jpn.unicharset"; the Japanese language
>> file is from svn revision 309) you can also find other scripts there:
>> Hiragana and Katakana
>>
> Yes, those are mentioned in part of the code. What /seems/ to be there
> is an image-based script detection mechanism (the usual mechanism is
> to guess the script based on the types of mistakes) but I haven't seen
> it used.
>
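If anybody wants to check which scripts show up in an extracted unicharset
(point c above), here is a minimal Python sketch. It is not part of
Tesseract. It assumes the first line of the file holds the entry count and
that every following line is whitespace-separated with the script name as
the third field; that layout is an assumption and may differ between
versions. The function name count_scripts and the sample entries (including
the property values) are made up for illustration.

```python
from collections import Counter


def count_scripts(unicharset_text):
    """Count script names in unicharset text.

    Assumed layout (not verified against every Tesseract version):
    line 1 is the number of entries, and each later line is
    '<unichar> <properties> <script> <other field>'.
    """
    counts = Counter()
    # Skip the first line, which is assumed to be the entry count.
    for line in unicharset_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore blank or unexpected lines
        counts[fields[2]] += 1  # third field is assumed to be the script
    return counts


# Hypothetical sample data, not copied from a real jpn.unicharset.
sample = """5
NULL 0 NULL 0
a 3 Latin 1
あ 1 Hiragana 2
ア 1 Katakana 3
日 1 Han 4
"""

if __name__ == "__main__":
    for script, n in sorted(count_scripts(sample).items()):
        print(script, n)
```

To try it on a real file, read jpn.unicharset as UTF-8 text and pass the
contents to count_scripts instead of the sample string.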

