On 19.09.2010 16:01, Jimmy O'Regan wrote:
> 2010/9/19 Zdenko Podobný <zde...@gmail.com>:
>> Hi Stane,
>>
>> why doesn't it look healthy? ;-)
>> There is one easy way to find out whether it is correct or not: test it ;-)
>>
>> BTW: when I searched for mistakes in the former wiki (the corrections are now
>> included in http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3)
>> I noticed that:
>> a) unicharset_extractor puts NULL as the script type (maybe I did something
>> wrong, maybe Google has not submitted the relevant code yet)
> Probably the latter. There are, for example, function prototypes for a
> whole other OCR engine (called 'Cube', IIRC), for which there's no
> matching code.

Do you have any info on when (or if) they plan to submit new code?

>> b) in unicharset.cpp there is code that works with these scripts: Latin,
>> Common, Greek, Cyrillic, Han, NULL
> There are more than that. For one, Fraktur is considered a script of its own.

Thanks for the info. I expected everything related to scripts to be in unicharset.cpp. The other scripts are in osdetect.cpp (if anybody is interested).

>> c) if you extract unicharset files from some languages (e.g.
>> "combine_tessdata -e jpn.traineddata jpn.unicharset" - the Japanese language
>> file is from svn revision 309) you can also find other scripts there:
>> Hiragana and Katakana
>>
> Yes, those are mentioned in part of the code. What /seems/ to be there
> is an image-based script detection mechanism (the usual mechanism is
> to guess the script based on the types of mistakes) but I haven't seen
> it used.
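
If anybody wants to check which scripts an extracted unicharset file declares (as in point c above), a rough Python sketch follows. It is not part of Tesseract. The function name count_scripts is made up for illustration, and the sample lines are invented rather than taken from a real jpn.unicharset. It assumes the first line holds the entry count and that the script name is the third whitespace-separated field of each following line, as in the "NULL 0 NULL 0" style entries. Other unicharset versions may put the script in a different column, so the field index is a parameter.

```python
from collections import Counter


def count_scripts(lines, script_field=2):
    """Count script names in unicharset-style lines.

    Assumption: the first line is the number of entries, and each
    following line has the script name at index `script_field`
    (0-based) after splitting on whitespace. Adjust the index if
    your unicharset version uses a different column layout.
    """
    it = iter(lines)
    next(it, None)  # skip the leading entry-count line
    counts = Counter()
    for line in it:
        parts = line.split()
        if len(parts) > script_field:
            counts[parts[script_field]] += 1
    return counts


# Hypothetical sample data, made up for illustration only.
sample = [
    "5",
    "NULL 0 NULL 0",
    "a 3 Latin 2",
    "A 5 Latin 1",
    "\u3042 1 Hiragana 3",
    "\u30a2 1 Katakana 4",
]

for script, n in sorted(count_scripts(sample).items()):
    print(script, n)
```

To run it on a real file, pass the opened file instead of the sample list, e.g. count_scripts(open("jpn.unicharset", encoding="utf-8")), after extracting the file with the combine_tessdata command quoted above.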