Hi Nick,

Thanks a lot! I was not sure whether Tesseract already supported the Tibetan language, which is why I asked. :) I will get started with your and sriranga's suggestions and see how far I can get.
Yizhen

On Wednesday, November 25, 2015 at 6:12:06 PM UTC+8, Nick White wrote:
>
> Hi Yizhen,
>
> On Tue, Nov 24, 2015 at 07:08:24PM -0800, Yizhen Hai wrote:
> > I am working on a volunteer project to digitize the Sutra and all related
> > materials, most of them in Tibetan.
>
> Sounds like a great project :)
>
> > Therefore, I wonder how I can get help to use Tesseract for Tibetan. (I am new
> > to both OCR and Tesseract and the only programming language I know is R.) I
> > have no idea how to get started, training Tesseract for a new language?
>
> Are you sure Tesseract doesn't already support the Tibetan language
> you need? I know almost nothing about Tibetan, but I see in the
> langdata[0] repository (which is used to build the official training
> files) a Tibetan.unicharset file, which implies it probably does
> have support. Look in the tessdata repository[1] for the ISO 639
> code of the language(s) you're interested in.
>
> I quickly compared the ISO 639 codes from this wikipedia page[2]
> with the tessdata and bod (Lhasa Tibetan) is the only one there that
> I see available. But maybe it's the language you want anyway?
>
> > And what if the image contains both Chinese and Tibetan? Please
> > give me some hints.
>
> Tesseract can be told to expect multiple languages in an image,
> using a plus in the language argument (e.g. '-l eng+spa').
>
> Hope that's helpful.
>
> Nick
>
> 0. https://github.com/tesseract-ocr/langdata
> 1. https://github.com/tesseract-ocr/tessdata
> 2. https://en.wikipedia.org/wiki/Central_Tibetan_language
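For the mixed Chinese and Tibetan case, the plus-separated language argument Nick describes could be scripted roughly as follows. This is only a minimal sketch in Python, not something from the thread: the function names and the file names `page.png` and `page` are hypothetical, and it assumes the `tesseract` executable is on the PATH with the `bod` and `chi_sim` traineddata files installed (`chi_sim` being the tessdata name for Simplified Chinese; substitute `chi_tra` for Traditional).

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, languages):
    """Return the argument list for a multi-language tesseract run.

    Language codes are joined with '+', the same form as '-l eng+spa'.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


def run_ocr(image_path, output_base, languages):
    """Run tesseract and return the path of the text file it writes.

    Assumes tesseract is installed; by default it writes <output_base>.txt.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(
        build_tesseract_command(image_path, output_base, languages),
        check=True,
    )
    return output_base + ".txt"


if __name__ == "__main__":
    # Hypothetical file names; prints the command without running it.
    command = build_tesseract_command("page.png", "page", ["bod", "chi_sim"])
    print(" ".join(command))
```

Calling `run_ocr("page.png", "page", ["bod", "chi_sim"])` would then invoke the same command for real. The order of the codes may matter for results, so it is probably worth trying both `bod+chi_sim` and `chi_sim+bod` on a sample page.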

