Re: [XeTeX] Word wrapping in Lao

François Charette Wed, 21 Apr 2010 07:07:52 -0700

For computational linguistic applications, where the wrong word boundary
results in a mis-parse, I believe that finding "correct" word boundaries is
still a research problem, and cannot be solved by dictionary lookup alone.
For Thai, which is (I believe) similar to Lao in this respect, you might
have a look at this:
    http://www.cs.cmu.edu/~paisarn/software.html
which implements three algorithms: Longest Matching, Maximal Matching and
Part-of-Speech Bigram. That's a bit old, but it gives some idea of the
depth of the problem.  Or there's a comparison of different approaches for
Thai (which I believe dates from 2008) here:
    http://www.cs.ait.ac.th/~mdailey/papers/Choochart-Wordseg.pdf
If you want more, try googling 'word segmentation thai' (you can google
for Lao too, but it appears there has been much more research on word
segmentation for Thai).

Thanks for these interesting links.


I am also aware of this:

    http://linux.thai.net/pub/thailinux/cvs/software/cttex/
(which is also packaged in Debian). It is another dictionary-based tool
for finding Thai wordbreaks. I have actually used it to generate the
wordbreak macros in the file example-thai.tex that comes with polyglossia.
I don't know which algorithm it relies upon (probably "longest matching").
However, the approach suggested by Jonathan (namely the ICU implementation
via \XeTeXlinebreaklocale "th") may actually be superior to the above. It
is in any case the most convenient one for XeTeX users, as it relieves
them of the need for a preprocessor.
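
For reference, a minimal sketch of the ICU route with XeLaTeX. The font
name is only an assumption (any installed font with Thai coverage should
do), and the skip value is an illustrative choice, not a recommendation:

```latex
% Compile with xelatex.
\documentclass{article}
\usepackage{fontspec}
% Assumption: "Norasi" is just an example; substitute any installed Thai font.
\setmainfont{Norasi}
\begin{document}
% Let XeTeX find line-break opportunities using ICU's rules for Thai.
\XeTeXlinebreaklocale "th"
% Glue inserted at each break opportunity; the stretch helps justification.
\XeTeXlinebreakskip = 0pt plus 1pt
% Plain Thai text, with no wordbreak macros and no preprocessor pass.
ภาษาไทยเป็นภาษาราชการของประเทศไทย
\end{document}
```

With this setup the source stays unmarked Thai text, whereas the cttex
route needs the input to be run through the preprocessor first.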

BTW, I just checked the latest sources of ICU4C: there is indeed no such
implementation for Lao yet (nor for Khmer or Myanmar afaics). I am however
puzzled by the fact that the ICU source tarball does not appear to provide
a Thai dictionary for word-breaking purposes, even though the engine
implies the availability of such a dictionary (I expected a file like
"thaidict.brk" somewhere, which is mentioned in
source/tools/genrb/genrb.c). Or did I miss something?


FC


--------------------------------------------------
Subscriptions, Archive, and List information, etc.:
 http://tug.org/mailman/listinfo/xetex
