2010/4/21 François Charette <firmi...@ankabut.net>:
>
>> For computational linguistic applications, where the wrong word boundary
>> results in a mis-parse, I believe that finding "correct" word boundaries is
>> still a research problem, and cannot be solved by dictionary lookup alone.
>> For Thai, which is (I believe) similar to Lao in this respect, you might
>> have a look at this:
>> http://www.cs.cmu.edu/~paisarn/software.html
>> which implements three algorithms: Longest Matching, Maximal Matching and
>> Part-of-Speech Bigram. That's a bit old, but it gives some idea of the
>> depth of the problem. Or there's a comparison of different approaches for
>> Thai (which I believe dates from 2008) here:
>> http://www.cs.ait.ac.th/~mdailey/papers/Choochart-Wordseg.pdf
>> If you want more, try googling 'word segmentation thai' (you can google
>> for Lao too, but it appears there has been much more research on word
>> segmentation for Thai).
>>
>
> Thanks for these interesting links.
>
> I am also aware of this:
> http://linux.thai.net/pub/thailinux/cvs/software/cttex/ (which is also
> packaged in Debian). It is another dictionary-based tool for finding Thai
> wordbreaks. I have actually used it to generate wordbreak macros in the file
> example-thai.tex that comes with polyglossia. I don't know which algorithm
> it relies upon (probably "longest matching"). However the approach suggested
> by Jonathan (namely the ICU implementation via \XeTeXlinebreaklocale "th")
> may actually be superior to the above. It is in any case the most convenient
> one for XeTeX users, as it relieves from the necessity of using a
> preprocessor.
>
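For anyone unfamiliar with the "longest matching" approach mentioned above, here is a minimal sketch in Python. It is only my illustration of the general greedy idea, not the code used by cttex, ICU, or the CMU package. The function name, the fallback of emitting a single character when nothing matches, and the toy Latin-script dictionary are all assumptions made for the example.

```python
def longest_matching(text, dictionary):
    """Greedy left-to-right segmentation: at each position, take the
    longest dictionary word that starts there.

    Hypothetical helper for illustration only. If no dictionary word
    matches, a single character is emitted so the scan always advances.
    """
    max_len = max((len(w) for w in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                break
        else:
            # No dictionary word starts at position i.
            j = i + 1
        words.append(text[i:j])
        i = j
    return words


# Toy dictionary standing in for a real Thai or Lao word list.
toy_dictionary = {"the", "them", "theme", "men"}
print(longest_matching("themen", toy_dictionary))
```

With this toy dictionary the call returns ['theme', 'n'], although "the" + "men" would segment the string entirely into dictionary words. That is the kind of failure that makes pure dictionary lookup insufficient.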
The following document makes some interesting contributions to this discussion:
http://www.unifont.org/textlayout/TheBigPicture.pdf

In short, other people are also interested in solving this problem, in order to provide proper internationalization in the major FLOSS applications.

Here is a research paper on Lao line breaking in particular:
http://www.tcllab.org/events/uploads/valaxay-lao.pdf

Cheers,
-Sam