Re: [XeTeX] Word wrapping in Lao

maxwell Fri, 16 Apr 2010 08:02:29 -0700

On Fri, 16 Apr 2010 22:34:20 +1000, Andrew Cunningham
<lang.supp...@gmail.com> wrote:
> The south-east asian scripts I tend to work with at the moment, break
> at:
> 
> * punctuation
> * phrase boundaries (indicated by white space)
> * word boundaries (no spaces at word boundaries, except when a word
> boundary is a phrase boundary)
> 
> Word segmentation would need a dictionary lookup, probably using
> longest word matching.


For computational linguistic applications, where the wrong word boundary
results in a mis-parse, I believe that finding "correct" word boundaries is
still a research problem, and cannot be solved by dictionary lookup alone. 
For Thai, which is (I believe) similar to Lao in this respect, you might
have a look at this:
   http://www.cs.cmu.edu/~paisarn/software.html
which implements three algorithms: Longest Matching, Maximal Matching and
Part-of-Speech Bigram. That's a bit old, but it gives some idea of the
depth of the problem.  Or there's a comparison of different approaches for
Thai (which I believe dates from 2008) here:
   http://www.cs.ait.ac.th/~mdailey/papers/Choochart-Wordseg.pdf
If you want more, try googling 'word segmentation thai' (you can google
for Lao too, but it appears there has been much more research on word
segmentation for Thai).
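
To give a feel for how maximal matching differs from plain longest matching, here is a minimal sketch in Python. It is not taken from the software linked above. It assumes "maximal matching" means choosing the segmentation with the fewest dictionary words, which is how the term is usually used in the Thai literature as far as I know. The function name, the toy Latin-letter lexicon, and the unknown-character penalty are all made up for illustration; a real Lao or Thai dictionary would replace the lexicon.

```python
def maximal_match_segment(text, lexicon, unknown_penalty=10):
    """Segment text into the fewest lexicon words (dynamic programming).

    Characters not covered by any lexicon word become one-character
    tokens, each charged `unknown_penalty` (an arbitrary, assumed value).
    """
    n = len(text)
    max_len = max(1, max((len(w) for w in lexicon), default=1))
    inf = float("inf")
    cost = [0] + [inf] * n   # cost[i]: best cost of segmenting text[:i]
    back = [0] * (n + 1)     # back[i]: start index of the last token in text[:i]

    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            if piece in lexicon:
                candidate = cost[j] + 1
            elif i - j == 1:
                candidate = cost[j] + unknown_penalty
            else:
                continue
            if candidate < cost[i]:
                cost[i] = candidate
                back[i] = j

    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens = []
    i = n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    tokens.reverse()
    return tokens


# Toy lexicon standing in for a real dictionary.
toy_lexicon = {"ab", "abc", "cde", "d", "e"}
print(maximal_match_segment("abcde", toy_lexicon))  # ['ab', 'cde']
```

On this toy input a greedy longest-first pass would take "abc" and then be left with "d" and "e" (three words), whereas the fewest-words criterion prefers "ab" + "cde". Neither criterion uses any grammatical information, which is part of why dictionary lookup alone falls short for parsing.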

Of course, you might not need to do as accurate a job of word breaking for
typesetting as you would for syntactic parsing, say.  In fact, it might be
that a fairly naive method--like longest word matching--would do better
than the average human (who speaks and writes Lao :-)).
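
For reference, the naive longest word matching mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the toy Latin-letter lexicon are hypothetical, and a real Lao word list would be needed in practice. Characters that match no dictionary entry are emitted as single-character tokens so the scan always advances.

```python
def longest_match_segment(text, lexicon):
    """Greedy left-to-right segmentation: at each position take the
    longest lexicon entry that matches, else a single character."""
    max_len = max(1, max((len(w) for w in lexicon), default=1))
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            j = i + 1  # no lexicon entry starts here: emit one character
        tokens.append(text[i:j])
        i = j
    return tokens


# Toy lexicon standing in for a real dictionary.
toy_lexicon = {"ab", "abc", "cd", "d"}
print(longest_match_segment("abcd", toy_lexicon))   # ['abc', 'd']
print(longest_match_segment("abxcd", toy_lexicon))  # ['ab', 'x', 'cd']
```

For typesetting, the token boundaries returned here would simply be treated as permissible line-break points, so an occasional wrong split only matters if a line happens to break there.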

   Mike Maxwell


--------------------------------------------------
Subscriptions, Archive, and List information, etc.:
  http://tug.org/mailman/listinfo/xetex
