On Mon, Nov 22, 2010 at 6:50 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
> Hi all,
>
> I see in the javadoc for the ICUTokenizer that it has special handling for
> Lao, Myanmar, Khmer word breaking but no details in the javadoc about what it
> does with CJK, which for C and J appears to be breaking into unigrams. Is
> this correct?
>

The Han ideographs are segmented into unigrams (this is the UAX #29 default
behavior). I don't know off the top of my head what the rules are for
Japanese kana...
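
To make "segmented into unigrams" concrete, here is a minimal sketch in plain
Java that imitates that behavior for Han text. It does not use ICUTokenizer
or any Lucene or ICU class. The class name HanUnigramSketch and its tokenize
method are hypothetical names made up for this illustration. The sketch
assumes a deliberately simplified rule: every code point whose Unicode script
is Han becomes its own token, and any other run of letters or digits is kept
together. That grouping of non-Han runs (kana included) is an arbitrary
choice of the sketch, not a statement about what ICUTokenizer or UAX #29 does
with kana.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: this is NOT ICUTokenizer and not UAX #29.
class HanUnigramSketch {

    // Emits each Han code point as a single-character token and groups
    // every other run of letters/digits into one token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                flush(run, tokens);                          // end any pending non-Han run
                tokens.add(new String(Character.toChars(cp))); // Han ideograph: unigram
            } else if (Character.isLetterOrDigit(cp)) {
                run.appendCodePoint(cp);                     // extend the non-Han run
            } else {
                flush(run, tokens);                          // whitespace/punctuation ends a run
            }
        }
        flush(run, tokens);
        return tokens;
    }

    // Moves the pending run (if any) into the token list and clears it.
    private static void flush(StringBuilder run, List<String> tokens) {
        if (run.length() > 0) {
            tokens.add(run.toString());
            run.setLength(0);
        }
    }

    public static void main(String[] args) {
        // The escapes spell four Han ideographs (U+4E2D U+6587 U+5206 U+8BCD).
        System.out.println(tokenize("Lucene \u4e2d\u6587\u5206\u8bcd 4"));
    }
}
```

With this input the sketch yields six tokens: "Lucene", then the four
ideographs as separate tokens, then "4". To see what ICUTokenizer itself
produces, run the text through the real analyzer and inspect its token
stream.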