Re: ICUTokenizer and CJK

Robert Muir Tue, 23 Nov 2010 03:08:17 -0800

On Mon, Nov 22, 2010 at 6:50 PM, Burton-West, Tom <tburt...@umich.edu> wrote:
> Hi all,
>
> I see in the javadoc for the ICUTokenizer that it has special handling for 
> Lao, Myanmar, Khmer word breaking but no details in the javadoc about what it 
> does with CJK, which for C and J appears to be breaking into unigrams. Is 
> this correct?
>


Yes. Han ideographs are segmented into unigrams, one token per
ideograph (this is the UAX #29 default behavior). I don't know off
the top of my head what the rules are for Japanese kana...
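
To make "unigram" concrete, here is a minimal sketch of the output shape
described above, using only the JDK. It is not ICUTokenizer's
implementation and does not apply the UAX #29 rules: the class name
HanUnigramSketch and its tokenize method are hypothetical, and grouping
every non-Han letter or digit into a run (which is also what happens to
kana here) is an assumption made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class HanUnigramSketch {

    // Illustration only: each Han code point becomes its own token; other
    // letters and digits are grouped into runs; everything else separates
    // tokens. This is NOT how ICUTokenizer is implemented.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                flush(run, tokens);                          // end any pending run
                tokens.add(new String(Character.toChars(cp))); // one ideograph, one token
            } else if (Character.isLetterOrDigit(cp)) {
                run.appendCodePoint(cp);                     // extend the current run
            } else {
                flush(run, tokens);                          // whitespace or punctuation
            }
        }
        flush(run, tokens);
        return tokens;
    }

    // Emit the pending non-Han run, if there is one.
    private static void flush(StringBuilder run, List<String> tokens) {
        if (run.length() > 0) {
            tokens.add(run.toString());
            run.setLength(0);
        }
    }

    public static void main(String[] args) {
        // prints [Lucene, 中, 文, 分, 词]
        System.out.println(tokenize("Lucene 中文分词"));
    }
}
```

To see what ICUTokenizer actually emits for a given string, including
kana, run the text through the tokenizer itself and inspect the tokens
rather than relying on a sketch like this one.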

