The CJKAnalyzer is too simple for our needs. But thanks for the suggestion anyway.
Cheers,
Cedric

On Nov 9, 2007 10:43 PM, Open Study <[EMAIL PROTECTED]> wrote:
> Hi Cedric,
>
> You may try the CJKAnalyzer from the Lucene sandbox. It doesn't give
> a perfect solution for Chinese word segmentation, but it will solve the
> problem in your case.
>
> On Nov 9, 2007 10:59 AM, Cedric Ho <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > We are having an issue while indexing Chinese documents in Lucene.
> >
> > Some background first: since CJK languages don't have spaces between
> > words, we first have to determine the word boundaries in each sentence.
> > For example, a sentence containing the characters ABC may be segmented
> > into AB, C or into A, BC.
> >
> > The problem is that sometimes there are ambiguities in how a sentence
> > should be segmented: it is possible that both AB, C and A, BC are
> > valid segmentations.
> >
> > In these cases we would like to index both segmentations:
> >
> > AB offset (0,1) position 0
> > C  offset (2,2) position 1
> > A  offset (0,0) position 0
> > BC offset (1,2) position 1
> >
> > Now the problem is that when someone searches with a PhraseQuery (A C),
> > it will match the text ABC, because it finds A at position 0 and C at
> > position 1.
> >
> > Is there any way to search for an exact match using the offset
> > information instead of the position information?
> >
> > Best Regards,
> > Cedric
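
For reference, the token layout described above can be reproduced with a custom TokenStream roughly like the sketch below. This is only an illustration: the attribute-based API shown comes from Lucene releases newer than this thread, the class name and hard-coded token data are made up for the ABC example, and offsets here follow Lucene's start-inclusive / end-exclusive convention rather than the inclusive pairs quoted above.

import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Hypothetical stream that hard-codes both segmentations of the
// three-character text "ABC" from the example above.
public final class AmbiguousSegmentationStream extends TokenStream {

    // term, start offset (inclusive), end offset (exclusive), position increment
    private final String[] terms  = { "AB", "A", "BC", "C" };
    private final int[]    starts = { 0,    0,   1,    2   };
    private final int[]    ends   = { 2,    1,   3,    3   };
    private final int[]    incrs  = { 1,    0,   1,    0   };  // 0 = stack on previous position

    private int i = 0;

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    private final PositionIncrementAttribute posIncrAtt =
            addAttribute(PositionIncrementAttribute.class);

    @Override
    public boolean incrementToken() throws IOException {
        if (i >= terms.length) {
            return false;
        }
        clearAttributes();
        termAtt.setEmpty().append(terms[i]);       // AB / A / BC / C
        offsetAtt.setOffset(starts[i], ends[i]);   // character offsets into "ABC"
        posIncrAtt.setPositionIncrement(incrs[i]); // AB and A share position 0, BC and C share position 1
        i++;
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        i = 0;
    }
}

With this layout, A and AB land on position 0 and C and BC on position 1, so a PhraseQuery for "A" followed by "C" matches the document even though those characters are not adjacent in the original text: phrase matching compares positions only and never consults the stored offsets, which is exactly the false positive described in the thread.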