Where can I find the source code?

On Tue, Dec 15, 2009 at 9:40 PM, Robert Muir <rcm...@gmail.com> wrote:
> There is an ICU transform TokenFilter in the patch here:
> http://issues.apache.org/jira/browse/LUCENE-1488
>
> Transliterator pinyin = Transliterator.getInstance("Han-Latin");
> Tokenizer tokenizer = new KeywordTokenizer(new StringReader("中国"));
> ICUTransformFilter filter = new ICUTransformFilter(tokenizer, pinyin);
> assertTokenStreamContents(filter, new String[] { "zhōng guó" });
>
> Note that it adds tone marks and inserts a space between syllables by
> default. If you do not want this, you need to do some cleanup:
>
> Transliterator pinyin = Transliterator.getInstance(
>     "Han-Latin; NFD; [[:NonspacingMark:][:Space:]] Remove");
> Tokenizer tokenizer = new KeywordTokenizer(new StringReader("中国"));
> ICUTransformFilter filter = new ICUTransformFilter(tokenizer, pinyin);
> assertTokenStreamContents(filter, new String[] { "zhongguo" });
>
> 2009/12/15 Weiwei Wang <ww.wang...@gmail.com>
>
> > Hi, guys,
> >
> > I'm implementing a search engine for Chinese based on Lucene, and I
> > want to support pinyin search as Google China does.
> >
> > For example, "中国" means "China" in English, and its pinyin input is
> > "zhongguo". The feature I want to implement is that when a user types
> > "zhongguo", the results include documents containing "中国" or even
> > "China".
> >
> > Does anybody here know how to achieve this?
> >
> > --
> > Weiwei Wang
> > Alex Wang
> > 王巍巍
> > Room 403, Mengmin Wei Building
> > Computer Science Department
> > Gulou Campus of Nanjing University
> > Nanjing, P.R.China, 210093
> >
> > Homepage: http://cs.nju.edu.cn/rl/weiweiwang
>
> --
> Robert Muir
> rcm...@gmail.com

--
Weiwei Wang
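
For anyone who wants to see what the cleanup part of the quoted rule ("NFD; [[:NonspacingMark:][:Space:]] Remove") does on its own, here is a rough sketch using only the JDK. It is not the ICU transform or the LUCENE-1488 filter: the Han-Latin step itself still needs ICU, and the class and method names (PinyinCleanup, stripTonesAndSpaces) are made up for illustration. It assumes the input is already toned pinyin such as "zhōng guó", and it approximates the ICU set with the Java regex classes \p{Mn} and \p{IsWhite_Space}.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Lucene or ICU: a JDK-only approximation
// of the "NFD; [[:NonspacingMark:][:Space:]] Remove" cleanup stage.
class PinyinCleanup {

    // Decompose accented letters into base letter + combining mark (NFD),
    // then delete the combining marks and any whitespace.
    static String stripTonesAndSpaces(String pinyin) {
        String decomposed = Normalizer.normalize(pinyin, Normalizer.Form.NFD);
        return decomposed.replaceAll("[\\p{Mn}\\p{IsWhite_Space}]", "");
    }

    public static void main(String[] args) {
        // "zhong guo" with tone marks, written with Unicode escapes
        // (o with macron = U+014D, o with acute = U+00F3).
        String withTones = "zh\u014Dng gu\u00F3";
        System.out.println(stripTonesAndSpaces(withTones));
    }
}
```

Running main should print the toneless, unspaced form that the second assertion in the quoted message expects. Applying the same normalization at both index time and query time is what would let a query typed as plain ASCII pinyin match the indexed token.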