Hello community, I am evaluating indexing strategies for CJK text. I am comparing "unigram", "bigram", "unigram + bigram", and word-based indexing.
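To make the terminology concrete, here is a minimal sketch of what the three n-gram strategies are meant to produce for a short string such as 東京都 (written with Unicode escapes in the code). It is plain Java, not Lucene code: the class `NGramSketch` and its method names are hypothetical, and the token order is only illustrative. Real analyzers may behave differently at the edges, for example in how they treat a run consisting of a single character.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not how any Lucene analyzer is implemented.
class NGramSketch {

    // "unigram": one token per character (per code point, so
    // supplementary-plane characters are not split in half).
    static List<String> unigrams(String text) {
        List<String> out = new ArrayList<>();
        text.codePoints().forEach(cp -> out.add(new String(Character.toChars(cp))));
        return out;
    }

    // "bigram": one token per pair of adjacent characters, overlapping.
    // Assumption of this sketch: a single-character input yields no tokens.
    static List<String> bigrams(String text) {
        List<String> uni = unigrams(text);
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < uni.size(); i++) {
            out.add(uni.get(i) + uni.get(i + 1));
        }
        return out;
    }

    // "unigram + bigram": both token types, interleaved by start position.
    static List<String> unigramsAndBigrams(String text) {
        List<String> uni = unigrams(text);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < uni.size(); i++) {
            out.add(uni.get(i));
            if (i + 1 < uni.size()) {
                out.add(uni.get(i) + uni.get(i + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "\u6771\u4EAC\u90FD"; // the three characters of the example string
        System.out.println(unigrams(text));
        System.out.println(bigrams(text));
        System.out.println(unigramsAndBigrams(text));
    }
}
```

For a run of n characters this gives n unigrams, n - 1 bigrams, and 2n - 1 tokens for the combined strategy.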
1. I used the StandardAnalyzer for "unigram". I think it works for Chinese, but it does something different for Japanese and Korean. In Japanese some characters get combined into one token, and for Korean it behaves like a WhitespaceAnalyzer, right? Which analyzer would you recommend for "unigrams" in Japanese and Korean? Is there a flag in the CJKAnalyzer to output "unigrams" only?

2. I used the CJKAnalyzer for "bigrams" and "unigrams + bigrams". I think it works correctly, but I have a performance issue: the query time for "unigram + bigram" is about 8 to 20 times higher than for "bigram" only. Any ideas?

Thank you.