On Mon, Nov 30, 2009 at 2:53 PM, Shai Erera <ser...@gmail.com> wrote:
> Robert, what if I need to do additional filtering after CollationKeyFilter, > like stopwords removal, abbreviations handling, stemming etc? Will that be > possible if I use CollationKeyFilter? > > Shai, great point. This won't work with Collation, because it makes binary keys. This is why I think unicode case folding is still useful. I have a patch to do this located in LUCENE-1488 (and it looks now like i should add back support for alternate mapping) Because currently, it looks to me like theres no real way to do turkish stemming in lucene. The snowball turkish stemmer expects "properly" lowercased terms, and won't modify uppercase terms. public void testNounGenitive() throws IOException { // AĞACIN in turkish lowercases to ağacın, but with lowercase filter ağacin. // this happens to work, even though its lowercased wrong before stemming. // this is because the stemmer also recognizes and removes -in assertAnalyzesTo(analyzer, "ağacın", new String[] { "ağaç" }); assertAnalyzesTo(analyzer, "AĞACIN", new String[] { "ağaç" }); assertAnalyzesTo(analyzer, "ağacin", new String[] { "ağaç" }); } public void testNounAccusative() throws IOException { // AĞACI in turkish lowercases to ağacı, but with lowercase filter ağaci. // this fails due to wrong casing, because the stemmer // will only remove -ı, not -i assertAnalyzesTo(analyzer, "ağacı", new String[] { "ağaç" }); assertAnalyzesTo(analyzer, "AĞACI", new String[] { "ağaci" }); } -- Robert Muir rcm...@gmail.com