(( A complete Mavenized project illustrating this problem is available at: https://dl.getdropbox.com/u/34024/lucene-porter-stemmer-bug.zip ))
I am using Lucene to tokenize, filter, and stem an input text with the following chain:

    String text = IOUtils.toString(getClass().getClassLoader().getSystemResourceAsStream("J01-1001.txt"));
    StringReader reader = new StringReader(text);

    // tokenize -> lowercase -> remove stop words -> stem
    StandardTokenizer tokenizer = new StandardTokenizer(reader);
    LowerCaseFilter lcFilter = new LowerCaseFilter(tokenizer);
    StopFilter stopFilter = new StopFilter(lcFilter, CustomStopWords.STOP_WORDS);
    PorterStemFilter stemmer = new PorterStemFilter(stopFilter);

    Set<String> stemmedStopWords = buildStemmedStopWords();

    Token t = new Token();
    while (stemmer.next(t) != null) {
        if (stemmedStopWords.contains(t.term())) {
            throw new RuntimeException("\"" + t.term()
                    + "\" must have been removed from the output token stream but is not");
        }
    }

However, some of the stemmed stop words still appear in the output tokens. For example, "describe" is one of my custom stop words, and its stemmed form is "describ". Yet "describ" appears in the output token stream instead of being filtered out.

I expected "describe" to be filtered out by the stopFilter before it ever reaches the stemmer. Somehow, though, it is leaking into the stemmer, where it is turned into "describ" and passed on to the output token stream.

The odd thing is that when I remove the stemmer from the chain, "describe" does not make it into the output token stream:

    String text = IOUtils.toString(getClass().getClassLoader().getSystemResourceAsStream("J01-1001.txt"));
    StringReader reader = new StringReader(text);

    // same chain, but without the stemmer at the end
    StandardTokenizer tokenizer = new StandardTokenizer(reader);
    LowerCaseFilter lcFilter = new LowerCaseFilter(tokenizer);
    StopFilter stopFilter = new StopFilter(lcFilter, CustomStopWords.STOP_WORDS);

    Token t = new Token();
    while (stopFilter.next(t) != null) {
        assertFalse(CustomStopWords.STOP_WORDS.contains(t.term()));
    }

Note that in the first case I look for "describ" in the output token stream, while in the second case I look for "describe".

Thanks in advance,
Behrang Saeedzadeh
http://my.opera.com/behrangsa
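P.S. For completeness, buildStemmedStopWords() is only meant to run each custom stop word through the same Porter stemmer so the check above can compare stemmed forms. The exact method is in the linked project; treat the sketch below as an approximation of it (it assumes CustomStopWords.STOP_WORDS is a Set<String> and reuses the same Token pattern as above):

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    // Stems every custom stop word with the same Porter stemmer used in the
    // main chain, so the test can look for the stemmed forms in the output.
    private Set<String> buildStemmedStopWords() throws IOException {
        Set<String> stemmed = new HashSet<String>();
        for (String stopWord : CustomStopWords.STOP_WORDS) {
            // run the single stop word through lowercase + Porter stemming
            TokenStream stream = new PorterStemFilter(
                    new LowerCaseFilter(
                            new WhitespaceTokenizer(new StringReader(stopWord))));
            Token t = new Token();
            while (stream.next(t) != null) {
                stemmed.add(t.term());
            }
        }
        return stemmed;
    }

Either way, "describe" should come out of this as "describ", which is the token I am checking for in the first test.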