(( A complete Mavenized project illustrating this problem is available at: https://dl.getdropbox.com/u/34024/lucene-porter-stemmer-bug.zip ))
I am using Lucene to tokenize, filter, and stem an input text with the following chain:

    String text = IOUtils.toString(getClass().getClassLoader().getSystemResourceAsStream("J01-1001.txt"));
    StringReader reader = new StringReader(text);

    // tokenize -> lowercase -> remove stop words -> stem
    StandardTokenizer tokenizer = new StandardTokenizer(reader);
    LowerCaseFilter lcFilter = new LowerCaseFilter(tokenizer);
    StopFilter stopFilter = new StopFilter(lcFilter, CustomStopWords.STOP_WORDS);
    PorterStemFilter stemmer = new PorterStemFilter(stopFilter);

    Set<String> stemmedStopWords = buildStemmedStopWords();

    Token t = new Token();
    while (stemmer.next(t) != null) {
        if (stemmedStopWords.contains(t.term())) {
            throw new RuntimeException("\"" + t.term()
                    + "\" must have been removed from the output token stream but is not");
        }
    }

However, some of the stemmed stop words still appear in the output tokens. For example, "describe" is one of my custom stop words, and its stemmed form is "describ". Yet "describ" appears in the output token stream instead of being filtered out.

I expected "describe" to be filtered out by the stopFilter before it ever reaches the stemmer. Somehow, though, it is leaking into the stemmer, where it is turned into "describ" and passed on to the output token stream.

The odd thing is that when I remove the stemmer from the chain, "describe" does not make it into the output token stream:

    String text = IOUtils.toString(getClass().getClassLoader().getSystemResourceAsStream("J01-1001.txt"));
    StringReader reader = new StringReader(text);

    // same chain, but without the stemmer at the end
    StandardTokenizer tokenizer = new StandardTokenizer(reader);
    LowerCaseFilter lcFilter = new LowerCaseFilter(tokenizer);
    StopFilter stopFilter = new StopFilter(lcFilter, CustomStopWords.STOP_WORDS);

    Token t = new Token();
    while (stopFilter.next(t) != null) {
        assertFalse(CustomStopWords.STOP_WORDS.contains(t.term()));
    }

Note that in the first case I look for "describ" in the output token stream, while in the second case I look for "describe".

Thanks in advance,
Behrang Saeedzadeh
http://my.opera.com/behrangsa
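P.S. For completeness, buildStemmedStopWords() is only meant to run each custom stop word through the same Porter stemmer so the check above can compare stemmed forms. The exact method is in the linked project; treat the sketch below as an approximation of it (it assumes CustomStopWords.STOP_WORDS is a Set<String> and reuses the same Token pattern as above):

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    // Stems every custom stop word with the same Porter stemmer used in the
    // main chain, so the test can look for the stemmed forms in the output.
    private Set<String> buildStemmedStopWords() throws IOException {
        Set<String> stemmed = new HashSet<String>();
        for (String stopWord : CustomStopWords.STOP_WORDS) {
            // run the single stop word through lowercase + Porter stemming
            TokenStream stream = new PorterStemFilter(
                    new LowerCaseFilter(
                            new WhitespaceTokenizer(new StringReader(stopWord))));
            Token t = new Token();
            while (stream.next(t) != null) {
                stemmed.add(t.term());
            }
        }
        return stemmed;
    }

Either way, "describe" should come out of this as "describ", which is the token I am checking for in the first test.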