Wow! Too many grammar errors and typos! Sorry about that :D
Behrang Saeedzadeh
http://my.opera.com/behrangsa


On Sun, Mar 22, 2009 at 6:25 PM, Behrang Saeedzadeh <behran...@gmail.com> wrote:

> (( A complete Mavenized project illustrating this problem is available at:
> https://dl.getdropbox.com/u/34024/lucene-porter-stemmer-bug.zip ))
>
> I am using Lucene to tokenize, filter, and stem an input text using the
> following chain:
>
> String text = IOUtils.toString(
>         getClass().getClassLoader().getSystemResourceAsStream("J01-1001.txt"));
> StringReader reader = new StringReader(text);
> StandardTokenizer tokenizer = new StandardTokenizer(reader);
> LowerCaseFilter lcFilter = new LowerCaseFilter(tokenizer);
> StopFilter stopFilter = new StopFilter(lcFilter, CustomStopWords.STOP_WORDS);
> PorterStemFilter stemmer = new PorterStemFilter(stopFilter);
>
> Set<String> stemmedStopWords = buildStemmedStopWords();
>
> Token t = new Token();
> while (stemmer.next(t) != null) {
>     if (stemmedStopWords.contains(t.term())) {
>         throw new RuntimeException("\"" + t.term()
>                 + "\" must have been removed from the output token stream but is not");
>     }
> }
>
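> For reference, buildStemmedStopWords() is roughly along the lines of the
> minimal sketch below. It is only an illustration (not necessarily the exact
> code in the linked project); it assumes CustomStopWords.STOP_WORDS is a
> Set<String> of lower-case words, uses a WhitespaceTokenizer for brevity, and
> omits the usual java.util / java.io / org.apache.lucene.analysis imports:
>
> private Set<String> buildStemmedStopWords() throws IOException {
>     Set<String> stemmed = new HashSet<String>();
>     for (String word : CustomStopWords.STOP_WORDS) {
>         // run each stop word through the same Porter stemmer used in the chain
>         TokenStream stream = new PorterStemFilter(
>                 new WhitespaceTokenizer(new StringReader(word)));
>         Token t = new Token();
>         for (Token tok = stream.next(t); tok != null; tok = stream.next(t)) {
>             stemmed.add(tok.term());
>         }
>     }
>     return stemmed;
> }
>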
> However, I see some of the stemmed stop words still appearing in the output
> tokens. For example, "describe" is one of my custom stop words, and its
> stemmed form is "describ". Yet "describ" appears in the output token stream
> instead of being filtered out.
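>
> Checking the stemmer on its own confirms that mapping; a minimal sketch like
> the following (illustrative only, WhitespaceTokenizer used for brevity)
> prints "describ" for the input "describe":
>
> TokenStream stem = new PorterStemFilter(
>         new WhitespaceTokenizer(new StringReader("describe")));
> Token t = new Token();
> for (Token tok = stem.next(t); tok != null; tok = stem.next(t)) {
>     System.out.println(tok.term()); // prints "describ"
> }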
>
> I thought "describe" would be filtered out by the stopFilter before it ever
> reached the stemmer. Somehow, though, it is leaking into the stemmer, getting
> stemmed to "describ", and ending up in the output token stream.
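>
> To make that expectation concrete, a minimal sketch along these lines (again
> illustrative only, reusing the same filters on a one-word input) should yield
> no tokens at all, because the StopFilter should swallow "describe" before the
> stemmer ever sees it:
>
> TokenStream chain = new PorterStemFilter(
>         new StopFilter(
>                 new LowerCaseFilter(
>                         new WhitespaceTokenizer(new StringReader("describe"))),
>                 CustomStopWords.STOP_WORDS));
> Token t = new Token();
> assertNull(chain.next(t)); // the one-word stream should come back empty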
>
> The funny thing is that when I remove the stemmer from the chain, "describe"
> does not go into the output token stream:
>
> String text = IOUtils.toString(
>         getClass().getClassLoader().getSystemResourceAsStream("J01-1001.txt"));
> StringReader reader = new StringReader(text);
> StandardTokenizer tokenizer = new StandardTokenizer(reader);
> LowerCaseFilter lcFilter = new LowerCaseFilter(tokenizer);
> StopFilter stopFilter = new StopFilter(lcFilter, CustomStopWords.STOP_WORDS);
>
> Token t = new Token();
> while (stopFilter.next(t) != null) {
>     assertFalse(CustomStopWords.STOP_WORDS.contains(t.term()));
> }
>
> Notice that in the first case I look for "describ" in the output token
> stream, whereas in the second case I look for "describe".
>
> Thanks in advance,
> Behrang Saeedzadeh
> http://my.opera.com/behrangsa
>
