Wow! Too many grammar errors and typos! Sorry about that :D

Behrang Saeedzadeh
http://my.opera.com/behrangsa
On Sun, Mar 22, 2009 at 6:25 PM, Behrang Saeedzadeh <behran...@gmail.com> wrote:
> (A complete Mavenized project illustrating this problem is available at:
> https://dl.getdropbox.com/u/34024/lucene-porter-stemmer-bug.zip)
>
> I am using Lucene to tokenize, filter, and stem an input text with the
> following chain:
>
>     String text = IOUtils.toString(
>             getClass().getClassLoader().getSystemResourceAsStream("J01-1001.txt"));
>     StringReader reader = new StringReader(text);
>     StandardTokenizer tokenizer = new StandardTokenizer(reader);
>     LowerCaseFilter lcFilter = new LowerCaseFilter(tokenizer);
>     StopFilter stopFilter = new StopFilter(lcFilter, CustomStopWords.STOP_WORDS);
>     PorterStemFilter stemmer = new PorterStemFilter(stopFilter);
>
>     Set<String> stemmedStopWords = buildStemmedStopWords();
>
>     Token t = new Token();
>     while (stemmer.next(t) != null) {
>         if (stemmedStopWords.contains(t.term())) {
>             throw new RuntimeException("\"" + t.term()
>                     + "\" must have been removed from the output token stream but is not");
>         }
>     }
>
> However, I see some of the stemmed stop words still appearing in the output
> tokens. For example, "describe" is one of my custom stop words. The stemmed
> version of "describe" is "describ", yet "describ" appears in the output
> token stream instead of being filtered out.
>
> I thought "describe" would be filtered out by the stopFilter before ever
> reaching the stemmer. Somehow, though, it is leaking into the stemmer, being
> turned into "describ", and ending up in the output token stream.
>
> The funny thing is that when I remove the stemmer from the chain, "describe"
> does not appear in the output token stream:
>
>     String text = IOUtils.toString(
>             getClass().getClassLoader().getSystemResourceAsStream("J01-1001.txt"));
>     StringReader reader = new StringReader(text);
>     StandardTokenizer tokenizer = new StandardTokenizer(reader);
>     LowerCaseFilter lcFilter = new LowerCaseFilter(tokenizer);
>     StopFilter stopFilter = new StopFilter(lcFilter, CustomStopWords.STOP_WORDS);
>
>     Token t = new Token();
>     while (stopFilter.next(t) != null) {
>         assertFalse(CustomStopWords.STOP_WORDS.contains(t.term()));
>     }
>
> Note that in the first case I look for "describ" in the output token stream,
> whereas in the second case I look for "describe".
>
> Thanks in advance,
> Behrang Saeedzadeh
> http://my.opera.com/behrangsa
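
For reference, the buildStemmedStopWords() helper referenced in the first
snippet is only in the linked zip and is not shown above. The following is a
minimal sketch, under the assumption that it simply pushes the same stop-word
list through a LowerCaseFilter + PorterStemFilter chain (so that "describe"
ends up as "describ" in the returned set); the real helper may differ.

    // Hypothetical sketch only -- the actual buildStemmedStopWords() lives in
    // the zipped project and may be implemented differently. Assumes
    // CustomStopWords.STOP_WORDS is a Set<String>, as the contains() call above
    // suggests. (Needs java.io.*, java.util.*, and
    // org.apache.lucene.analysis.* imports.)
    private Set<String> buildStemmedStopWords() throws IOException {
        // Join the stop words into one whitespace-separated string so the
        // whole list can be run through a single analysis chain.
        StringBuilder sb = new StringBuilder();
        for (String stopWord : CustomStopWords.STOP_WORDS) {
            sb.append(stopWord).append(' ');
        }

        // Same lower-casing and Porter stemming as the main chain, so the
        // returned set holds stemmed forms such as "describ".
        TokenStream stream = new PorterStemFilter(
                new LowerCaseFilter(
                        new WhitespaceTokenizer(new StringReader(sb.toString()))));

        Set<String> stemmed = new HashSet<String>();
        Token t = new Token();
        for (Token tok = stream.next(t); tok != null; tok = stream.next(t)) {
            stemmed.add(tok.term());
        }
        return stemmed;
    }

Building the comparison set with the same filters used in the main chain keeps
the check consistent: the tokens coming out of the stemmer are compared against
stop words in their stemmed form rather than their surface form.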