RE: Implicit Stopping in StandardTokenizer??

Max Pfingsthorn Mon, 20 Jun 2005 08:48:59 -0700

Hi!

Here is the code:



import java.io.Reader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;

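public class SimpleStandardAnalyzer extends Analyzer {
  // No-arg constructor so the class can be instantiated via Class.forName().
  public SimpleStandardAnalyzer() {
  }

  // StandardTokenizer -> StandardFilter -> LowerCaseFilter, with no StopFilter.
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new LowerCaseFilter(
      new StandardFilter(
        new StandardTokenizer(reader)
      )
    );
  }
}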

I need it to be easy to load dynamically via Class.forName(), because we use it
in an XML-configured environment (i.e. Avalon-like). So passing extra arguments
to constructors is not really what I am looking for. However, I guess I could
make a wrapper like this:

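import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class SimpleStandardAnalyzer extends StandardAnalyzer {

  public SimpleStandardAnalyzer()
  {
    // Empty stop word list: nothing should be removed.
    super(new String[0]);
  }
}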

I tried this too, but it has the same effect: "this", "is", etc. still get
filtered out, even with no stop words set. Any ideas?
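
To make the expected and the observed output concrete, here is a rough plain-Java sketch. It does not use Lucene at all: the class name, the analyze() method, and the three-word stop list are made up for illustration, and the splitting rule is only an approximation of what the tokenizer does with this particular input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java stand-in, NOT Lucene code: names and the sample stop list
// are hypothetical and exist only for this illustration.
class StopSketch {

  // Split on anything that is not a letter or digit, lowercase each piece,
  // and drop the pieces that appear in stopWords.
  static List<String> analyze(String text, Set<String> stopWords) {
    List<String> tokens = new ArrayList<String>();
    for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
      String term = piece.toLowerCase();
      if (term.length() > 0 && !stopWords.contains(term)) {
        tokens.add(term);
      }
    }
    return tokens;
  }

  public static void main(String[] args) {
    String text = "hello,this,is,a,keyword,hello!,nicetomeetyou";

    // Empty stop list: the output I expect from my analyzer.
    System.out.println(analyze(text, Collections.<String>emptySet()));

    // Stop list containing "this", "is" and "a": the output I actually see.
    Set<String> stops = new HashSet<String>(Arrays.asList("this", "is", "a"));
    System.out.println(analyze(text, stops));
  }
}
```

With the empty set, all seven terms come through; with the small stop list, only "hello", "keyword", "hello" and "nicetomeetyou" remain, which matches what Luke shows me for the index.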

Thanks a lot!
max


-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Monday, June 20, 2005 16:57
To: java-user@lucene.apache.org
Subject: Re: Implicit Stopping in StandardTokenizer??



On Jun 20, 2005, at 10:41 AM, Max Pfingsthorn wrote:

> Hi!
>
> I've been trying to make an Analyzer which works like the
> StandardAnalyzer but without stopping. For some reason though, I
> still don't get words like "is" or "a" out of it... I checked with
> Luke (one doc in one index with the contents
> "hello,this,is,a,keyword,hello!,nicetomeetyou"). This should
> tokenize into "hello this is a keyword hello nicetomeetyou", but
> actually it produces "hello keyword hello nicetomeetyou". Does anyone
> know why it drops those extra terms?

Show us the code of your analyzer.

If all you want is StandardAnalyzer to not remove stop words, you can  
construct it this way:

     analyzer = new StandardAnalyzer(new String[] {});

The String[] holds the stop words to remove; in this case, none.

     Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

