RE: Can't get tokenization/stop works working

Digy Tue, 02 Feb 2010 13:47:19 -0800

Seeing "www.fubar.com" in the index means that your analyzer returns it as a
single token. To strip out "www" and "com", you have to use an analyzer that
returns "www", "fubar" and "com" as separate tokens, because a stop filter
only removes whole tokens.


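Try a different analyzer (or write your own, as below). Note that the example
only tokenizes and lowercases; "www" and "com" still have to be removed by a
stop filter applied to the resulting tokens.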

 

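    // A C# (Lucene.Net) example
    using System.IO;
    using Lucene.Net.Analysis;

    // Analyzer that splits text on every character that is not a letter or
    // digit and lowercases the resulting tokens.
    public class LetterOrDigitAnalyzer : Analyzer
    {
        public override TokenStream TokenStream(string fieldName, TextReader reader)
        {
            TokenStream t = new LetterOrDigitTokenizer(reader);
            t = new LowerCaseFilter(t);
            return t;
        }
    }

    // Tokenizer that treats runs of letters and digits as tokens, so
    // "www.fubar.com" becomes "www", "fubar" and "com".
    public class LetterOrDigitTokenizer : CharTokenizer
    {
        public LetterOrDigitTokenizer(TextReader input) : base(input)
        {
        }

        protected override bool IsTokenChar(char c)
        {
            return char.IsLetterOrDigit(c);
        }
    }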
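
To see what this tokenization rule does to "www.fubar.com" without setting up
an index, here is a rough plain-Java sketch of the same idea. It uses only the
standard library, not the Lucene API; the class name LetterOrDigitSketch and
its two methods are made up for illustration, and the stop-word step merely
stands in for whatever stop filter you end up using.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone illustration; none of these names come from Lucene.
class LetterOrDigitSketch {

    // Split on every character that is not a letter or digit,
    // lowercasing each character that is kept.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A separator ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Drop every token that appears in the stop-word set.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("www.fubar.com");
        System.out.println(tokens);                                        // [www, fubar, com]
        System.out.println(removeStopWords(tokens, Set.of("www", "com"))); // [fubar]
    }
}
```

The point is the order of the steps: the stop words can only be dropped once
"www", "fubar" and "com" exist as separate tokens.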

 

 

DIGY

 

-----Original Message-----
From: jchang [mailto:jchangkihat...@gmail.com] 
Sent: Tuesday, February 02, 2010 11:16 PM
To: java-user@lucene.apache.org
Subject: Re: Can't get tokenization/stop works working


I am using org.apache.lucene.analysis.snowball.SnowballAnalyzer.

Looking through luke, I see that www.fubar.com was indexed, not fubar.  So,
clearly, I'm not stripping out the stop words of www and com.  Any ideas?


-- 
View this message in context:
http://old.nabble.com/Can%27t-get-tokenization-stop-works-working-tp27400546p27427519.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

