Seeing "www.fubar.com" in the index means that your analyzer emits it as a single token, so a stop filter never sees "www" or "com" on their own. To strip out "www" and "com", you have to use an analyzer that splits the text into the tokens "www", "fubar" and "com".
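To illustrate the point, here is a minimal sketch in plain Java (no Lucene classes involved) of what splitting on non-letter/non-digit characters does to the input, and why stop word removal only works after that split. The class name, the method names and the two-word stop list are made up for this example; they are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TokenSplitSketch {
    // Hypothetical stop list, for this illustration only.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("www", "com"));

    // Emit a token for every run of letters/digits, lowercased.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // Any other character (such as '.') ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Stop word removal compares whole tokens against the stop list.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("www.fubar.com");
        System.out.println(tokens);                  // [www, fubar, com]
        System.out.println(removeStopWords(tokens)); // [fubar]
    }
}
```

If the whole string "www.fubar.com" arrives as one token, removeStopWords leaves it untouched, because it matches neither "www" nor "com".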
Try a different analyzer (or write your own, as below).

    // A C# (Lucene.Net) example
    using System.IO;
    using Lucene.Net.Analysis;

    public class LetterOrDigitAnalyzer : Analyzer
    {
        public override TokenStream TokenStream(string fieldName, TextReader reader)
        {
            // Split the input on every character that is not a letter or digit,
            // then lowercase each token.
            TokenStream t = new LetterOrDigitTokenizer(reader);
            t = new LowerCaseFilter(t);
            // A stop filter for "www" and "com" could be chained here.
            return t;
        }
    }

    public class LetterOrDigitTokenizer : CharTokenizer
    {
        public LetterOrDigitTokenizer(TextReader input) : base(input) { }

        // A token is a run of letters and digits; '.' and other characters end it.
        protected override bool IsTokenChar(char c)
        {
            return char.IsLetterOrDigit(c);
        }
    }

DIGY

-----Original Message-----
From: jchang [mailto:jchangkihat...@gmail.com]
Sent: Tuesday, February 02, 2010 11:16 PM
To: java-user@lucene.apache.org
Subject: Re: Can't get tokenization/stop works working

I am using org.apache.lucene.analysis.snowball.SnowballAnalyzer. Looking through luke, I see that www.fubar.com was indexed, not fubar. So, clearly, I'm not stripping out the stop words of www and com. Any ideas?