On Aug 8, 2005, at 10:43 AM, Dan Armbrust wrote:
It is my understanding that the StandardAnalyzer will remove underscores - so "some_word" will be indexed as 'some' and 'word'.

I want to keep the underscores, so I was thinking of changing over to an Analyzer that uses the WhitespaceTokenizer, LowerCaseFilter, and StopFilter.

What other tokenizing magic will I lose by changing away from the StandardAnalyzer?

The best thing you can do is set up a test environment to try out sample text with various analyzers. Lucene in Action's source code (http://www.lucenebook.com) comes with such a demo that you can easily tweak. Here's a sample of running "ant AnalyzerDemo":

     [echo] Running lia.analysis.AnalyzerDemo...
     [java] Analyzing "some_word"
     [java]   WhitespaceAnalyzer:
     [java]     [some_word]

     [java]   SimpleAnalyzer:
     [java]     [some] [word]

     [java]   StopAnalyzer:
     [java]     [some] [word]

     [java]   StandardAnalyzer:
     [java]     [some] [word]

     [java]   SnowballAnalyzer:
     [java]     [some] [word]

     [java]   SnowballAnalyzer:
     [java]     [some] [word]

     [java]   SnowballAnalyzer:
     [java]     [some] [word]
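
If you don't have the book's source handy, the difference in the output above can be approximated in plain Java with no Lucene on the classpath. This is only a rough sketch of the behavior, not Lucene's implementation: the class name AnalyzerSketch, both method names, and the tiny stop list are made up for illustration, and the real StopFilter default list is different.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class AnalyzerSketch {
    // Hypothetical stop list for illustration only; not Lucene's default set.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of"));

    // Approximates a whitespace tokenizer followed by lowercasing and stop-word removal.
    // Tokens are split on whitespace only, so underscores (and punctuation) stay attached.
    static List<String> whitespaceLowerStop(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Approximates a letter-based tokenizer with lowercasing:
    // every non-letter character, including '_', ends the current token.
    static List<String> lettersLower(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sample = "The Some_Word test";
        System.out.println(whitespaceLowerStop(sample));
        System.out.println(lettersLower(sample));
    }
}
```

Under these assumptions the whitespace-based chain keeps "some_word" as a single token, while the letter-based one splits it in two. Keep in mind that splitting on whitespace alone also leaves trailing punctuation attached to tokens ("word," stays "word,"), which is one thing to check for when you try your own sample text.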
