Re: Handling hyphens and other punctuation in proper nouns

Erick Erickson Wed, 24 May 2006 17:18:28 -0700

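There are several analyzers provided with Lucene that you could check out.
SimpleAnalyzer, WhitespaceAnalyzer and KeywordAnalyzer all come to mind.
WhitespaceAnalyzer splits only on whitespace, so it won't break at the hyphen.
KeywordAnalyzer keeps the whole field as a single token. SimpleAnalyzer
splits at every non-letter, so it will break at the hyphen.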
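
To make the difference concrete, here is a rough plain-Java sketch of the two
splitting strategies. It does not use Lucene at all: TokenizeSketch and its
two methods are made-up names that only approximate what WhitespaceAnalyzer
(split on whitespace) and SimpleAnalyzer (keep runs of letters, lowercased)
do, so treat it as an illustration rather than the library's behaviour in
every edge case.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; these are NOT Lucene classes.
class TokenizeSketch {

    // Split on whitespace only, roughly what WhitespaceAnalyzer does.
    static List<String> whitespaceTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Keep runs of letters and lowercase them, roughly what SimpleAnalyzer does.
    static List<String> letterTokens(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                cur.append(Character.toLowerCase(c));
            } else if (cur.length() > 0) {
                // Any non-letter (hyphen, apostrophe, space) ends the token.
                out.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) {
            out.add(cur.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        String name = "Jean-Luc O'Brian";
        System.out.println(whitespaceTokens(name)); // [Jean-Luc, O'Brian]
        System.out.println(letterTokens(name));     // [jean, luc, o, brian]
    }
}
```

The whitespace split keeps "Jean-Luc" and "O'Brian" whole, while the
letter-run split breaks each of them in two.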


NOTE: be sure you pay attention to which analyzer is used if you are using
QueryParser, since the terms in the query are analyzed too. If it differs
from the analyzer used at index time, query terms may not match what was
indexed.

I've had fun with PerFieldAnalyzerWrapper to handle different analyzers for
different fields if that's something you want to do.

See Analyzer in the JavaDoc. It lists "all known subclasses", which will lead
you to the ones mentioned above plus quite a few others. PatternAnalyzer
works with regular expressions. How cool is that?

You *may* want to write your own analyzer and/or pre-process the
tokens before indexing and/or before querying. For instance, should O'Brian
match OBrian? (Notice the apostrophe; it may not be obvious depending on
your font.)
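
If you decide the answer is "yes", one way to pre-process is to strip
apostrophes from each token, and to do it on both the indexing side and the
query side so the two agree. A minimal plain-Java sketch follows.
NameNormalizer, normalize and sameName are hypothetical names of mine, not
anything in Lucene, and the choice to drop only straight and curly
apostrophes is an assumption you would adjust for your data.

```java
// Hypothetical pre-processing helper; not part of Lucene.
class NameNormalizer {

    // Remove straight (') and curly (\u2019) apostrophes from a token.
    static String normalize(String token) {
        return token.replaceAll("['\u2019]", "");
    }

    // Two tokens match if they are equal after the same normalization,
    // which is why it has to run at index time and at query time alike.
    static boolean sameName(String indexed, String queried) {
        return normalize(indexed).equals(normalize(queried));
    }

    public static void main(String[] args) {
        System.out.println(normalize("O'Brian"));          // OBrian
        System.out.println(sameName("O'Brian", "OBrian")); // true
    }
}
```

In a real setup the same step could live in a custom token filter, but the
idea is the same: whatever you do to the indexed tokens, do to the query
tokens too.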

Best
Erick
