RE: Indexing punctuation

Aigner, Thomas Tue, 28 Jun 2005 11:55:59 -0700

Thanks for the info, Chris.


I thought I'd provide some more information.  One problem is that the
descriptions are not consistently formatted. In other words, a description
doesn't follow a fixed set of rules (num num - alpha alpha, etc.).  They
are literally anything a supplier has put in for them.

 

The example below (21-MA-GAB) is stored differently by these analyzers:

WhitespaceAnalyzer:     [21-MA-GAB]

SimpleAnalyzer:         [ma] [gab]

StopAnalyzer:           [ma] [gab]

StandardAnalyzer:       [21-ma] [gab]

SynonymAnalyzer:        [21-ma] [gab]

      (one I created for synonyms; much like the standard one)

SnowballAnalyzer:       [21-ma] [gab]

 

My problem is that searching for 21magab returns nothing, and neither
does 21ma*, etc.
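
To illustrate why those queries come back empty, here is a minimal sketch
in plain Java. It does not use Lucene at all; the class and method names
are made up for illustration. It assumes a prefix query such as 21ma*
simply looks for indexed terms that start with the prefix, and it uses
the StandardAnalyzer terms listed above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; plain Java, no Lucene involved.
final class PrefixMatchDemo {

    // True if any indexed term starts with the given prefix,
    // which is roughly what a query like 21ma* asks for (assumption).
    static boolean anyTermHasPrefix(List<String> terms, String prefix) {
        for (String term : terms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Terms copied from the StandardAnalyzer line above.
        List<String> indexed = Arrays.asList("21-ma", "gab");

        System.out.println(indexed.contains("21magab"));         // false
        System.out.println(anyTermHasPrefix(indexed, "21ma"));   // false
        System.out.println(anyTermHasPrefix(indexed, "21-ma"));  // true
    }
}
```

The hyphen kept inside the term [21-ma] is what stops both the exact and
the prefix lookup from matching.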

      

This is just one of my punctuation problems; there can also be "" for
inches, fractions like 1/2, etc.

 

I am currently using my SynonymAnalyzer (for some aliases) to build the
index and the SnowballAnalyzer to query the index (it has nice stemming).

 

Tom

 

-----Original Message-----
From: Chris D [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, June 28, 2005 2:41 PM
To: java-user@lucene.apache.org
Subject: Re: Indexing punctuation

 

On 6/28/05, Aigner, Thomas <[EMAIL PROTECTED]> wrote:

> Hello all,
>
>         I am VERY new to Lucene and we are trying out Lucene to see if
> it will accomplish the vast majority of our search functions.
>
>         I have a question about a good way to index some of our product
> description codes.  We have description codes like 21-MA-GAB and other
> punctuation.  Our users need to be able to search for "21 MA GAB" or
> "21-MA_GAB" or "21MAGAB".  Is the best way to accomplish this by
> creating synonyms for the 3 different ways when punctuation is in parts
> to search for? I know I can stop punctuation in the index but what about
> grouping the information together or with spaces?
>
> Thanks all in advance,
> Tom

 

There are a couple of ways to do this, and I'm not sure which would be
best. (I'm also fairly new to Lucene.)

 

You can create a grammar that recognizes your product codes (see
StandardAnalyzer code for examples on how to do that) then use a
custom filter to normalize everything.

 

Forgive my poor lex, but the general idea is:

 

| <CODE: <NUM><NUM> ("-"|"_"|" ")? (<ALPHA>)+ ("-"|"_"|" ")? (<ALPHA>)+ >

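For anyone who wants to try the pattern before touching the grammar, here
is a rough java.util.regex stand-in for the CODE rule above. It is only a
sketch, not the JavaCC rule itself: the class name is made up, and it
assumes NUM means a single ASCII digit and ALPHA a single ASCII letter.

```java
import java.util.regex.Pattern;

// Hypothetical helper; a regex approximation of the CODE rule, not Lucene code.
final class CodePattern {

    // Two digits, an optional separator (-, _ or space), one or more letters,
    // another optional separator, then one or more letters.
    static final Pattern CODE =
            Pattern.compile("[0-9]{2}[-_ ]?[A-Za-z]+[-_ ]?[A-Za-z]+");

    // True only if the whole string looks like a product code.
    static boolean isCode(String s) {
        return CODE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"21-MA-GAB", "21 MA GAB", "21-MA_GAB", "21MAGAB", "1/2"};
        for (String s : samples) {
            System.out.println(s + " -> " + isCode(s));
        }
    }
}
```

The four spellings of the code from this thread match; something like 1/2
does not, so it would still need its own rule.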
 

Then, in the filter, normalize the token by stripping out all of the
punctuation. This can be done with a regex or something faster; the
regex below is just for reference.

 

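   // As in StandardFilter.next(): t is the Token from input.next(),
   // text is t.termText() and type is t.type().
   if (type == CODE_TYPE) {
      // Strip every separator the CODE rule allows: "-", "_" and " ".
      return new org.apache.lucene.analysis.Token(
            text.replaceAll("[-_ ]", ""), t.startOffset(), t.endOffset(), type);
   } ...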
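
As a standalone check of the normalization idea, here is a small sketch
in plain Java with no Lucene classes. The class and method names are
hypothetical, and the lowercasing is an assumption on my part, added
because the analyzers listed earlier in the thread lowercase their terms.
It strips the separators the CODE rule allows, so every spelling of a
code collapses to one term.

```java
import java.util.Locale;

// Hypothetical helper; shows the normalization only, outside any TokenFilter.
final class CodeNormalizer {

    // Remove "-", "_" and " " and lowercase the rest (lowercasing is assumed).
    static String normalize(String code) {
        return code.replaceAll("[-_ ]", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // The indexed form and the three query spellings from the thread.
        String[] spellings = {"21-MA-GAB", "21 MA GAB", "21-MA_GAB", "21MAGAB"};
        for (String s : spellings) {
            System.out.println(s + " -> " + normalize(s));  // each prints 21magab
        }
    }
}
```

The same normalization has to run on the query side as well, otherwise
the query term and the indexed term still won't line up.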

 

See StandardAnalyzer, it has a lot of code that would do what you need
and you can copy, paste and edit.

 

You could also do synonyms, but that seems like it would be more overhead.

 

If you think of a better way, let me know; I have to do something similar.

 

Cheers,

Chris

 
