Re: Lucene - Search breadth approach

Shai Erera Wed, 22 Jul 2009 21:09:55 -0700

I don't use the Lucene stemming Analyzers. My version, if asked to keep the
original tokens, sets the position of both stem and original to be the same,
and adds another character to the stem version.


During query, that Analyzer is usually instructed to not keep the original
tokens, just the stems.

Shai

On Wed, Jul 22, 2009 at 11:40 PM, Erick Erickson <erickerick...@gmail.com>wrote:

> But as far as I know, it doesn't index the original termtoo (at the same
> offset), which you have to do if you
> want to distinguish between the two cases, I think.
>
> But I confess I've been out of the guts of Lucene for some
> time, so I could be waaaaay off.
>
> But you'd sure want to use a different token <G>....
>
> Erick
>
>
> On Wed, Jul 22, 2009 at 4:12 PM, Shai Erera <ser...@gmail.com> wrote:
>
> > Actually my stemming Analyzer adds a similar character to stems, to
> > distinguish between original tokens (like orig=test) to stems (testing
> -->
> > test$).
> >
> > On Wed, Jul 22, 2009 at 11:02 PM, Erick Erickson <
> erickerick...@gmail.com
> > >wrote:
> >
> > > A closely related approach to what Shai outlined is to index the
> > > *original*token
> > > with a special ender (say $) with a 0 increment (see SynonymAnalyzer
> > > in LIA). Then, whenever you determined you wanted to use the un-stemmed
> > > version, just add your token to the terms (i.e. testing$ when you
> didn't
> > > want
> > > the analyzer to match on the stemmed "test"). I'm pretty sure that the
> > > presence
> > > of the $ will short-circuit stemming, but you'll have to be sure that
> > > whatever
> > > analyzer you use doesn't strip it.
> > >
> > > Best
> > > Erick
> > >
> > > On Wed, Jul 22, 2009 at 9:16 AM, Shai Erera <ser...@gmail.com> wrote:
> > >
> > > > Hi Robert,
> > > >
> > > > What you could do is use the Stemmer (as a TokenFilter I assume) and
> > > > produce
> > > > two tokens always - the stem and the original. Index both of them in
> > the
> > > > same position.
> > > >
> > > > Then tell your users that if they search for [testing], it will find
> > > > results
> > > > for 'testing', 'test' etc (the stems) and if they search for
> > ["testing"]
> > > it
> > > > will do an "exact" search for the word 'testing'.
> > > >
> > > > If you choose to go this way, you'll need to override QueryParser and
> > > when
> > > > you encounter a phrase, don't run it by the Analyzer you use which
> > stems,
> > > > but by a different Analyzer, maybe WhitespaceAnalyzer, so that it
> will
> > > > produce the word 'testing'.
> > > >
> > > > With that approach BTW, you can search on stopwords that are part of
> > the
> > > > phrase, given of course that you also indexed them.
> > > >
> > > > I'm not aware of any class in Lucene that will allow you to produce
> two
> > > > tokens, except may TeeTokenFilter. I wrote my own to do that and it's
> > > > really
> > > > not a big thing to do.
> > > >
> > > > Hope this helps,
> > > >
> > > > Shai
> > > >
> > > > On Wed, Jul 22, 2009 at 1:09 PM, Robert Corbett <java....@gmail.com>
> > > > wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I would like to use a stemming analyser similar to KStem or
> > PorterStem
> > > to
> > > > > provide access to a wider search scope for our users. However, at
> the
> > > > same
> > > > > time I also want to provide the ability for the users to throw out
> > the
> > > > > stems
> > > > > if they want to search more accurately. I have a number of ideas as
> > to
> > > > the
> > > > > best way to implement this.
> > > > >
> > > > > I can control the breadth of the search scope with a checkbox on
> the
> > > ui.
> > > > > When the scope is wide, I will use the stems, when its narrow (or
> > > exact)
> > > > > I'll avoid using the stems.
> > > > >
> > > > > The approach I envisage is to index the fields twice. Once using
> the
> > > > > StandardAnalyser and a second time using the Stemmer. I'll attach a
> > > > suffix
> > > > > to the name of the stemmed set in the index. So for example, TITLE
> > > > > (contains
> > > > > only StandardAnalyser output) and TITLE_STEM (contains the
> > StemAnalyser
> > > > > output). When I come to generate the query object, I will first
> check
> > > the
> > > > > search breath on the UI. If its wide, I'll use the TITLE_STEM
> column
> > > > > parsing
> > > > > the query with the StemAnalyser, otherwise I'll use the TITLE
> column
> > > with
> > > > > the query being parsed with the StandardAnalyser.
> > > > >
> > > > > Although I appreciate it will result in a much larger index and
> > longer
> > > > > indexing time, this approach will allow me to implement the
> required
> > > > > functionality.
> > > > >
> > > > > I just wanted to check with you guys that there is no better,
> perhaps
> > > > more
> > > > > efficient way of achieving my goals before taking the above
> approach.
> > > All
> > > > > feedback / advice will be warmly received.
> > > > >
> > > > > Thanks guys!
> > > > >
> > > >
> > >
> >
>

Re: Lucene - Search breadth approach

Reply via email to