A closely related approach to what Shai outlined is to index the *original* token with a special ender (say $) at a position increment of 0, so it sits at the same position as the stem (see SynonymAnalyzer in LIA). Then, whenever you want the un-stemmed version, just add the marked token to your query terms (e.g. testing$ when you don't want the analyzer to match on the stemmed "test"). I'm pretty sure that the presence of the $ will short-circuit stemming, but you'll have to make sure that whatever analyzer you use doesn't strip it.
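To make that concrete, here is a rough sketch in plain Java. It deliberately does not use the Lucene API: the class ExactMarkerSketch, the Term holder and the toyStem method are all hypothetical stand-ins I made up for illustration, and toyStem is nowhere near a real stemmer like Porter or KStem. In Lucene itself this logic would live in a TokenFilter. The only point is to show the stem at increment 1 and the marked original at increment 0, plus how the query side would pick one or the other.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ExactMarkerSketch {
    /** Hypothetical holder: a term plus its position increment (0 = same position as the previous term). */
    public static final class Term {
        public final String text;
        public final int positionIncrement;

        Term(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    /** Marker appended to the un-stemmed form; the choice of '$' is an assumption. */
    static final char EXACT_MARKER = '$';

    /** Toy stand-in for a real stemmer: strips a trailing "ing" or "s". */
    public static String toyStem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    /** Index side: emit the stem, then the marked original at the same position. */
    public static List<Term> analyzeForIndex(String text) {
        List<Term> out = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String word = raw.toLowerCase(Locale.ROOT);
            out.add(new Term(toyStem(word), 1));       // stem advances the position
            out.add(new Term(word + EXACT_MARKER, 0)); // original$ stacked on the stem
        }
        return out;
    }

    /** Query side: use the marked original for exact searches, the stem otherwise. */
    public static String queryTerm(String word, boolean exact) {
        String w = word.toLowerCase(Locale.ROOT);
        return exact ? w + EXACT_MARKER : toyStem(w);
    }

    public static void main(String[] args) {
        for (Term t : analyzeForIndex("Testing the tests")) {
            System.out.println(t.text + " (+" + t.positionIncrement + ")");
        }
        System.out.println(queryTerm("Testing", true));
        System.out.println(queryTerm("Testing", false));
    }
}
```

Because both forms share a position, phrase queries should still line up whichever form you search on, and you only need one field rather than a TITLE / TITLE_STEM pair.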
Best
Erick

On Wed, Jul 22, 2009 at 9:16 AM, Shai Erera <ser...@gmail.com> wrote:

> Hi Robert,
>
> What you could do is use the Stemmer (as a TokenFilter I assume) and produce
> two tokens always - the stem and the original. Index both of them in the
> same position.
>
> Then tell your users that if they search for [testing], it will find results
> for 'testing', 'test' etc (the stems) and if they search for ["testing"] it
> will do an "exact" search for the word 'testing'.
>
> If you choose to go this way, you'll need to override QueryParser and when
> you encounter a phrase, don't run it by the Analyzer you use which stems,
> but by a different Analyzer, maybe WhitespaceAnalyzer, so that it will
> produce the word 'testing'.
>
> With that approach BTW, you can search on stopwords that are part of the
> phrase, given of course that you also indexed them.
>
> I'm not aware of any class in Lucene that will allow you to produce two
> tokens, except maybe TeeTokenFilter. I wrote my own to do that and it's
> really not a big thing to do.
>
> Hope this helps,
>
> Shai
>
> On Wed, Jul 22, 2009 at 1:09 PM, Robert Corbett <java....@gmail.com> wrote:
>
> > Hello,
> >
> > I would like to use a stemming analyser similar to KStem or PorterStem to
> > provide access to a wider search scope for our users. However, at the same
> > time I also want to provide the ability for the users to throw out the
> > stems if they want to search more accurately. I have a number of ideas as
> > to the best way to implement this.
> >
> > I can control the breadth of the search scope with a checkbox on the UI.
> > When the scope is wide, I will use the stems; when it's narrow (or exact)
> > I'll avoid using the stems.
> >
> > The approach I envisage is to index the fields twice: once using the
> > StandardAnalyser and a second time using the Stemmer. I'll attach a suffix
> > to the name of the stemmed set in the index. So for example, TITLE
> > (contains only StandardAnalyser output) and TITLE_STEM (contains the
> > StemAnalyser output). When I come to generate the query object, I will
> > first check the search breadth on the UI. If it's wide, I'll use the
> > TITLE_STEM column, parsing the query with the StemAnalyser; otherwise I'll
> > use the TITLE column with the query being parsed with the StandardAnalyser.
> >
> > Although I appreciate it will result in a much larger index and longer
> > indexing time, this approach will allow me to implement the required
> > functionality.
> >
> > I just wanted to check with you guys that there is no better, perhaps more
> > efficient way of achieving my goals before taking the above approach. All
> > feedback / advice will be warmly received.
> >
> > Thanks guys!