But as far as I know, it doesn't index the original term too (at the same offset), which you have to do if you want to distinguish between the two cases, I think.
But I confess I've been out of the guts of Lucene for some time, so I could be waaaaay off. But you'd sure want to use a different token <G>....

Erick

On Wed, Jul 22, 2009 at 4:12 PM, Shai Erera <ser...@gmail.com> wrote:

> Actually my stemming Analyzer adds a similar character to stems, to
> distinguish between original tokens (like orig=test) and stems
> (testing --> test$).
>
> On Wed, Jul 22, 2009 at 11:02 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> > A closely related approach to what Shai outlined is to index the
> > *original* token with a special ender (say $) with a 0 increment (see
> > SynonymAnalyzer in LIA). Then, whenever you determine you want to use
> > the un-stemmed version, just add your token to the terms (i.e. testing$
> > when you don't want the analyzer to match on the stemmed "test"). I'm
> > pretty sure that the presence of the $ will short-circuit stemming, but
> > you'll have to be sure that whatever analyzer you use doesn't strip it.
> >
> > Best
> > Erick
> >
> > On Wed, Jul 22, 2009 at 9:16 AM, Shai Erera <ser...@gmail.com> wrote:
> >
> > > Hi Robert,
> > >
> > > What you could do is use the Stemmer (as a TokenFilter I assume) and
> > > always produce two tokens - the stem and the original. Index both of
> > > them in the same position.
> > >
> > > Then tell your users that if they search for [testing], it will find
> > > results for 'testing', 'test' etc (the stems) and if they search for
> > > ["testing"] it will do an "exact" search for the word 'testing'.
> > >
> > > If you choose to go this way, you'll need to override QueryParser and
> > > when you encounter a phrase, don't run it by the Analyzer you use
> > > which stems, but by a different Analyzer, maybe WhitespaceAnalyzer, so
> > > that it will produce the word 'testing'.
> > >
> > > With that approach BTW, you can search on stopwords that are part of
> > > the phrase, given of course that you also indexed them.
> > >
> > > I'm not aware of any class in Lucene that will allow you to produce
> > > two tokens, except maybe TeeTokenFilter. I wrote my own to do that and
> > > it's really not a big thing to do.
> > >
> > > Hope this helps,
> > >
> > > Shai
> > >
> > > On Wed, Jul 22, 2009 at 1:09 PM, Robert Corbett <java....@gmail.com> wrote:
> > >
> > > > Hello,
> > > >
> > > > I would like to use a stemming analyser similar to KStem or
> > > > PorterStem to provide access to a wider search scope for our users.
> > > > However, at the same time I also want to provide the ability for the
> > > > users to throw out the stems if they want to search more accurately.
> > > > I have a number of ideas as to the best way to implement this.
> > > >
> > > > I can control the breadth of the search scope with a checkbox on the
> > > > UI. When the scope is wide, I will use the stems; when it's narrow
> > > > (or exact) I'll avoid using the stems.
> > > >
> > > > The approach I envisage is to index the fields twice. Once using the
> > > > StandardAnalyser and a second time using the Stemmer. I'll attach a
> > > > suffix to the name of the stemmed set in the index. So for example,
> > > > TITLE (contains only StandardAnalyser output) and TITLE_STEM
> > > > (contains the StemAnalyser output). When I come to generate the
> > > > query object, I will first check the search breadth on the UI. If
> > > > it's wide, I'll use the TITLE_STEM column, parsing the query with the
> > > > StemAnalyser; otherwise I'll use the TITLE column, with the query
> > > > being parsed with the StandardAnalyser.
> > > >
> > > > Although I appreciate it will result in a much larger index and
> > > > longer indexing time, this approach will allow me to implement the
> > > > required functionality.
> > > >
> > > > I just wanted to check with you guys that there is no better,
> > > > perhaps more efficient way of achieving my goals before taking the
> > > > above approach. All feedback / advice will be warmly received.
> > > >
> > > > Thanks guys!
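
For readers skimming the thread, the "two tokens at the same position" idea discussed above can be sketched roughly as follows. This is only an illustration in plain Java and deliberately does not use any Lucene classes: the class name DualTokenSketch, the Token holder, and the toyStem helper are all hypothetical, and toyStem is a crude stand-in for a real stemmer such as Porter or KStem. It follows the variant where the original word is stored with a $ ender and a position increment of 0, next to its stem.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not Lucene code: shows the shape of the token stream only.
class DualTokenSketch {
    // Ender appended to un-stemmed terms so they cannot collide with stems.
    static final String EXACT_MARKER = "$";

    // Minimal stand-in for a token with a position increment.
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Toy stemmer (assumption): strips a trailing "ing" or "s" only.
    static String toyStem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // For each word emit the stem (increment 1), then the marked original
    // at the same position (increment 0).
    static List<Token> analyze(String text) {
        List<Token> out = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            out.add(new Token(toyStem(raw), 1));
            out.add(new Token(raw + EXACT_MARKER, 0));
        }
        return out;
    }

    // Wide search looks up the stem; exact search looks up the marked original.
    static String queryTerm(String userWord, boolean exact) {
        String w = userWord.toLowerCase(Locale.ROOT);
        return exact ? w + EXACT_MARKER : toyStem(w);
    }

    // Single-term match against one document's indexed terms.
    static boolean matches(String docText, String userWord, boolean exact) {
        Set<String> terms = new HashSet<>();
        for (Token t : analyze(docText)) {
            terms.add(t.term);
        }
        return terms.contains(queryTerm(userWord, exact));
    }
}
```

With this sketch, a wide search for "testing" matches a document containing "test", while an exact search for "testing" only matches documents that contain the literal word. In a real index the same effect would need a custom token filter plus a query-side analyzer that does not strip the marker, as cautioned in the thread.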