I don't use the Lucene stemming Analyzers. My version, if asked to keep the
original tokens, sets the position of both stem and original to be the same,
and adds another character to the stem version.
At query time, that Analyzer is usually instructed not to keep the original
tokens, just the stems.
But as far as I know, it doesn't index the original term too (at the same
offset), which you have to do if you want to distinguish between the two
cases, I think.
But I confess I've been out of the guts of Lucene for some
time, so I could be way off.
But you'd sure want to use a different toke
Actually my stemming Analyzer adds a similar character to stems, to
distinguish original tokens (like orig=test) from stems (testing -->
test$).
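
To make the marker idea concrete, here is a minimal sketch in plain Java. It is not my actual Analyzer and deliberately does not use the Lucene API: the Token class, toyStem and analyze are hypothetical names invented for illustration, and the "stemmer" only strips a trailing "ing". It assumes the original token comes first and the marked stem is stacked on it with a position increment of 0.

```java
import java.util.ArrayList;
import java.util.List;

public class StemMarkerSketch {
    // A term plus its position increment (0 = same position as the previous token).
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    // Hypothetical toy stemmer: strips a trailing "ing" only. A real stemmer does far more.
    static String toyStem(String word) {
        return word.endsWith("ing") && word.length() > 5
                ? word.substring(0, word.length() - 3)
                : word;
    }

    // Always emits the marked stem (stem + "$"). When keepOriginal is true, the
    // original is emitted first and the marked stem shares its position.
    static List<Token> analyze(String text, boolean keepOriginal) {
        List<Token> out = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // leading whitespace produces an empty first element
            }
            String markedStem = toyStem(word) + "$";
            if (keepOriginal) {
                out.add(new Token(word, 1));
                out.add(new Token(markedStem, 0));
            } else {
                out.add(new Token(markedStem, 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Index side: keep originals, stems stacked at the same position.
        System.out.println(analyze("testing test", true));
        // Query side: stems only.
        System.out.println(analyze("testing", false));
    }
}
```

With the marker in place, the original "test" and the stem "test$" are different terms, so a query can target one or the other.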
On Wed, Jul 22, 2009 at 11:02 PM, Erick Erickson wrote:
> A closely related approach to what Shai outlined is to index the
> *original* token
> wit
A closely related approach to what Shai outlined is to index the *original*
token with a special ender (say $) with a 0 increment (see SynonymAnalyzer
in LIA). Then, whenever you determine you want to use the un-stemmed
version, just add your token to the terms (i.e. testing$ when you didn't
want
Hi Robert,
What you could do is use the Stemmer (as a TokenFilter I assume) and produce
two tokens always - the stem and the original. Index both of them in the
same position.
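
As a rough illustration of why the shared position matters, here is a small sketch in plain Java (no Lucene classes). All names in it (SamePositionSketch, toyStem, index, matchesPhrase) are hypothetical, and the stemmer is a toy that only strips a trailing "ing". It posts the original and its stem at the same position in a tiny positional index, then checks phrases by looking for consecutive positions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SamePositionSketch {
    // Hypothetical toy stemmer, for illustration only.
    static String toyStem(String word) {
        return word.endsWith("ing") && word.length() > 5
                ? word.substring(0, word.length() - 3)
                : word;
    }

    // Builds term -> positions, posting the original and its stem at the same position.
    static Map<String, List<Integer>> index(String text) {
        Map<String, List<Integer>> postings = new HashMap<>();
        int position = 0;
        for (String word : text.toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            post(postings, word, position);
            String stem = toyStem(word);
            if (!stem.equals(word)) {
                post(postings, stem, position); // same position, not position + 1
            }
            position++;
        }
        return postings;
    }

    static void post(Map<String, List<Integer>> postings, String term, int position) {
        postings.computeIfAbsent(term, k -> new ArrayList<>()).add(position);
    }

    // True if the terms occur at consecutive positions, in the given order.
    static boolean matchesPhrase(Map<String, List<Integer>> postings, String... terms) {
        if (terms.length == 0) {
            return false;
        }
        for (int start : postings.getOrDefault(terms[0], Collections.<Integer>emptyList())) {
            boolean ok = true;
            for (int i = 1; i < terms.length && ok; i++) {
                ok = postings.getOrDefault(terms[i], Collections.<Integer>emptyList())
                        .contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> postings = index("unit testing matters");
        System.out.println(matchesPhrase(postings, "unit", "testing", "matters"));
        System.out.println(matchesPhrase(postings, "unit", "test", "matters"));
    }
}
```

Because the stem does not consume a position of its own, a phrase matches through either form. Note that without a marker on one of the two tokens, a stem "test" and an original "test" end up as the same term in this sketch.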
Then tell your users that if they search for [testing], it will find results
for 'testing', 'test' etc (the stems) and if