Stemming - limited index expansion

Paul Hill Tue, 12 Jun 2012 12:07:58 -0700

As others have previously proposed on this list, I am interested in inserting 
a second token at some positions in my index.  I'll call this Limited Index 
Expansion.
I want to retain the original token, so that I can score an original word that 
matches in a text better than just any synonym/stem etc.  Maybe I'll even do 
this with payloads (on the 2nd token?).
If I didn't keep the original word, all I would be doing is a limited 
index-time "reduction".  Saving the original word and sometimes a lemma/stem 
(or something else) means I anticipate at most two tokens at a position in the 
index.
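
To make the idea concrete, here is a rough plain-Java sketch of the token 
stream I have in mind.  It does not use the Lucene API at all: the class name, 
the Token type, the "original" flag (standing in for a payload or similar 
marker), and the toy stemmer are all made up for illustration.  In a real 
Lucene TokenFilter the second token would be the one emitted with a position 
increment of 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

/** Plain-Java model of "limited index expansion"; NOT Lucene code. */
class LimitedExpansionSketch {

    /** One indexed token: its text, its gap from the previous position, and its origin. */
    static final class Token {
        final String text;
        final int positionIncrement; // 1 = next position, 0 = same position as previous token
        final boolean original;      // hypothetical marker; a payload could carry this instead

        Token(String text, int positionIncrement, boolean original) {
            this.text = text;
            this.positionIncrement = positionIncrement;
            this.original = original;
        }
    }

    /**
     * Emits every input word unchanged, then at most one extra token at the
     * same position when the stemmer returns a form that differs from the word.
     */
    static List<Token> expand(List<String> words,
                              Function<String, Optional<String>> stemmer) {
        List<Token> out = new ArrayList<>();
        for (String word : words) {
            out.add(new Token(word, 1, true));
            Optional<String> stem = stemmer.apply(word);
            if (stem.isPresent() && !stem.get().equals(word)) {
                out.add(new Token(stem.get(), 0, false));
            }
        }
        return out;
    }

    /** Toy stemmer (assumption, for illustration only): strips a trailing "ing" or "s". */
    static Optional<String> toyStem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            return Optional.of(word.substring(0, word.length() - 3));
        }
        if (word.length() > 3 && word.endsWith("s")) {
            return Optional.of(word.substring(0, word.length() - 1));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("walking", "dogs", "run");
        for (Token t : expand(words, LimitedExpansionSketch::toyStem)) {
            System.out.println(t.text + " +" + t.positionIncrement
                    + (t.original ? " (original)" : " (stem)"));
        }
    }
}
```

The stemmer is passed in as a function so that the "where does the second 
token come from" part stays pluggable, which is the piece I could not find a 
hook for in the existing filters.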


I couldn't find a nearly-right high-level Filter that I could use to add logic 
to call a stemmer and conditionally add another token.  Any suggestions?
One idea I had is that adding a second token is much like what a SynonymFilter 
does, but yikes: I was starting to grok PendingInputs and PendingOutputs, but 
wasn't getting very far reading through SynonymMap and its BytesRefHash etc.  
Obviously it is written to be very good with memory and very fast, but it looks 
a bit tricky to extend for other sources of "synonyms".  It is too bad that the 
"get synonym" part of the operation is not encapsulated in something pluggable 
or overridable, so that I could just return an appropriate array of CharsRefs.  
The SynonymFilter is final anyway.

Can anyone point me toward any existing high-level filter that I could use by 
sub-classing, modifying, plugging, or just as a good example to which I might 
add my additional code to add another token?
Building Filters is new to me, but right now nothing is jumping out at me as a 
basis for such a Filter.  Any suggestions?  Did I miss something in core or 
contrib?
Is there some other combination of buffering, copying, sinking etc. filters 
that I'm missing and should use to build a filter chain that would aid this 
process?

-Paul
