Re: Stemming - limited index expansion

Jack Krupansky Tue, 12 Jun 2012 13:14:57 -0700

I don't completely follow precisely what you want to do, but the WordDelimiterFilter is an example of a token filter that outputs an extra token at the same position, such as with its CATENATE_ALL/WORDS/NUMBERS options.


https://builds.apache.org/job/Lucene-trunk/javadoc/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterFilter.html

For example, given the input "wi-fi", it would output "wi" with position 0, "fi" with position 1, and "wifi" also with position 0.

Or, with its PRESERVE_ORIGINAL option, that same input would output "wi" at 0, "fi" at 1, and "wi-fi" at 0.
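
To make the position bookkeeping concrete, here is a minimal plain-Java sketch. It is a model only and does not use the Lucene API: the class PositionIncrementDemo and its Tok type are hypothetical names made up for illustration. It converts absolute positions like the ones above into position increments, where an increment of 0 means "stacked on the previous token". Since increments cannot be negative, the model sorts tokens into non-decreasing position order before emitting them.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Plain-Java model of token positions; NOT the Lucene API.
// All names here are hypothetical and exist only for illustration.
final class PositionIncrementDemo {

    /** A term together with its absolute position in the token stream. */
    static final class Tok {
        final String term;
        final int position;

        Tok(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    /**
     * Orders tokens by position and renders each as "term(+increment)",
     * where the increment is the distance from the previous token's position.
     * The first token is measured from a conceptual position of -1.
     */
    static List<String> toIncrements(List<Tok> tokens) {
        List<Tok> sorted = new ArrayList<>(tokens);
        // List.sort is stable, so tokens sharing a position keep their input order.
        sorted.sort(Comparator.comparingInt(t -> t.position));

        List<String> out = new ArrayList<>();
        int previous = -1;
        for (Tok t : sorted) {
            out.add(t.term + "(+" + (t.position - previous) + ")");
            previous = t.position;
        }
        return out;
    }
}
```

Feeding it "wi" at 0, "fi" at 1, and "wifi" at 0 gives wi(+1), wifi(+0), fi(+1): the catenated token shares position 0 with "wi".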

That said, maybe you could clarify your specific intent with an example. Maybe you simply want to internally call some existing stemmer filter and output both the original and stemmed term at the same location?
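
If that is the intent, the core logic might look roughly like the plain-Java sketch below. This is a model under stated assumptions, not a Lucene TokenFilter: StemExpansionSketch, expand, and toyStem are hypothetical names, and the toy stemmer merely strips a trailing "s" as a stand-in for a real stemmer. Each original term is emitted with increment 1, and a stem is stacked at the same position (increment 0) only when it differs from the original, so there are at most two tokens per position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Plain-Java model of "limited index expansion"; NOT a Lucene TokenFilter.
// All names here are hypothetical and exist only for illustration.
final class StemExpansionSketch {

    /**
     * Emits each original term as "term(+1)" and, when the stemmer yields
     * a different form, the stem as "stem(+0)" at the same position.
     */
    static List<String> expand(List<String> terms, UnaryOperator<String> stemmer) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            out.add(term + "(+1)"); // original token advances the position
            String stem = stemmer.apply(term);
            if (stem != null && !stem.equals(term)) {
                out.add(stem + "(+0)"); // extra token stacked on the original
            }
        }
        return out;
    }

    /** Toy stand-in for a real stemmer: drops a trailing "s" from longer words. */
    static String toyStem(String term) {
        if (term.length() > 3 && term.endsWith("s")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }
}
```

In an actual filter the same idea would likely be expressed by capturing the state of the current token, returning the original first, and then restoring the state with the stemmed term and a position increment of 0 on the next call.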


-- Jack Krupansky

-----Original Message-----
From: Paul Hill

Sent: Tuesday, June 12, 2012 3:07 PM
To: java-user@lucene.apache.org
Subject: Stemming - limited index expansion

As others have previously proposed on this list, I am interested in inserting a second token at some positions in my index. I'll call this Limited Index Expansion.
I want to retain the original token, so that I can score an original word that matches in a text better than just any synonym/stem etc. Maybe I'll even do this with payloads (on the 2nd token?).
If I didn't keep the original word, all I would be doing is a limited index-time "reduction". Saving the original word and sometimes a lemma/stem (or something else) means I anticipate at most two tokens at a position in the index.

I couldn't find a nearly-right high-level Filter that I could use to add logic to call a stemmer and conditionally add another token. Any suggestions?
One idea I had is that adding a second token is much like what a SynonymFilter does, but yikes, I was starting to grok PendingInputs and PendingOutputs, but wasn't getting very far reading through SynonymMap and its BytesRefHash etc. Obviously it is written to be very good with memory and very fast, but it looks a bit tricky to extend for other sources of "synonyms". It is too bad that the "get synonym" part of the operation is not encapsulated in something pluggable or overridable, so I could just return an appropriate array of CharsRefs. The SynonymFilter is final anyway.

Can anyone point me toward any existing high-level filter that I could use by sub-classing, modifying, plugging, or just as a good example to which I might add my additional code to add another token?
Building Filters is new to me, but right now nothing is jumping out at me as a basis for such a Filter. Any suggestions? Did I miss something in core or contrib?
Is there some other combination of buffering, copying, sinking etc. filters that I'm missing and should use to build a filter chain that would aid this process?

-Paul


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
