Re: Stemmed terms/common terms

Alf Eaton Thu, 16 Aug 2007 09:16:05 -0700


On 16 Aug 2007, at 17:06, Grant Ingersoll wrote:

On Aug 16, 2007, at 10:17 AM, Alf Eaton wrote:
A couple of questions about term frequencies and stemming:
- What's the best way to get the most common unstemmed form of a Porter-stemmed word from the index? For example, given the stem 'walk', find that 'walking' is the most common full word in the index.
Are both in the index? I would think this is going to take some application-specific logic, since Lucene doesn't inherently track these relations. You might be able to string something together using some of the regular expression/wildcard queries, but it is going to take some work on your part.

Hmm, no - the stemmed token is indexed and the full field is stored. I guess that means running a search for the stem and then using the same logic as a highlighter to find and extract the actual terms from each document.
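
A rough sketch of that extraction step, assuming the stored field text of each matching document has already been fetched as a plain string. Nothing here touches the Lucene API: SurfaceFormCounter, mostCommonForm and toyStem are hypothetical names, and toyStem is a crude suffix stripper standing in for the real Porter stemmer, which would need to be the same one used at index time.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.function.UnaryOperator;

final class SurfaceFormCounter {

    // Placeholder stemmer, NOT the Porter algorithm: strips a few common
    // suffixes so the example is self-contained.
    static String toyStem(String word) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Re-tokenizes the stored text of each hit, keeps the tokens whose stem
    // equals the requested stem, and returns the most frequent original form.
    static Optional<String> mostCommonForm(List<String> storedTexts,
                                           String stem,
                                           UnaryOperator<String> stemmer) {
        Map<String, Integer> counts = new HashMap<>();
        for (String text : storedTexts) {
            // Simplistic tokenization: lowercase, split on non-alphanumerics.
            for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (!token.isEmpty() && stemmer.apply(token).equals(stem)) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }
}
```

The cost grows with the number of matching documents, since every stored field is re-analyzed, so this probably only makes sense for a sample of the hits or for rare stems.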

Another approach might be to put some mechanisms in place during analysis that track this information.

How would you recommend doing this - using positionIncrement to store the stem and the original word at the same position, perhaps?
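
To make the positionIncrement idea concrete, here is a sketch of the token layout it would produce. It models tokens as plain Java objects instead of using Lucene's analysis classes, so StackedTokens, Tok and stack are hypothetical names; in a real analyzer the same effect would come from a token filter that emits the stem with a position increment of 0 right after the original word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

final class StackedTokens {

    // Minimal stand-in for an analysis token: the term text plus its
    // position increment relative to the previous token.
    static final class Tok {
        final String term;
        final int positionIncrement;

        Tok(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Emits each original word with increment 1, followed by its stem with
    // increment 0 so both occupy the same position. The stem is skipped
    // when it is identical to the word.
    static List<Tok> stack(List<String> words, UnaryOperator<String> stemmer) {
        List<Tok> out = new ArrayList<>();
        for (String word : words) {
            out.add(new Tok(word, 1));
            String stem = stemmer.apply(word);
            if (!stem.equals(word)) {
                out.add(new Tok(stem, 0));
            }
        }
        return out;
    }
}
```

With this layout both 'walking' and 'walk' would be searchable, but as far as I can tell the index still would not record which original produced which stem, so a separate stem-to-word count kept during analysis would probably still be needed to answer the "most common form" question directly.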


alf.
