On 16 Aug 2007, at 17:06, Grant Ingersoll wrote:


On Aug 16, 2007, at 10:17 AM, Alf Eaton wrote:

A couple of questions about term frequencies and stemming:

- What's the best way to get the most common unstemmed form of a Porter-stemmed word from the index? For example, given the stem 'walk', find that 'walking' is the most common full word in the index.

Are both in the index? I would think this is going to take some application-specific logic, since Lucene doesn't inherently track these relations. You might be able to string something together using some of the regular expression/wildcard queries, but it is going to take some work on your part.

Hmm, no - the stemmed token is indexed and the full field is stored. I guess that means running a search for the stem and then using the same logic as a highlighter to find and extract the actual terms from each document.

Another approach might be to put some mechanisms in place during analysis that track this information.
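
For illustration, here is a rough sketch of what tracking this during analysis could look like. It is plain Python rather than Lucene code: `toy_stem` is a made-up stand-in for a Porter stemmer, and `SurfaceFormTracker` is a hypothetical helper, not part of any library. The idea is just to keep a side table of stem -> surface-form counts while the stems go on to the index.

```python
import re
from collections import Counter, defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a Porter stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


class SurfaceFormTracker:
    """Counts which original words produced each stem during analysis."""

    def __init__(self, stem=toy_stem):
        self.stem = stem
        self.forms = defaultdict(Counter)  # stem -> Counter of original words

    def analyze(self, text):
        """Return the stems to index, recording each original word on the side."""
        stems = []
        for word in re.findall(r"[a-z]+", text.lower()):
            stemmed = self.stem(word)
            self.forms[stemmed][word] += 1
            stems.append(stemmed)
        return stems

    def most_common_form(self, stem):
        """Return the most frequent original word seen for this stem, or None."""
        counts = self.forms.get(stem)
        if not counts:
            return None
        return counts.most_common(1)[0][0]


tracker = SurfaceFormTracker()
tracker.analyze("She was walking while he walked; walking helps. Walks too.")
print(tracker.most_common_form("walk"))
```

The side table would need to be persisted next to the index and kept in step with document deletes and updates, which is where the application-specific work comes in.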

How would you recommend doing this - using positionIncrement to store the stem and the original word at the same position, perhaps?
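
To make the question concrete, here is a small sketch of the same-position idea. Again this is plain Python, not Lucene code: `toy_stem` is a made-up stand-in for a Porter stemmer and the function names are hypothetical. Each original word is emitted with a position increment of 1, and its stem follows with an increment of 0 so both land on the same position.

```python
import re


def toy_stem(word):
    """Hypothetical stand-in for a Porter stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokens_with_stems(text):
    """Yield (term, position_increment) pairs: original word, then its stem."""
    for word in re.findall(r"[a-z]+", text.lower()):
        yield word, 1  # the original word advances the position
        stem = toy_stem(word)
        if stem != word:
            yield stem, 0  # the stem is stacked on the same position


def absolute_positions(tokens):
    """Turn position increments into absolute positions, starting at 0."""
    position = -1
    placed = []
    for term, increment in tokens:
        position += increment
        placed.append((term, position))
    return placed


print(absolute_positions(tokens_with_stems("Walking dogs walk")))
```

One thing to watch with this layout: stems and original words end up in the same term dictionary, so nothing marks which terms are stems. Putting the original words in a separate field, or prefixing one of the two kinds of term, might be needed to tell them apart.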

alf.
