Re: top n words within a results set?

Chris Brown Wed, 11 Jan 2006 07:50:06 -0800

Bear with me, I might be missing something... My documents get indexed (writer.addDocument(doc)) with one IndexWriter created using one Analyzer (the SnowballAnalyzer). So unless you can somehow use a different Analyzer per field, I don't see how the second field will help. If I get the TermFreqVector for a field for a document that was indexed using the SnowballAnalyzer, isn't it always going to return stemmed words?

To confirm your assumption, I suppose I am trying to display the values of the indexed field. It doesn't matter to me whether I count "party" and "parties" as separate words or not, but I cannot display "parti" to a user as it's not a word.

I'm thinking I need a separate index with the field created using the StandardAnalyzer, unless there's some other trick with mixing Analyzers I'm unaware of.


Thanks again for your help,
Chris

----- Original Message ----- From: "Grant Ingersoll" <[EMAIL PROTECTED]>
To: <java-user@lucene.apache.org>
Sent: Wednesday, January 11, 2006 8:32 AM
Subject: Re: top n words within a results set?

I believe the usual solution is to have a separate field on the same document for display purposes (I am assuming you are trying to display the values of the indexed field) that is not stemmed. The tradeoff is in disk space, of course.
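
To illustrate the two-field idea above, here is a minimal plain-Java sketch with no Lucene dependency. The class name, the method names and the `toyStem` helper are hypothetical; `toyStem` is only a crude stand-in that happens to produce "parti" and "pictur" and is not the Snowball algorithm. The point is just that the same text is kept twice: once stemmed for searching, once unstemmed for display.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class DisplayFieldSketch {
    /** Toy stand-in for a stemmer (assumption for illustration; NOT Snowball). */
    static String toyStem(String word) {
        if (word.endsWith("ies")) return word.substring(0, word.length() - 3) + "i";
        if (word.endsWith("y")) return word.substring(0, word.length() - 1) + "i";
        if (word.endsWith("e")) return word.substring(0, word.length() - 1);
        return word;
    }

    /** Lower-cases the text and splits it on anything that is not a-z. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Tokens as they would be stored in the stemmed field used for searching. */
    static List<String> searchField(String text) {
        List<String> stems = new ArrayList<>();
        for (String t : tokenize(text)) stems.add(toyStem(t));
        return stems;
    }

    /** Tokens as they would be stored in the unstemmed field used for display. */
    static List<String> displayField(String text) {
        return tokenize(text);
    }
}
```

Counting words from the display field yields terms that can be shown to a user, at the cost of storing the tokens a second time. In Lucene itself, I believe `PerFieldAnalyzerWrapper` is the class that lets a single IndexWriter apply different analyzers to different fields, but check the API documentation for your version.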
Chris Brown wrote:
Okay, I've taken Grant's advice and aggregated the TermFreqVectors for
each term in the applicable field. It works quite well, there's just one
glitch.
Some words like "party" and "picture" appear as "parti" and "pictur". I am
using the SnowballAnalyzer, I suspect that's what's changing the words.
Short of maintaining a second index using a different analyzer, does anyone
have any ideas?

----- Original Message ----- From: "Grant Ingersoll" <[EMAIL PROTECTED]>
To: <java-user@lucene.apache.org>
Sent: Monday, January 09, 2006 12:34 PM
Subject: Re: top n words within a results set?
You could use term vectors to accomplish this. Get your hits for the website, then load the term vector for the field containing the keywords and add up the frequencies.
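
The "add up the frequencies" step might look roughly like the sketch below. It is plain Java with no Lucene dependency; `TopTerms`, `accumulate` and `topN` are hypothetical names. It assumes each hit's term vector has already been loaded as two parallel arrays, which is, as far as I know, the shape returned by `TermFreqVector.getTerms()` and `getTermFrequencies()` in Lucene releases of this era.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TopTerms {
    /** Adds one document's term vector (parallel arrays) into the running totals. */
    static void accumulate(Map<String, Integer> totals, String[] terms, int[] freqs) {
        for (int i = 0; i < terms.length; i++) {
            totals.merge(terms[i], freqs[i], Integer::sum);
        }
    }

    /** Returns up to n terms, highest total first; ties are ordered alphabetically. */
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> totals, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(totals.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }
}
```

One would call `accumulate` once per hit, passing that document's terms and frequencies for the keyword field, and then `topN(totals, 10)` for the top ten words. The terms come back exactly as they were indexed, so a stemming analyzer will produce stemmed entries.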
Chris Brown wrote:
Hello,
Is it possible to retrieve the top 'n' most frequently appearing words within the results of a search? I've seen the High Frequency Terms code in the sandbox, but it works across the whole index.
To put this question into context: we're developing a website that hosts users' photo websites. Searches can be specific to a particular user's website or be performed globally across one, many or all websites. I've accomplished this with a field in the index called website. What I'd like to do is give each user the top ten words that appear on their website.
Thanks,
Chris Brown

http://www.orangepics.com/
--
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
337 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org
Voice: 315-443-5484
Fax: 315-443-6886

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]