Clustering Carrot2 vs TermVector Analysis

Andrew Boyd Mon, 30 May 2005 08:07:56 -0700

Hi All,
  By using the carrot demo:
http://www.newsarch.com/archive/mailinglist/jakarta/lucene/user/msg03928.html


 I was able to easliy cluster search results based on the fields used by 
carrot( url, title, and summary).  
However I was wondering if there was a way to do something similar using term 
vector analysis and the built in TermVector / Similarity api.

Please bear with me as I'm just learning about term vector analysis mostly from:
http://www.miislita.com/term-vector/term-vector-1.html

Where it discusses wi = tfi * IDFi

I've ordered the book Information Retrieval: Algorithms and Heuristics but it 
has not shown up yet.

Any way here is my question:

After doing a typical lucene search how can I get the  top 5 "key terms" for 
each of the top ten documents.  I was thinking that I sum these and then have a 
type of cluster.

When we do a search we have the query vector that we use to get the similarity 
used for ranking. So when we do a query the query terms are the "key terms".  
If we dont have a query vector is there a way to get the "key terms" from a 
document?  Of course there if tf but every thing I'm reading says that tf is 
not ideal.  So I guess my question boils down to 

     how using the lucene api can I get the top 5 wi= tfi * IDFi of a given 
document.

If you have any suggestions or if I'm off base I'd really appreciate the help.

Thanks,

Andrew

 



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Clustering Carrot2 vs TermVector Analysis

Reply via email to