Hi All, By using the carrot demo: http://www.newsarch.com/archive/mailinglist/jakarta/lucene/user/msg03928.html
I was able to easliy cluster search results based on the fields used by carrot( url, title, and summary). However I was wondering if there was a way to do something similar using term vector analysis and the built in TermVector / Similarity api. Please bear with me as I'm just learning about term vector analysis mostly from: http://www.miislita.com/term-vector/term-vector-1.html Where it discusses wi = tfi * IDFi I've ordered the book Information Retrieval: Algorithms and Heuristics but it has not shown up yet. Any way here is my question: After doing a typical lucene search how can I get the top 5 "key terms" for each of the top ten documents. I was thinking that I sum these and then have a type of cluster. When we do a search we have the query vector that we use to get the similarity used for ranking. So when we do a query the query terms are the "key terms". If we dont have a query vector is there a way to get the "key terms" from a document? Of course there if tf but every thing I'm reading says that tf is not ideal. So I guess my question boils down to how using the lucene api can I get the top 5 wi= tfi * IDFi of a given document. If you have any suggestions or if I'm off base I'd really appreciate the help. Thanks, Andrew --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]