Responses inline prefixed with **** -----Original Message----- From: Dawid Weiss <[EMAIL PROTECTED]> Sent: Jun 1, 2005 3:24 AM To: java-user@lucene.apache.org Subject: Re: Clustering Carrot2 vs TermVector Analysis
Hi Andrew, Coming up with an answer... sorry for the delay. > By using the carrot demo: > http://www.newsarch.com/archive/mailinglist/jakarta/lucene/user/msg03928.html > > > I was able to easliy cluster search results based on the fields used > by carrot( url, title, and summary). However I was wondering if there > was a way to do something similar using term vector analysis and the > built in TermVector / Similarity api. Yes, most clustering methods are based just on that (term-vector matrix). Carrot also uses this internally, but builds its own data structure from the provided data instead of relying on Lucene's. It shouldn't be a problem to write a clustering plugin to Carrot that actually uses the term-vector data from Lucene. > After doing a typical lucene search how can I get the top 5 "key > terms" for each of the top ten documents. I was thinking that I sum > these and then have a type of cluster. The question is ill-defined, I'm afraid. "top 5 key terms" are very subjecting and depend on the strategy of score calculation, the way you're pruning stop words, etc. **** What I meant was the top 5 terms of a document where top is based on wi = tfi * IDFi I also don't get the: "each of the top ten documents". Do you mean: each of the ten top documents within a cluster? **** with the carrot demo you used the top 100 documents returned from the query. I really meant the top n documents from the query. **** I hope I'm more clear now. Thanks for the response. Andrew D. P.S. Please CC me directly; I read mails to newsgroups in batches every few days. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]