Re: Clustering Carrot2 vs TermVector Analysis

Dawid Weiss Wed, 01 Jun 2005 01:23:57 -0700


Hi Andrew,


Coming up with an answer... sorry for the delay.

By using the carrot demo:http://www.newsarch.com/archive/mailinglist/jakarta/lucene/user/msg03928.html



I was able to easliy cluster search results based on the fields used
by carrot( url, title, and summary). However I was wondering if there
was a way to do something similar using term vector analysis and the
built in TermVector / Similarity api.


Yes, most clustering methods are based just on that (term-vector
matrix). Carrot also uses this internally, but builds its own data
structure from the provided data instead of relying on Lucene's. It
shouldn't be a problem to write a clustering plugin to Carrot that
actually uses the term-vector data from Lucene.

After doing a typical lucene search how can I get the  top 5 "key
terms" for each of the top ten documents.  I was thinking that I sum
these and then have a type of cluster.

The question is ill-defined, I'm afraid. "top 5 key terms" are verysubjecting and depend on the strategy of score calculation, the wayyou're pruning stop words, etc.

I also don't get the: "each of the top ten documents". Do you mean: eachof the ten top documents within a cluster?

D.

P.S. Please CC me directly; I read mails to newsgroups in batches everyfew days.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Clustering Carrot2 vs TermVector Analysis

Reply via email to