On Thursday 15 June 2006 13:50, Prasenjit Mukherjee wrote:
> I want to do some document clustering on a corpus of ~100,000
> documents, with an average doc size of ~7k. I have looked into Carrot2,
> but it seems to work only for relatively short documents and has some
> scaling issues for a large corpus. For a corpus of this size, one
> cannot use a purely memory-based clustering algorithm; hence the
> possible use of Lucene.
>
> I was thinking of using Lucene to create the similarity matrix between
> documents. Before adding a document (i.e. D-k) to the Lucene index, we
> can compute the similarity between D-k and all existing documents by
> creating a Query out of D-k and searching the existing index. We can
> take the score of each hit as the similarity between that document and
> D-k. The result is a symmetric, sparse matrix, which we can then feed
> to any similarity-based clustering algorithm.
>
> I would like to know if anyone has worked along similar lines and is
> happy to share their experiences.
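For concreteness, the query-per-document step described above could look
roughly like the minimal sketch below. The "contents" field name and the
use of the contrib MoreLikeThis class (rather than a hand-built query)
are my assumptions, not something the post specifies:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.similar.MoreLikeThis; // contrib/queries

public class SimilarityRow {

    // One row of the similarity matrix: the search score of every
    // existing document against the text of the new document D-k.
    // The field name "contents" is an assumption.
    public static float[] scoreAgainstIndex(IndexReader reader, String dkText)
            throws IOException {
        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setFieldNames(new String[] { "contents" });
        Query query = mlt.like(new StringReader(dkText));

        IndexSearcher searcher = new IndexSearcher(reader);
        // Documents that do not match keep a score of 0, so the
        // resulting row is sparse, as the post anticipates.
        float[] row = new float[reader.maxDoc()];
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            row[hits.id(i)] = hits.score(i);
        }
        searcher.close();
        return row;
    }
}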
Did you look into indexing a TermVector for each document? It is easy to
compute an element of a similarity matrix from two term vectors.

Regards,
Paul Elschot
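P.S. A minimal sketch of the term-vector route, against the Lucene
1.9/2.0 API. The field name "contents" and the use of raw term
frequencies (no idf weighting) are assumptions; adapt as needed:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class TermVectorSimilarity {

    // Cosine similarity between two indexed documents, computed from
    // their stored term vectors. Assumes both documents were indexed
    // with term vectors enabled on the (assumed) "contents" field, e.g.:
    //   doc.add(new Field("contents", text, Field.Store.NO,
    //                     Field.Index.TOKENIZED, Field.TermVector.YES));
    public static double cosine(IndexReader reader, int docA, int docB)
            throws IOException {
        TermFreqVector a = reader.getTermFreqVector(docA, "contents");
        TermFreqVector b = reader.getTermFreqVector(docB, "contents");
        if (a == null || b == null) {
            return 0.0; // no term vector stored for one of the documents
        }

        // Load A's raw term frequencies into a map for O(1) lookup.
        Map<String, Integer> freqsA = new HashMap<String, Integer>();
        String[] termsA = a.getTerms();
        int[] tfA = a.getTermFrequencies();
        double normA = 0.0;
        for (int i = 0; i < termsA.length; i++) {
            freqsA.put(termsA[i], new Integer(tfA[i]));
            normA += (double) tfA[i] * tfA[i];
        }

        // Accumulate the dot product and B's norm in one pass over B.
        String[] termsB = b.getTerms();
        int[] tfB = b.getTermFrequencies();
        double dot = 0.0, normB = 0.0;
        for (int i = 0; i < termsB.length; i++) {
            normB += (double) tfB[i] * tfB[i];
            Integer fA = freqsA.get(termsB[i]);
            if (fA != null) {
                dot += fA.intValue() * (double) tfB[i];
            }
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // one of the documents has no indexed terms
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}

Unlike search scores, this measure is symmetric by construction, and it
avoids running a query per document: each matrix element only needs the
two stored term vectors.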