I want to do some document clustering on a corpus of ~100,000
documents, with an average document size of ~7k. I have looked into Carrot2,
but it seems to work well only for relatively short documents and has some
scaling issues with large corpora. At this corpus size one certainly cannot
use a purely in-memory clustering algorithm, hence the possible use of
Lucene.
I was thinking of using Lucene to build the similarity matrix between
documents. Before adding a document D_k to the Lucene index, we can
compute the similarity between D_k and all previously indexed documents
by turning D_k into a Query and running a search against the existing
index. The score of each hit serves as the similarity measure between
that document and D_k. Since each pair is scored only once (when the
later of the two documents is indexed), we can mirror the scores to get
a sparse, symmetric matrix, and feed it to any similarity-based
clustering algorithm.
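For what it's worth, here is a minimal sketch of that per-document step
in Java, assuming a recent Lucene (9.x) with MoreLikeThis from the
lucene-queries module. The "body" and "id" field names, the top-100
neighbour cutoff, and the MLT tuning values are my own placeholders,
and mirroring the scores to force symmetry is an assumption on my part,
since raw Lucene scores are not symmetric:

import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SimilarityMatrixBuilder {
    // sparse similarity matrix: doc k -> (doc j -> score)
    private final Map<Integer, Map<Integer, Float>> matrix = new HashMap<>();

    public void build(String[] corpus) throws IOException {
        Analyzer analyzer = new StandardAnalyzer();
        Directory dir = FSDirectory.open(Paths.get("index"));
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));

        for (int k = 0; k < corpus.length; k++) {
            if (k > 0) {
                // query the index of documents 0..k-1 with a query built from D_k
                try (DirectoryReader reader = DirectoryReader.open(writer)) {
                    IndexSearcher searcher = new IndexSearcher(reader);
                    MoreLikeThis mlt = new MoreLikeThis(reader);
                    mlt.setAnalyzer(analyzer);
                    mlt.setFieldNames(new String[] {"body"});
                    mlt.setMinTermFreq(2);  // placeholder tuning for ~7k docs
                    mlt.setMinDocFreq(2);
                    Query query = mlt.like("body", new StringReader(corpus[k]));
                    // keep only the top-100 neighbours, so the matrix stays sparse
                    TopDocs hits = searcher.search(query, 100);
                    for (ScoreDoc sd : hits.scoreDocs) {
                        int j = Integer.parseInt(searcher.doc(sd.doc).get("id"));
                        matrix.computeIfAbsent(k, x -> new HashMap<>()).put(j, sd.score);
                        // mirror the entry to make the matrix symmetric (assumption)
                        matrix.computeIfAbsent(j, x -> new HashMap<>()).put(k, sd.score);
                    }
                }
            }
            Document doc = new Document();
            doc.add(new StoredField("id", Integer.toString(k)));
            doc.add(new TextField("body", corpus[k], Field.Store.NO));
            writer.addDocument(doc);
        }
        writer.close();
    }
}

Note that reopening a near-real-time reader for every document will be
expensive at 100k documents; batching the commits, or indexing the whole
corpus first and then running one MoreLikeThis query per document, would
probably be cheaper and yields the same mirrored matrix.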
I'd like to know if anyone has worked along similar lines and is happy
to share their experience.
thanks,
Prasen