I want to do some document clustering on a corpus of ~ 100,000 documents, with average doc size being ~ 7k. I have looked into carrot2 but it seems to work only for relatively short documents and has soem scalign issues for large corpus. Certainly for these kind of corpus size, one cannot use a pure memory based clustering algorithm. Hence the possible use of lucene.

I was thinking of using lucene to create the similarity matrix (between documents). Before adding a document (i.e. D-k) to the lucene index, we can compute the document similarity between D-k with all other existing documents by creating a Query out of D-k and doing a search on the existing index. We can take the score of each document as similarity measure between the document and D-k. It is going to be a symmetric and parse matrix. Now we can use this similarity matrix and feed it to any similarity based clustering algorithm.

Would like to know if anyone has worked along similar lines, and are happy to share their experiences.

thanks,
Prasen



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to