I want to do some document clustering on a corpus of ~100,000
documents, with an average document size of ~7k. I have looked into Carrot2,
but it seems to work well only for relatively short documents and has some
scaling issues with large corpora. At this corpus size one certainly cannot
use a purely in-memory clustering algorithm, hence the possible use of
Lucene.
I was thinking of using Lucene to build the similarity matrix between
documents. Before adding a document D_k to the Lucene index, we can
compute the similarity between D_k and all previously indexed documents
by turning D_k into a Query and running a search against the existing
index. The score of each hit serves as the similarity measure between
that document and D_k. Since each pair is scored only once (when the
later of the two documents is indexed), we can mirror the scores to get
a sparse, symmetric matrix, and feed it to any similarity-based
clustering algorithm.
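For what it's worth, here is a minimal sketch of that per-document step
in Java, assuming a recent Lucene (9.x) with MoreLikeThis from the
lucene-queries module. The "body" and "id" field names, the top-100
neighbour cutoff, and the MLT tuning values are my own placeholders,
and mirroring the scores to force symmetry is an assumption on my part,
since raw Lucene scores are not symmetric:

import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SimilarityMatrixBuilder {
    // sparse similarity matrix: doc k -> (doc j -> score)
    private final Map<Integer, Map<Integer, Float>> matrix = new HashMap<>();

    public void build(String[] corpus) throws IOException {
        Analyzer analyzer = new StandardAnalyzer();
        Directory dir = FSDirectory.open(Paths.get("index"));
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));

        for (int k = 0; k < corpus.length; k++) {
            if (k > 0) {
                // query the index of documents 0..k-1 with a query built from D_k
                try (DirectoryReader reader = DirectoryReader.open(writer)) {
                    IndexSearcher searcher = new IndexSearcher(reader);
                    MoreLikeThis mlt = new MoreLikeThis(reader);
                    mlt.setAnalyzer(analyzer);
                    mlt.setFieldNames(new String[] {"body"});
                    mlt.setMinTermFreq(2);  // placeholder tuning for ~7k docs
                    mlt.setMinDocFreq(2);
                    Query query = mlt.like("body", new StringReader(corpus[k]));
                    // keep only the top-100 neighbours, so the matrix stays sparse
                    TopDocs hits = searcher.search(query, 100);
                    for (ScoreDoc sd : hits.scoreDocs) {
                        int j = Integer.parseInt(searcher.doc(sd.doc).get("id"));
                        matrix.computeIfAbsent(k, x -> new HashMap<>()).put(j, sd.score);
                        // mirror the entry to make the matrix symmetric (assumption)
                        matrix.computeIfAbsent(j, x -> new HashMap<>()).put(k, sd.score);
                    }
                }
            }
            Document doc = new Document();
            doc.add(new StoredField("id", Integer.toString(k)));
            doc.add(new TextField("body", corpus[k], Field.Store.NO));
            writer.addDocument(doc);
        }
        writer.close();
    }
}

Note that reopening a near-real-time reader for every document will be
expensive at 100k documents; batching the commits, or indexing the whole
corpus first and then running one MoreLikeThis query per document, would
probably be cheaper and yields the same mirrored matrix.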
I'd like to know if anyone has worked along similar lines and is happy
to share their experience.
thanks,
Prasen