On Thursday 15 June 2006 13:50, Prasenjit Mukherjee wrote:
> I want to do some document clustering on a corpus of ~100,000
> documents, with an average doc size of ~7k. I have looked into Carrot2,
> but it seems to work only for relatively short documents and has some
> scaling issues for a large corpus. For a corpus of this size, one
> cannot use a purely memory-based clustering algorithm; hence the
> possible use of Lucene.
>
> I was thinking of using Lucene to create the similarity matrix between
> documents. Before adding a document (i.e. D-k) to the Lucene index, we
> can compute the similarity between D-k and all existing documents by
> creating a Query out of D-k and searching the existing index. We can
> take the score of each hit as the similarity between that document and
> D-k. The result is a symmetric, sparse matrix, which we can then feed
> to any similarity-based clustering algorithm.
>
> I would like to know if anyone has worked along similar lines and is
> happy to share their experiences.
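For concreteness, the query-per-document step described above could look
roughly like the minimal sketch below. The "contents" field name and the
use of the contrib MoreLikeThis class (rather than a hand-built query)
are my assumptions, not something the post specifies:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.similar.MoreLikeThis; // contrib/queries

public class SimilarityRow {

    // One row of the similarity matrix: the search score of every
    // existing document against the text of the new document D-k.
    // The field name "contents" is an assumption.
    public static float[] scoreAgainstIndex(IndexReader reader, String dkText)
            throws IOException {
        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setFieldNames(new String[] { "contents" });
        Query query = mlt.like(new StringReader(dkText));

        IndexSearcher searcher = new IndexSearcher(reader);
        // Documents that do not match keep a score of 0, so the
        // resulting row is sparse, as the post anticipates.
        float[] row = new float[reader.maxDoc()];
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            row[hits.id(i)] = hits.score(i);
        }
        searcher.close();
        return row;
    }
}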
Did you look into indexing a TermVector for each document? It is easy to
compute an element of a similarity matrix from two term vectors.

Regards,
Paul Elschot
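P.S. A minimal sketch of the term-vector route, against the Lucene
1.9/2.0 API. The field name "contents" and the use of raw term
frequencies (no idf weighting) are assumptions; adapt as needed:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class TermVectorSimilarity {

    // Cosine similarity between two indexed documents, computed from
    // their stored term vectors. Assumes both documents were indexed
    // with term vectors enabled on the (assumed) "contents" field, e.g.:
    //   doc.add(new Field("contents", text, Field.Store.NO,
    //                     Field.Index.TOKENIZED, Field.TermVector.YES));
    public static double cosine(IndexReader reader, int docA, int docB)
            throws IOException {
        TermFreqVector a = reader.getTermFreqVector(docA, "contents");
        TermFreqVector b = reader.getTermFreqVector(docB, "contents");
        if (a == null || b == null) {
            return 0.0; // no term vector stored for one of the documents
        }

        // Load A's raw term frequencies into a map for O(1) lookup.
        Map<String, Integer> freqsA = new HashMap<String, Integer>();
        String[] termsA = a.getTerms();
        int[] tfA = a.getTermFrequencies();
        double normA = 0.0;
        for (int i = 0; i < termsA.length; i++) {
            freqsA.put(termsA[i], new Integer(tfA[i]));
            normA += (double) tfA[i] * tfA[i];
        }

        // Accumulate the dot product and B's norm in one pass over B.
        String[] termsB = b.getTerms();
        int[] tfB = b.getTermFrequencies();
        double dot = 0.0, normB = 0.0;
        for (int i = 0; i < termsB.length; i++) {
            normB += (double) tfB[i] * tfB[i];
            Integer fA = freqsA.get(termsB[i]);
            if (fA != null) {
                dot += fA.intValue() * (double) tfB[i];
            }
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // one of the documents has no indexed terms
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}

Unlike search scores, this measure is symmetric by construction, and it
avoids running a query per document: each matrix element only needs the
two stored term vectors.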