You're right. I want document clustering, specifically of the documents that are already in the index. I don't know much about the Mahout project, but it seems that it wouldn't help much. What I want is simply to group similar documents together according to the similarity distance between their term vectors.
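
To make that idea concrete, here is a minimal sketch in plain Python of grouping documents by the cosine similarity of their term vectors. It is not Lucene code and not a method anyone in this thread proposed. The whitespace tokenizer, the raw term-frequency weighting, the single-pass "leader" clustering, and the 0.5 threshold are all assumptions chosen for illustration. In practice the vectors would presumably come from the term vectors stored in the index.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a raw term-frequency vector (hypothetical stand-in for a
    term vector read from the index)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster(vectors, threshold=0.5):
    """Single-pass (leader) clustering. A document joins the first cluster
    whose leader (its first member) is at least `threshold` similar;
    otherwise it starts a new cluster. Returns lists of document indices."""
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine_similarity(vectors[members[0]], vec) >= threshold:
                members.append(i)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append([i])
    return clusters


# Toy documents, invented for this example.
docs = [
    "lucene index search query",
    "lucene search index term",
    "apple banana fruit salad",
    "banana fruit smoothie apple",
]
vectors = [term_vector(d) for d in docs]
groups = cluster(vectors, threshold=0.5)
print(groups)
```

The result depends on document order and on the threshold, so this is only a starting point. A real setup would likely use TF-IDF weights and a proper clustering algorithm over the full similarity matrix.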
Anyway, thank you Otis and Grant for your suggestions. I appreciate them.

Regards,
Supheakmungkol

----- Original Message ----
From: Grant Ingersoll <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Friday, May 16, 2008 7:22:39 PM
Subject: Re: Document clustering with Lucene

Do you want search result clustering or document clustering? My understanding of Carrot2 is that it isn't designed for the latter. The difference is that it is designed to work off shorter snippets of text, as opposed to the whole document.

FWIW, you _might_ find some help over on the Mahout project (http://lucene.apache.org/mahout) in terms of different approaches. We have a couple of clustering approaches implemented there, but they just work off a matrix and it is up to you to fill the matrix (presumably with some distance calculation).

The Carrot2 list may also help clarify if it is the appropriate place.

-Grant

On May 15, 2008, at 4:34 PM, Otis Gospodnetic wrote:

> Have you tried using Carrot2 with Lucene? They work quite well in tandem!
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
>> From: Supheakmungkol SARIN <[EMAIL PROTECTED]>
>> To: java-user@lucene.apache.org
>> Sent: Wednesday, May 14, 2008 11:23:45 PM
>> Subject: Document clustering with Lucene
>>
>> Dear all,
>>
>> I'd like to do document clustering using full text with Lucene. In other words, I would like to group similar documents into their respective groups. I searched the mailing list and found that there are two ways to go about it. The first method is to represent one document as a query and search the collection. The other way would be to construct the vector of terms for each of the documents and use the cosine distance function to compute the similarity. I found these methods here:
>>
>> - http://www.mail-archive.com/[EMAIL PROTECTED]/msg04916.html
>>
>> I would like to know whether there is a better way, or any built-in functions to do clustering in the recent release version of Lucene?
>>
>> Thank you.
>>
>> Kind regards,
>>
>> Supheakmungkol
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ