Thanks, Grant. I'll take a look at Solr's faceting. A colleague of mine also discovered Solr's clustering component - http://wiki.apache.org/solr/ClusteringComponent. It's still labeled as experimental - does anybody have experience with it?
Another option (pointed out by your post: http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/) is Mahout with its baked-in clustering algorithms. Has anybody gotten good mileage out of this approach?

Thanks again,
-nik

-----Original Message-----
From: Grant Ingersoll [mailto:gsing...@apache.org]
Sent: Tuesday, August 17, 2010 7:57 AM
To: java-user@lucene.apache.org
Subject: Re: cluster documents based on fields' values

Hi Nik,

Inline below.

On Aug 15, 2010, at 5:01 PM, Nik Kolev wrote:

> Hi,
>
> I am researching the possibility of using Lucene for discovering
> clusters of documents, and since I am new to Lucene I decided to
> ask the community for advice before I poke at the APIs and the internals.
> Your input will be invaluable!
>
> Here's the use case. Documents arrive from different feeds. Each feed
> produces millions and millions of documents. Documents are structured
> and share certain "interesting" fields. For the purpose of illustration,
> here's a trivial example. Let's say the documents on each feed represent
> shoes. A shoe has an ID (uniquely identifying it within its feed) and a
> SIZE [1]. I want to be able to ask Lucene (assuming the shoe documents
> are indexed, of course) for all of the clusters of shoes that sport the same
> size. A cluster of shoes is just the IDs of the shoes that got grouped
> together due to having the same value of the SIZE field.
>
> I don't think that doing this by brute force will perform. Here's what I
> mean when I say brute force. I looked at IndexReader, and I saw that
> I can get the distinct values for an indexed field. So assuming that
> each feed will have its own index (I don't think I can get away with
> a single index for all feeds [2]), I can get the union of all distinct
> values for the interesting field across all indexes (one per
> document feed).
> Then for each distinct value I can do a MultiSearcher
> search across these indexes, getting the IDs of the documents.
>
> My gut tells me that the brute force approach won't perform [3]. And
> here's where you guys come in - is it possible to ask Lucene to give
> me the groups of records (across indexes) that share the value for a
> given field? Is there something else in the API that I can take advantage
> of and get what I need faster? If I am to extend Lucene to allow for
> this sort of thing, where should I start, what documents should I
> read,...?

This feature, called faceting, comes out of the box in Solr, amongst other things (http://lucene.apache.org/solr).

>
> Thanks very much again for your advice,
> -nik
>
> [1] Shoe document layout:
>     SHOE
>      |--ID
>      |--SIZE
>
> [2] The volume of documents I get from each feed is huge. Also, even
>     though a lot of fields are shared, the format across feeds varies
>     and a feed may not have some of the "interesting" fields of
>     another feed.
>
> [3] The dataset I am talking about is rich, and the brute force approach
>     means that I need to be doing tens of millions of searches just to
>     group on one field. Also, I most likely will blow my heap up if I try to
>     load all of the values in memory all at once.
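
To make the grouping in question concrete, here is a minimal sketch in plain Java. It makes no Lucene or Solr calls and is not how Solr implements faceting; it only illustrates the idea of visiting each document once and bucketing its ID under its SIZE value, instead of running one search per distinct value. The names ShoeClusters, Shoe and groupBySize are hypothetical, and the sketch assumes IDs have already been made unique across feeds (for example by prefixing them with a feed name).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShoeClusters {

    /** Hypothetical stand-in for an indexed shoe document: an ID plus a SIZE value. */
    static final class Shoe {
        final String id;
        final String size; // null when the feed does not carry a SIZE field

        Shoe(String id, String size) {
            this.id = id;
            this.size = size;
        }
    }

    /**
     * Groups shoe IDs by SIZE in a single pass over all feeds.
     * Shoes without a SIZE are skipped, since a feed may lack the field.
     */
    static Map<String, List<String>> groupBySize(List<List<Shoe>> feeds) {
        // TreeMap keeps the SIZE values sorted as strings.
        Map<String, List<String>> clusters = new TreeMap<>();
        for (List<Shoe> feed : feeds) {
            for (Shoe shoe : feed) {
                if (shoe.size == null) {
                    continue;
                }
                // Create the bucket for this SIZE on first sight, then add the ID.
                clusters.computeIfAbsent(shoe.size, k -> new ArrayList<>()).add(shoe.id);
            }
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<Shoe> feedA = Arrays.asList(new Shoe("a1", "9"), new Shoe("a2", "10"));
        List<Shoe> feedB = Arrays.asList(new Shoe("b1", "9"), new Shoe("b2", null));
        System.out.println(groupBySize(Arrays.asList(feedA, feedB)));
    }
}
```

Note that this in-memory map runs into exactly the heap concern raised in [3] once the feeds hold tens of millions of documents, so it is meant only to show the shape of the result, not as a replacement for an index-side feature.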
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search