Hi, I am researching the possibility of using Lucene for discovering clusters of documents, and since I am new to Lucene I decided to ask the community for advice before I poke at the APIs and the internals. Your input will be invaluable!

Here's the use case. Documents arrive from different feeds, and each feed produces millions and millions of documents. The documents are structured and share certain "interesting" fields. For the purpose of illustration, here is a trivial example. Let's say the documents on each feed represent shoes. A shoe has an ID (uniquely identifying it within its feed) and a SIZE [1]. I want to be able to ask Lucene (assuming the shoe documents are indexed, of course) for all of the clusters of shoes that have the same size. A cluster of shoes is just the IDs of the shoes that got grouped together because they share the same value of the SIZE field.

I don't think doing this brute force will perform. Here is what I mean by brute force (a rough sketch in code is at the bottom of this mail). Looking at IndexReader, I saw that I can enumerate the distinct values of an indexed field. Assuming that each feed has its own index (I don't think I can get away with a single index for all feeds [2]), I can take the union of the distinct values of the interesting field across all indexes (one per document feed). Then, for each distinct value, I can run a MultiSearcher search across those indexes to collect the IDs of the matching documents. My gut tells me that this brute-force approach won't perform [3].

And here is where you come in: is it possible to ask Lucene to give me the groups of documents (across indexes) that share the value of a given field? Is there something else in the API that I can take advantage of to get what I need faster? If I were to extend Lucene to allow for this sort of thing, where should I start, and what documentation should I read?

Thanks very much again for your advice,
-nik

[1] Shoe document layout:

    SHOE
     |-- ID
     |-- SIZE

[2] The volume of documents I get from each feed is huge. Also, even though a lot of fields are shared, the format varies across feeds, and a feed may lack some of the "interesting" fields of another feed.

[3] The dataset I am talking about is rich, and the brute-force approach means I would be doing tens of millions of searches just to group on one field. Also, I would most likely blow the heap if I tried to load all of the values into memory at once.
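
Sketch of the brute-force walk, to make the idea concrete. This is a rough, untested sketch assuming the Lucene 3.x TermEnum/MultiSearcher API; the index paths, the two-feed setup, and the capped hit count are made up for illustration, the field names come from the shoe example above, and I am assuming ID is a stored field:

import java.io.File;
import java.io.IOException;
import java.util.*;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class BruteForceClusters {

    public static void main(String[] args) throws IOException {
        // One index per feed; the paths are made up for the sketch.
        IndexReader feedA = IndexReader.open(FSDirectory.open(new File("/indexes/feedA")));
        IndexReader feedB = IndexReader.open(FSDirectory.open(new File("/indexes/feedB")));
        IndexReader[] readers = { feedA, feedB };

        // Step 1: union of the distinct SIZE values across all feed indexes.
        Set<String> sizes = new TreeSet<String>();
        for (IndexReader reader : readers) {
            TermEnum terms = reader.terms(new Term("SIZE", ""));
            try {
                do {
                    Term t = terms.term();
                    if (t == null || !"SIZE".equals(t.field())) {
                        break;
                    }
                    sizes.add(t.text());
                } while (terms.next());
            } finally {
                terms.close();
            }
        }

        // Step 2: one MultiSearcher search per distinct value -- this is the part
        // that turns into tens of millions of searches on the real data set.
        MultiSearcher searcher = new MultiSearcher(new Searchable[] {
                new IndexSearcher(feedA), new IndexSearcher(feedB) });
        for (String size : sizes) {
            // Hit count capped just to keep the sketch simple; a real run would
            // need a Collector because a cluster can be arbitrarily large.
            TopDocs hits = searcher.search(new TermQuery(new Term("SIZE", size)), 1000);
            List<String> cluster = new ArrayList<String>();
            for (ScoreDoc sd : hits.scoreDocs) {
                cluster.add(searcher.doc(sd.doc).get("ID"));
            }
            System.out.println("SIZE=" + size + " -> " + cluster);
        }
        searcher.close();
    }
}

Even written this way, step 2 is the part I am worried about, which is why I am asking whether the grouping can be pulled more directly out of the index.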
Here's the use case. Documents arrive from different feeds. Each feed produces millions and millions of documents. Documents are structured and share certain "interesting" fields. For the purpose of illustration here's a trivial example. Let's say the documents on each feed represent shoes. A shoe has an ID (uniquely identifying it within its feed) and a SIZE [1]. I want to be able to ask Lucene (assuming the shoe documents are indexed of course) for all of the clusters of shoes that sport the same size. A cluster of shoes is just the IDs of the shoes that got grouped together due to having the same value of the SIZE field. I don't think that doing this brute force will perform. Here's what I mean when I say brute force. I looked at IndexReader, and I saw that I can get the distinct values for an indexed field. So assuming that each feed will have its own index (I don't think I can get away with a single index for all feeds [2]), I can get the union of all distinct values for the interesting field across all indexes (one per each document feed). Then for each distinct value I can do a MultiSearcher search across these indexes getting the IDs of the documents . My gut tells me that the brute force approach won't perform [3]. And here's where you guys come in - is it possible to ask lucene to give me the groups of records (across indixes) that share the value for a given field? Is there something else in API that I can take advantage of and get what I need faster? If I am to extend Lucene to allow for this sort of thing, where should I start, what documents should I read,...? Thanks very much again for your advise, -nik [1] Shoe document layout: SHOE |--ID |--SIZE [2] The volume of documents I get from each feed is huge. Also, even though a lot of fields are shared the format across feeds varies and a feed may not have some of the "interesting" fields of another feed. [3] The dataset I am talking about is rich and the brute force approach means that I need to be doing tens of millions of searches just to group on one field. Also I most likely will blow my heap up if I try to load all of the values in memory all at once. --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org