cluster documents based on fields' values

Nik Kolev Sun, 15 Aug 2010 14:04:56 -0700

Hi,

I am researching the possibility of using Lucene for discovering
clusters of documents and since I am new to Lucene I decided to
ask the community for advice before I poke the APIs and the internals.
Your input will be invaluable!


Here's the use case. Documents arrive from different feeds. Each feed
produces millions and millions of documents. Documents are structured
and share certain "interesting" fields. For the purpose of illustration
here's a trivial example. Let's say the documents on each feed represent
shoes. A shoe has an ID (uniquely identifying it within its feed) and a
SIZE [1]. I want to be able to ask Lucene (assuming the shoe documents
are indexed of course) for all of the clusters of shoes that sport the same
size. A cluster of shoes is just the IDs of the shoes that got grouped
together due to having the same value of the SIZE field.

I don't think that doing this brute force will perform. Here's what I
mean when I say brute force. I looked at IndexReader, and I saw that
I can get the distinct values for an indexed field. So assuming that
each feed will have its own index (I don't think I can get away with
a single index for all feeds [2]), I can get the union of all distinct
values for the interesting field across all indexes (one per each
document feed). Then for each distinct value I can do a MultiSearcher
search across these indexes getting the IDs of the documents .

My gut tells me that the brute force approach won't perform [3]. And
here's where you guys come in - is it possible to ask lucene to give
me the groups of records (across indixes) that share the value for a
given field? Is there something else in API that I can take advantage
of and get what I need faster? If I am to extend Lucene to allow for
this sort of thing, where should I start, what documents should I
read,...?

Thanks very much again for your advise,
-nik

[1] Shoe document layout:
    SHOE
      |--ID
      |--SIZE

[2] The volume of documents I get from each feed is huge. Also, even
    though a lot of fields are shared the format across feeds varies
    and a feed may not have some of the "interesting" fields of
    another feed.

[3] The dataset I am talking about is rich and the brute force approach
    means that I need to be doing tens of millions of searches just to
    group on one field. Also I most likely will blow my heap up if I try to
    load all of the values in memory all at once.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

cluster documents based on fields' values

Reply via email to