This puzzle has been bugging me for a while; I'm hoping there's an elegant way to handle it in Lucene.

DATA DESCRIPTION:

I've got an index of over 100,000 Documents. In addition to other fields, each of these Documents has 0 or more "category" field values. There are over 5,500 such categories (it's not a small set). Anywhere from 1 to 500+ Documents could belong to a single "category". This index does not get updated very often: anywhere from once a day to once a month. Indexing currently takes 15-30 minutes from start to finish, including optimization.


PROBLEM:

I'd like to provide users a way to search these "category" values. For example, suppose the user searches for "fiction". They might see results of: { "fiction", "non-fiction" }. However, I'd like to do this search as quickly and efficiently as is reasonable. For example, if there are 500 Documents of category "fiction" and 400 of "non-fiction", I don't want to Sort and iterate through each Hit to weed out the duplicate values from my query.

For what it's worth, I imagine only 0-20 categories would match a given query.


SIMPLEST SOLUTION I CAN THINK OF:

The best I can imagine is to maintain a separate Lucene index for each of these category types. Each Document in this separate index would probably have the fields "field_name" and "field_value", and the index would not contain any duplicates. For example, you might see a Document of field_name "category" and field_value "non-fiction". My query would hit this second index instead, to perform these metadata searches.
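
To make the idea concrete, here is a rough plain-Java sketch of what that second index would hold. It deliberately uses no Lucene classes: the class name DistinctValueIndex and its add/search methods are made up for illustration, and the token match (split on non-alphanumeric characters, compare lowercase) is only my assumption of how an analyzer would let "fiction" match "non-fiction".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical stand-in for the proposed second index: one entry per
// distinct (field_name, field_value) pair. Not a Lucene API.
final class DistinctValueIndex {
    // field_name -> sorted set of distinct field values
    private final Map<String, Set<String>> valuesByField = new TreeMap<>();

    // Called once per field value while the main index is being built.
    // Adding a value that is already present changes nothing, so
    // duplicates never accumulate.
    void add(String fieldName, String fieldValue) {
        valuesByField.computeIfAbsent(fieldName, k -> new TreeSet<>()).add(fieldValue);
    }

    // Returns the distinct values of fieldName that contain a token equal
    // to term (assumed tokenization: lowercase, split on non-alphanumerics).
    List<String> search(String fieldName, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        Set<String> values = valuesByField.getOrDefault(fieldName, Collections.emptySet());
        for (String value : values) {
            for (String token : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (token.equals(needle)) {
                    matches.add(value);
                    break; // one matching token is enough for this value
                }
            }
        }
        return matches;
    }
}
```

Here the scan runs over the distinct values (about 5,500 in my case) rather than over every matching Document, which is the property I am after. In a real second Lucene index, the inverted index would presumably replace this linear scan.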


I hope that makes sense; do you know of a more elegant way to handle this type of problem?


Thanks,

Tyler
