This puzzle has been bugging me for a while; I'm hoping there's an
elegant way to handle it in Lucene.
DATA DESCRIPTION:
I've got an index of over 100,000 Documents. In addition to other
fields, each of these Documents has 0 or more "category" field
values. There are over 5,500 such categories (it's not a small set).
Anywhere from 1 to 500+ Documents could belong to a single
"category". The index is not updated very often: anywhere from
once a day to once a month. A full index build, including the final
optimization, currently takes 15-30 minutes.
PROBLEM:
I'd like to provide users a way to search these "category" values.
For example, suppose the user searches for "fiction". They might see
results of: { "fiction", "non-fiction" }. However, I'd like to do
this search as quickly and efficiently as reasonable. For example, if
there are 500 Documents of category "fiction" and 400 of
"non-fiction", I don't want to Sort and iterate through every Hit just
to weed out the duplicate values from my query results.
For what it's worth, I imagine only 0-20 categories would match a
given query.
SIMPLEST SOLUTION I CAN THINK OF:
The best I can imagine is to maintain a separate Lucene index for
each of these category types. Each Document in this separate index
would probably have two fields, "field_name" and "field_value", and
the index would contain no duplicate Documents. For example, you might
see a Document with field_name "category" and field_value
"non-fiction". My query would hit this second index instead of the
main one to perform these metadata searches.
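To make the idea concrete, here is a rough sketch in plain Java
rather than Lucene code. CategoryValueIndex is a made-up name, and
the lowercasing and substring matching are only assumptions I picked
for illustration; in the real thing a Lucene analyzer and query would
do that job against the second index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for the proposed second index.
// This is NOT Lucene API; it only illustrates "one entry per distinct value".
final class CategoryValueIndex {
    // Distinct category values, kept sorted so results come back in a stable order.
    private final TreeSet<String> values = new TreeSet<>();

    // Called once per category value of each Document at (re)indexing time.
    // Duplicates collapse because the set keeps a single copy of each value.
    void add(String categoryValue) {
        // Lowercasing is an assumption made for this sketch.
        values.add(categoryValue.toLowerCase(Locale.ROOT));
    }

    // Returns every distinct value that contains the query text.
    // Only the distinct values are scanned, never the Documents themselves.
    List<String> search(String query) {
        String needle = query.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (String value : values) {
            if (value.contains(needle)) {
                matches.add(value);
            }
        }
        return matches;
    }

    // Number of distinct values stored.
    int size() {
        return values.size();
    }

    public static void main(String[] args) {
        CategoryValueIndex index = new CategoryValueIndex();
        // Simulates category values seen while walking the main index.
        for (String v : Arrays.asList("fiction", "non-fiction", "fiction", "poetry", "non-fiction")) {
            index.add(v);
        }
        System.out.println(index.search("fiction")); // prints [fiction, non-fiction]
    }
}
```

The point is that the lookup cost depends on the number of distinct
category values (about 5,500 here), not on how many Documents carry
each value.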
I hope that makes sense; do you know of a more elegant way to handle
this type of problem?
Thanks,
Tyler