Mr Plate wrote:
This puzzle has been bugging me for a while; I'm hoping there's an
elegant way to handle it in Lucene.
DATA DESCRIPTION:
I've got an index of over 100,000 Documents. In addition to other
fields, each of these Documents has 0 or more "category" field values.
There are over 5,500 such categories (it's not a small set). Anywhere
from 1 to 500+ Documents could belong to a single "category". This
index does not get updated very often; anywhere from once a day to once
a month. Indexing time is currently 15-30 minutes from start to
finish/optimization.
PROBLEM:
I'd like to provide users a way to search these "category" values. For
example, suppose the user searches for "fiction". They might see
results of: { "fiction", "non-fiction" }. However, I'd like to do this
search as quickly and efficiently as reasonable. For example, if there
are 500 Documents of category "fiction", and 400 of "non- fiction", I
don't want to Sort and iterate through each Hit to weed out the
duplicate values from my query.
For what it's worth, I imagine only 0-20 categories would match a given
query.
SIMPLEST SOLUTION I CAN THINK OF:
The best I can imagine is to maintain a separate Lucene index for each
of these category types. Each Document in this separate index would
probably have fields of "field_name", and "field_value", and would not
contain any duplicates. For example, you might see a Document of
field_name "category" and field_value "non-fiction". My query would hit
this second index instead, to perform these metadata searches.
I hope that makes sense; do you know of a more elegant way to handle
this type of problem?
I'm guessing that each Document doesn't have a "category" field with
multiple values in it but, instead, has a uniquely-named field for each
category. Would it work to change your data model to the former? That
is, have a Text field named "category" in each document, so that it gets
tokenized and indexed. Then you could do a search of the 5K category
names (outside of Lucene, perhaps by getting the list of Terms from the
"category" field) for the query term of interest, "fiction" in your
example, then compose a Lucene query with the results. Your example
would produce a query equivalent to 'category:fiction
category:non-fiction'. For only 100K documents, this should be pretty fast.
Good luck!
--MDC
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]