Mr Plate wrote:

This puzzle has been bugging me for a while; I'm hoping there's an elegant way to handle it in Lucene.

DATA DESCRIPTION:

I've got an index of over 100,000 Documents. In addition to other fields, each of these Documents has 0 or more "category" field values. There are over 5,500 such categories (it's not a small set). Anywhere from 1 to 500+ Documents could belong to a single "category". This index does not get updated very often; anywhere from once a day to once a month. Indexing time is currently 15-30 minutes from start to finish/optimization.


PROBLEM:

I'd like to provide users a way to search these "category" values. For example, suppose the user searches for "fiction". They might see results of: { "fiction", "non-fiction" }. However, I'd like to do this search as quickly and efficiently as reasonable. For example, if there are 500 Documents of category "fiction", and 400 of "non- fiction", I don't want to Sort and iterate through each Hit to weed out the duplicate values from my query.

For what it's worth, I imagine only 0-20 categories would match a given query.


SIMPLEST SOLUTION I CAN THINK OF:

The best I can imagine is to maintain a separate Lucene index for each of these category types. Each Document in this separate index would probably have fields of "field_name", and "field_value", and would not contain any duplicates. For example, you might see a Document of field_name "category" and field_value "non-fiction". My query would hit this second index instead, to perform these metadata searches.


I hope that makes sense; do you know of a more elegant way to handle this type of problem?

I'm guessing that each Document doesn't have a "category" field with multiple values in it but, instead, has a uniquely-named field for each category. Would it work to change your data model to the former? That is, have a Text field named "category" in each document, so that it gets tokenized and indexed. Then you could do a search of the 5K category names (outside of Lucene, perhaps by getting the list of Terms from the "category" field) for the query term of interest, "fiction" in your example, then compose a Lucene query with the results. Your example would produce a query equivalent to 'category:fiction category:non-fiction'. For only 100K documents, this should be pretty fast.

Good luck!

--MDC

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to