Ahh, interesting point, though I'm afraid it solves a different problem than the one I intended. Re-reading my message, I think I described my problem in a very obscure way. Sorry :-/.

Basically, pretend I do a regular search for "category:fiction". After stemming/etc, this would match any Document with a category of "fiction", "non-fiction", "fictitious", etc. All 900+ of them.

BUT as far as the results are concerned, I'm not actually interested in each Document that was hit, nor about any other field besides the "category" field. I just want a list of the unique categories that matched the search string of "fiction".

In this example, my ultimate goal would be a String[] of:
{ "fiction", "fictitious", "non-fiction" }

... without any costly iterations of all 900+ Hit Documents' category values of:
{ "fiction", "non-fiction", "fiction", "fiction", "fiction", "fictitious", "non-fiction", ... }

Again, I want to find a *unique* list of "category" field values that match certain query text. I know this can be done using a second index, but wanted to be sure there isn't an obvious, less-hacky way first. I'm used to Lucene surprising me with sneaky efficiencies.

Thanks for the valiant effort to make sense of me! :)

Tyler

On 12/15/05, Michael D. Curtin <[EMAIL PROTECTED]> wrote:
> Mr Plate wrote:
>
> > This puzzle has been bugging me for a while; I'm hoping there's an
> > elegant way to handle it in Lucene.
> >
> > DATA DESCRIPTION:
> >
> > I've got an index of over 100,000 Documents. In addition to other
> > fields, each of these Documents has 0 or more "category" field values.
> > There are over 5,500 such categories (it's not a small set). Anywhere
> > from 1 to 500+ Documents could belong to a single "category". This
> > index does not get updated very often; anywhere from once a day to once
> > a month. Indexing time is currently 15-30 minutes from start to
> > finish/optimization.
> >
> >
> > PROBLEM:
> >
> > I'd like to provide users a way to search these "category" values. For
> > example, suppose the user searches for "fiction". They might see
> > results of: { "fiction", "non-fiction" }.
> > However, I'd like to do this search as quickly and efficiently as
> > reasonable. For example, if there are 500 Documents of category
> > "fiction", and 400 of "non-fiction", I don't want to Sort and iterate
> > through each Hit to weed out the duplicate values from my query.
> >
> > For what it's worth, I imagine only 0-20 categories would match a given
> > query.
> >
> >
> > SIMPLEST SOLUTION I CAN THINK OF:
> >
> > The best I can imagine is to maintain a separate Lucene index for each
> > of these category types. Each Document in this separate index would
> > probably have fields of "field_name", and "field_value", and would not
> > contain any duplicates. For example, you might see a Document of
> > field_name "category" and field_value "non-fiction". My query would hit
> > this second index instead, to perform these metadata searches.
> >
> >
> > I hope that makes sense; do you know of a more elegant way to handle
> > this type of problem?
>
> I'm guessing that each Document doesn't have a "category" field with
> multiple values in it but, instead, has a uniquely-named field for each
> category. Would it work to change your data model to the former? That
> is, have a Text field named "category" in each document, so that it gets
> tokenized and indexed. Then you could do a search of the 5K category
> names (outside of Lucene, perhaps by getting the list of Terms from the
> "category" field) for the query term of interest, "fiction" in your
> example, then compose a Lucene query with the results. Your example
> would produce a query equivalent to 'category:fiction
> category:non-fiction'. For only 100K documents, this should be pretty fast.
>
> Good luck!
>
> --MDC
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
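
P.S. To make the goal concrete, here is a minimal sketch of the "search the unique category values, not the Documents" idea discussed above. It is plain Java collections, not Lucene API: the class CategoryLookup and its methods uniqueValues and matching are hypothetical names made up for illustration, and a simple case-insensitive substring test stands in for whatever analysis or stemming a real query would apply (so "fictitious" would not match "fiction" here). The set of unique values plays the role that the second index, or the list of Terms from the "category" field, would play in practice.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical helper (not part of Lucene): searches category values, not Documents. */
class CategoryLookup {

    /**
     * Collapse every Document's category values into one sorted, duplicate-free set.
     * Intended to run once, for example right after indexing.
     */
    static SortedSet<String> uniqueValues(Collection<List<String>> categoriesPerDocument) {
        SortedSet<String> unique = new TreeSet<>();
        for (List<String> categories : categoriesPerDocument) {
            unique.addAll(categories); // a Document may carry 0 or more categories
        }
        return unique;
    }

    /**
     * Scan only the unique values, so the cost depends on the number of
     * distinct categories rather than on the number of matching Documents.
     * ASSUMPTION: substring matching is a stand-in for real query analysis.
     */
    static String[] matching(SortedSet<String> uniqueValues, String queryText) {
        String needle = queryText.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String value : uniqueValues) {
            if (value.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(value);
            }
        }
        return hits.toArray(new String[0]);
    }
}
```

With the numbers from this thread, the scan would touch roughly 5,500 distinct values instead of the 900+ hit Documents and their repeated category fields, and the result comes back already de-duplicated and sorted.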