Re: How to retrieve distinct field matches?

Michael D. Curtin Thu, 15 Dec 2005 18:03:35 -0800

Mr Plate wrote:

This puzzle has been bugging me for a while; I'm hoping there's an elegant way to handle it in Lucene.
DATA DESCRIPTION:
I've got an index of over 100,000 Documents. In addition to other fields, each of these Documents has 0 or more "category" field values. There are over 5,500 such categories (it's not a small set). Anywhere from 1 to 500+ Documents could belong to a single "category". This index does not get updated very often; anywhere from once a day to once a month. Indexing time is currently 15-30 minutes from start to finish/optimization.
PROBLEM:
I'd like to provide users a way to search these "category" values. For example, suppose the user searches for "fiction". They might see results of: { "fiction", "non-fiction" }. However, I'd like to do this search as quickly and efficiently as reasonable. For example, if there are 500 Documents of category "fiction", and 400 of "non-fiction", I don't want to Sort and iterate through each Hit to weed out the duplicate values from my query.
For what it's worth, I imagine only 0-20 categories would match a given query.
SIMPLEST SOLUTION I CAN THINK OF:
The best I can imagine is to maintain a separate Lucene index for each of these category types. Each Document in this separate index would probably have fields of "field_name" and "field_value", and would not contain any duplicates. For example, you might see a Document of field_name "category" and field_value "non-fiction". My query would hit this second index instead, to perform these metadata searches.
I hope that makes sense; do you know of a more elegant way to handle this type of problem?

I'm guessing that each Document doesn't have a "category" field with multiple values in it but, instead, has a uniquely-named field for each category. Would it work to change your data model to the former? That is, have a Text field named "category" in each document, so that it gets tokenized and indexed. Then you could do a search of the 5K category names (outside of Lucene, perhaps by getting the list of Terms from the "category" field) for the query term of interest, "fiction" in your example, then compose a Lucene query with the results. Your example would produce a query equivalent to 'category:fiction category:non-fiction'. For only 100K documents, this should be pretty fast.
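
To make that concrete, here is a rough sketch of the "search the category names outside of Lucene, then compose a query" step. It is plain Java with no Lucene calls. The class and method names (CategoryLookup, matchingCategories, toQueryString) are made up for illustration, the category list is hard-coded where a real application would load the distinct terms of the "category" field from the index, and the case-insensitive substring match is just one assumed way to compare names. The values are not escaped for the query parser, so treat the output as illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

public class CategoryLookup {

    /**
     * Returns the distinct category names containing the user's text
     * (case-insensitive substring match), in the set's sorted order.
     */
    public static List<String> matchingCategories(SortedSet<String> categories, String userText) {
        String needle = userText.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (String category : categories) {
            if (category.toLowerCase(Locale.ROOT).contains(needle)) {
                matches.add(category);
            }
        }
        return matches;
    }

    /**
     * Joins the matches into a query string of the form
     * "field:value field:value". No query-syntax escaping is done here.
     */
    public static String toQueryString(String field, List<String> values) {
        StringBuilder sb = new StringBuilder();
        for (String value : values) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(field).append(':').append(value);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the ~5K distinct category terms in the index.
        SortedSet<String> categories = new TreeSet<>();
        categories.add("poetry");
        categories.add("non-fiction");
        categories.add("fiction");

        List<String> hits = matchingCategories(categories, "fiction");
        System.out.println(hits);                            // [fiction, non-fiction]
        System.out.println(toQueryString("category", hits)); // category:fiction category:non-fiction
    }
}
```

Because the name list is already distinct, the user sees each matching category once, and the composed query is only needed if they then want the Documents behind those categories.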


Good luck!

--MDC
