Hello,

Maybe the answer to this is obvious and I'm missing something, but here goes:

Suppose I have a field which contains a string of one or more tokens from a set. The set has about 50 possible values, and the values themselves are arbitrary (though they are known ahead of time, and could be ordered alphabetically if it helped). For example:

doc1: "red"
doc2: "blip red"
doc3: "aardvark blip red"
doc4: "aardvark potato"

I want to query the field for all documents that contain at least one of the tokens specified in the query *and no tokens that aren't in the query*. What's the best query for that?

For example, querying for

- "red" should *only* match doc1 above
- "blip red" should match doc1 *and* doc2
- "blip red potato" should also match doc1 and doc2
- "aardvark blip" would not match any of the documents, since neither term appears on its own above, and it would need "red" as well to match doc3
- "aardvark blip red potato" would match all of the documents

Options?

1. I could formulate the query to include all the required tokens and negate all the other tokens from the set, e.g. "blip red" would be "+(blip red) -(aardvark potato....)", and "red" would be "+(red) -(aardvark blip potato...)". The size of the set is fixed, so the number of terms in the query won't change, just whether they are included or excluded. But having to specify all the negations seems inefficient.

2. I could change the way the data is indexed so that the field is concatenated deterministically and tokenized as a single value, and query for combinations of terms, e.g. "blip red" would be "blip red blip-red". But with more than a handful of terms the fan-out becomes significant, e.g. "aardvark blip red" becomes "aardvark blip red aardvark-blip aardvark-red blip-red aardvark-blip-red" and so on, with (2^N)-1 combinations.

So option 1 should be fairly constant regardless of the number of terms but may be wasteful for low numbers of terms, while option 2 generates > 1000 combinations for a query with 10 terms.
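
To make the intended semantics and the two options concrete, here is a minimal Python sketch. It is plain Python, not Lucene or Solr code. The four-token `VOCABULARY` is a hypothetical stand-in for the real set of about 50 values, and the function names (`matches`, `exclusion_query`, `combination_terms`, `matching_docs`) are made up for illustration. The query strings it builds simply follow the forms written out in options 1 and 2 above.

```python
from itertools import combinations

# Hypothetical stand-in for the real set of ~50 known tokens.
VOCABULARY = {"aardvark", "blip", "potato", "red"}

# The example documents from the post.
DOCS = {
    "doc1": "red",
    "doc2": "blip red",
    "doc3": "aardvark blip red",
    "doc4": "aardvark potato",
}


def matches(doc_field, query):
    """Desired semantics: the document has at least one token,
    and every one of its tokens appears in the query."""
    doc_tokens = set(doc_field.split())
    return bool(doc_tokens) and doc_tokens <= set(query.split())


def matching_docs(query):
    """Names of the example documents that satisfy the desired semantics."""
    return sorted(name for name, field in DOCS.items() if matches(field, query))


def exclusion_query(query, vocabulary=VOCABULARY):
    """Option 1: require the query tokens, prohibit every other known token."""
    wanted = sorted(set(query.split()))
    unwanted = sorted(set(vocabulary) - set(wanted))
    result = "+(" + " ".join(wanted) + ")"
    if unwanted:
        result += " -(" + " ".join(unwanted) + ")"
    return result


def combination_terms(query):
    """Option 2: every non-empty combination of the query tokens,
    joined with '-' in alphabetical order, (2^N)-1 terms in total."""
    tokens = sorted(set(query.split()))
    return [
        "-".join(combo)
        for size in range(1, len(tokens) + 1)
        for combo in combinations(tokens, size)
    ]
```

With the example data, `matching_docs("blip red")` gives doc1 and doc2, and `exclusion_query("blip red")` gives "+(blip red) -(aardvark potato)". The length of the option 1 query is bounded by the size of the set, whereas `len(combination_terms(...))` doubles (plus one) with every extra query token, reaching 1023 for 10 tokens.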

Is that a problem for Lucene in practice though? For 20 terms it would create >1 million combinations, which does sound like a problem, but a query with that many terms may not be needed.

I'm leaning towards option 1, but is it a bad solution? Is there a better option I'm missing?

On a related note, does the EnumFieldType enable a more efficient query than other field types, or does it just provide explicit sorting? i.e. would a multivalued EnumFieldType be better for this?

Thanks,
Colvin