What's the best way to formulate this query?

Colvin Cowie Mon, 01 Aug 2022 01:59:17 -0700

Hello,

Maybe the answer to this is obvious and I'm missing something, but here
goes:


Suppose I have a field which contains a string of one or more tokens from a
set. The set has about 50 possible values, and the values themselves are
arbitrary (though they are known ahead of time, and could be ordered
alphabetically if it helped). e.g.
doc1: "red"
doc2: "blip red"
doc3: "aardvark blip red"
doc4: "aardvark potato"

I want to query the field for all documents that contain at least one of
the tokens specified in the query *and no tokens that aren't in the query*.
What's the best query for that?

For example, querying for

   - "*red*" should *only* match doc1 above
   - "*blip red*" should match doc1 *and* doc2
   - "*blip red potato*" should also match doc1 and doc2.
   - "*aardvark blip*" would not match any of the documents since neither
   term appears on its own above, and it would need "*red*" as well to
   match doc3.
   - "*aardvark blip red potato*" would match all of the documents.
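
To pin down the semantics independently of any Lucene query syntax, here is a minimal reference sketch in plain Python. It assumes tokens are whitespace-separated; the `matches` and `matching_docs` helpers are hypothetical names used only for illustration. A document matches when its token set is a non-empty subset of the query's token set.

```python
# Reference semantics only: this does not use Lucene or Solr.
DOCS = {
    "doc1": "red",
    "doc2": "blip red",
    "doc3": "aardvark blip red",
    "doc4": "aardvark potato",
}


def matches(doc_field: str, query: str) -> bool:
    """Return True if the document has at least one token and every one
    of its tokens also appears in the query."""
    doc_tokens = set(doc_field.split())
    query_tokens = set(query.split())
    return bool(doc_tokens) and doc_tokens <= query_tokens


def matching_docs(query: str) -> list[str]:
    """Names of all example documents that match the query, sorted."""
    return sorted(name for name, field in DOCS.items() if matches(field, query))


if __name__ == "__main__":
    for q in ("red", "blip red", "blip red potato",
              "aardvark blip", "aardvark blip red potato"):
        print(q, "->", matching_docs(q))
```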


Options?

   1. I could formulate the query to include all the required tokens and
   negate all the other tokens from the set, e.g. "*blip red*" would be
   "*+(blip red) -(aardvark potato....)*", and "*red*" would be
   "*+(red) -(aardvark blip potato...)*". The size of the set is fixed,
   so the number of terms in the query won't change, just whether they are
   included or excluded. But having to specify all the negations seems
   inefficient.
   2. I could change the way the data is indexed so that the field is
   concatenated deterministically and tokenized as a single value, and query
   for combinations of terms. e.g. "*blip red*" would be "*blip red
   blip-red*", but with more than a handful of terms the fan-out becomes
   significant, e.g. "*aardvark blip red*" becomes "*aardvark blip red
   aardvark-blip aardvark-red blip-red aardvark-blip-red*" and so on, with
   (2^N)-1 combinations for N query terms.
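
As a rough sketch of how the two options could be generated, the following Python builds the query text for each. The four-value `ALL_TOKENS` set stands in for the real set of about 50 values, and the function names are hypothetical. Only strings are produced here; nothing is sent to Lucene.

```python
from itertools import combinations

# Assumption: a stand-in for the full, known-ahead-of-time set of ~50 values.
ALL_TOKENS = ["aardvark", "blip", "potato", "red"]


def negation_query(query_tokens: list[str], all_tokens: list[str] = ALL_TOKENS) -> str:
    """Option 1: require one of the queried tokens, negate every other token."""
    wanted = sorted(set(query_tokens))
    excluded = sorted(set(all_tokens) - set(wanted))
    query = "+(" + " ".join(wanted) + ")"
    if excluded:
        query += " -(" + " ".join(excluded) + ")"
    return query


def combination_terms(query_tokens: list[str]) -> list[str]:
    """Option 2: every non-empty combination of the queried tokens, each
    joined with '-' in alphabetical order to mirror the indexed value."""
    wanted = sorted(set(query_tokens))
    return [
        "-".join(combo)
        for size in range(1, len(wanted) + 1)
        for combo in combinations(wanted, size)
    ]


if __name__ == "__main__":
    print(negation_query(["blip", "red"]))
    print(" ".join(combination_terms(["aardvark", "blip", "red"])))
```

Sorting the tokens before joining is what makes the concatenation deterministic, so the same combination always yields the same single term at index time and at query time.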

So the query size for option 1 should be fairly constant regardless of the
number of terms queried, but may be wasteful for low numbers of terms, while
option 2 generates > 1000 combinations for a query with 10 terms. Is that a
problem for Lucene in practice though? For 20 terms it would create >1
million combinations, which does sound like a problem, but a query with that
many terms may not be needed.
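
The arithmetic behind that comparison can be checked with a short sketch. It assumes the set has exactly 50 values (the post says "about 50") and counts only the number of terms in each query, which says nothing by itself about actual Lucene performance. The function names are hypothetical.

```python
SET_SIZE = 50  # assumption: "about 50 possible values" taken as exactly 50


def option1_term_count(num_query_terms: int) -> int:
    """Option 1 lists every value in the set once, required or negated,
    so the count does not depend on how many terms are queried."""
    return SET_SIZE


def option2_term_count(num_query_terms: int) -> int:
    """Option 2 needs every non-empty combination of the queried terms."""
    return 2 ** num_query_terms - 1


def crossover() -> int:
    """Smallest number of query terms at which option 2 needs more terms
    than option 1."""
    n = 1
    while option2_term_count(n) <= option1_term_count(n):
        n += 1
    return n


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(n, option1_term_count(n), option2_term_count(n))
    print("crossover at", crossover(), "query terms")
```

Under these assumptions option 2 yields 1023 terms for 10 query terms and 1048575 for 20, and it overtakes the fixed 50 terms of option 1 once a query has 6 terms.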

I'm leaning towards 1 - but is it a bad solution? Is there a better option
I'm missing?

On a related note, does the EnumFieldType enable a more efficient query
than other field types, or does it just provide explicit sorting? i.e.
would a multivalued EFT be better for this?

Thanks,
Colvin
