Hi Colvin,

You wouldn't normally query with more than, say, 1K terms at once, because
the query can get expensive.

Here's a crazy idea: map words to numbers, sorted alphabetically. For
example:

aardvark - 1
blip - 2
potato - 3
red - 4

When you formulate the query, you do the same translation, sort the resulting
numbers, then search for something like:
- any of the numbers
- minus the (exclusive) ranges before, between, and after them

For example, if I'm searching for "red potato", then the query will be
something like:

(3 OR 4) -{* TO 3} -{3 TO 4} -{4 TO *}

Note that I included the exclusive {3 TO 4} range even though it can never
match anything (3 and 4 are consecutive), because a naive implementation
wouldn't check whether the numbers are consecutive and drop the empty ranges.
That would be an optimization.
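
To make it more concrete, here's a rough, untested sketch in plain Java of how
you could build such a query string. It assumes the full vocabulary is known up
front, that each word is indexed as its 1-based alphabetical position in a
numeric field, and that every query word is in the vocabulary; the class and
method names are just placeholders.

import java.util.*;

public class SetQueryBuilder {

    // Build a query like "(3 OR 4) -{* TO 3} -{3 TO 4} -{4 TO *}" from the
    // query words, using their 1-based alphabetical positions in the vocabulary.
    public static String buildQuery(List<String> vocabulary, Set<String> queryWords) {
        List<String> sortedVocab = new ArrayList<>(vocabulary);
        Collections.sort(sortedVocab);

        // Map each word to its 1-based alphabetical position.
        Map<String, Integer> wordToNumber = new HashMap<>();
        for (int i = 0; i < sortedVocab.size(); i++) {
            wordToNumber.put(sortedVocab.get(i), i + 1);
        }

        // Translate the query words to numbers and sort them.
        List<Integer> numbers = new ArrayList<>();
        for (String word : queryWords) {
            numbers.add(wordToNumber.get(word));
        }
        Collections.sort(numbers);

        // "any of the numbers" part, e.g. (3 OR 4)
        StringBuilder q = new StringBuilder("(");
        for (int i = 0; i < numbers.size(); i++) {
            if (i > 0) q.append(" OR ");
            q.append(numbers.get(i));
        }
        q.append(")");

        // Negate the exclusive ranges before, between, and after the numbers.
        q.append(" -{* TO ").append(numbers.get(0)).append("}");
        for (int i = 0; i < numbers.size() - 1; i++) {
            q.append(" -{").append(numbers.get(i))
             .append(" TO ").append(numbers.get(i + 1)).append("}");
        }
        q.append(" -{").append(numbers.get(numbers.size() - 1)).append(" TO *}");

        return q.toString();
    }

    public static void main(String[] args) {
        List<String> vocab = List.of("aardvark", "blip", "potato", "red");
        // Prints: (3 OR 4) -{* TO 3} -{3 TO 4} -{4 TO *}
        System.out.println(buildQuery(vocab, Set.of("potato", "red")));
    }
}

With only ~50 possible values, the translation table is tiny, so you can keep
it in memory wherever you build queries; the same mapping would have to be
applied at index time, of course.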

Best regards,
Radu
--
Elasticsearch/OpenSearch & Solr Consulting, Production Support & Training
Sematext Cloud - Full Stack Observability
https://sematext.com/


On Mon, Aug 1, 2022 at 11:59 AM Colvin Cowie <colvin.cowie....@gmail.com>
wrote:

> Hello,
>
> Maybe the answer to this is obvious and I'm missing something, but here
> goes:
>
> Suppose I have a field which contains a string of one or more tokens from a
> set. The set has about 50 possible values, and the values themselves are
> arbitrary (though they are known ahead of time, and could be ordered
> alphabetically if it helped). e.g.
> doc1: "red"
> doc2: "blip red"
> doc3: "aardvark blip red"
> doc4: "aardvark potato"
>
> I want to query the field for all documents that contain at least one of
> the tokens specified in the query *and no tokens that aren't in the query*.
> What's the best query for that?
>
> For example, querying for
>
>    - "*red*" should *only* match doc1 above
>    - "*blip red*" should match doc1 *and* doc2
>    - "*blip red potato*" should also match doc1 and doc 2.
>    - "*aardvark blip*" would not match any of the documents since neither
>    term appears on its own above, and it would need "*red*" as well to
>    match doc3.
>    - "*aardvark blip red potato*" would match all of the documents.
>
>
> Options?
>
>    1. I could formulate the query to include all the required tokens and
>    negate all the other tokens from the set, e.g. "*blip red*" would be
>    "*+(blip red) -(aardvark potato...)*", and "*red*" would be "*+(red)
>    -(aardvark blip potato...)*"... The size of the set is fixed, so the
>    number of terms in the query won't change, just whether they are
>    included or excluded. But having to specify all the negations seems
>    inefficient.
>    2. I could change the way the data is indexed so that the field is
>    concatenated deterministically and tokenized as a single value, and
>    query for combinations of terms. e.g. "*blip red*" would be "*blip red
>    blip-red*", but with more than a handful of terms the fan-out becomes
>    significant, e.g. "*aardvark blip red*" becomes "*aardvark blip red
>    aardvark-blip aardvark-red blip-red aardvark-blip-red*" and so on, with
>    (2^N)-1 combinations.
>
> So option 1 should be fairly constant regardless of the number of terms but
> may be wasteful for low numbers of terms, while option 2 generates > 1000
> combinations for a query with 10 terms. Is that a problem for Lucene in
> practice though? For 20 terms it would create >1 million combinations,
> which does sound like a problem, but a query with that many terms may not
> be needed.
>
> I'm leaning towards 1 - but is it a bad solution? Is there a better option
> I'm missing?
>
> On a related note, does the EnumFieldType enable a more efficient query
> than other field types, or does it just provide explicit sorting? i.e.
> would a multivalued EFT be better for this?
>
> Thanks,
> Colvin
>
