Hi, Colvin.
This reminds me of percolator match logic; I've heard of such plugins for
Elasticsearch and Solr.
Think about min_should_match in dismax - the mm parameter.
If you index the number of tokens in a dedicated field, you can count
every term hit via a constant score (^=1), sum the hit scores, and then
cut off matches with weak coverage via {!frange} (comparing the sum of
scores to the field holding the token count). It was discussed in
comments/on the list years ago; I'm not sure if we've moved toward it
already. I also remember that such logic is built in you-know-where:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-terms-set-query.html
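
A rough sketch of that {!frange} approach as Solr request parameters (the
field names "tags" and "num_tokens", and the example terms, are assumptions
for illustration - not a tested recipe):

```
# Query for "blip red": keep only docs whose every token is in the query set.
# $cov scores one point per query term that hits the doc (^=1), so the sum
# equals num_tokens exactly when the doc has no extra tokens.
q={!frange l=0 u=0}sub(field(num_tokens),query($cov))
cov={!lucene df=tags}blip^=1 red^=1
fq=tags:(blip OR red)
```

The fq limits frange to candidates containing at least one query term; a doc
like "aardvark blip red" has num_tokens=3 but only 2 hits, so it's cut off.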

On Mon, Aug 1, 2022 at 6:59 PM Colvin Cowie <colvin.cowie....@gmail.com>
wrote:

> Hello,
>
> Maybe the answer to this is obvious and I'm missing something, but here
> goes:
>
> Suppose I have a field which contains a string of one or more tokens from a
> set. The set has about 50 possible values, and the values themselves are
> arbitrary (though they are known ahead of time, and could be ordered
> alphabetically if it helped). e.g.
> doc1: "red"
> doc2: "blip red"
> doc3: "aardvark blip red"
> doc4: "aardvark potato"
>
> I want to query the field for all documents that contain at least one of
> the tokens specified in the query *and no tokens that aren't in the query*.
> What's the best query for that?
>
> For example, querying for
>
>    - "*red*" should *only* match doc1 above
>    - "*blip red*" should match doc1 *and* doc2
>    - "*blip red potato*" should also match doc1 and doc2.
>    - "*aardvark blip*" would not match any of the documents since neither
>    term appears on its own above, and it would need "*red*" as well to
>    match doc3.
>    - "*aardvark blip red potato*" would match all of the documents.
>
>
> Options?
>
>    1. I could formulate the query to include all the required tokens and
>    negate all the other tokens from the set, e.g. "*blip red*" would
> be "*+(blip
>    red) -(aardvark potato....)*", and "*red*" would be "*+(red) -(aardvark
>    blip potato...)*"... The size of the set is fixed, so the number of
>    terms in the query won't change, just whether they are included or
>    excluded. But having to specify all the negations seems inefficient.
>    2. I could change the way the data is indexed so that the field is
>    concatenated deterministically and tokenized as a single value, and
> query
>    for combinations of terms. e.g. "*blip red*" would be "*blip red
>    blip-red*", but with more than a handful of terms the fan-out becomes
>    significant, e.g. "*aardvark* *blip red*" becomes "*aardvark blip red
>    aardvark-blip aardvark-red blip-red aardvark-blip-red *" and so on, with
>    (2^N)-1 combinations.
>
> So option 1 should be fairly constant regardless of the number of terms but
> may be wasteful for low numbers of terms, while option 2 generates > 1000
> combinations for a query with 10 terms. Is that a problem for Lucene in
> practice though? For 20 terms it would create >1 million combinations,
> which does sound like a problem, but a query with that many terms may not
> be needed.
>
> I'm leaning towards 1 - but is it a bad solution? Is there a better option
> I'm missing?
>
> On a related note, does the EnumFieldType enable a more efficient query
> than other field types, or does it just provide explicit sorting? i.e.
> would a multivalued EFT be better for this?
>
> Thanks,
> Colvin
>
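
P.S. The fan-out in option 2 can be made concrete with a quick sketch (plain
Python, just to illustrate the combinatorics, not Solr-specific):

```python
from itertools import combinations

def expand_terms(terms):
    """Generate every non-empty combination of the sorted terms,
    joining multi-term combinations with hyphens (option 2 above)."""
    terms = sorted(terms)
    out = []
    for r in range(1, len(terms) + 1):
        for combo in combinations(terms, r):
            out.append("-".join(combo))
    return out

# "aardvark blip red" fans out to (2^3)-1 = 7 indexed values:
print(expand_terms(["aardvark", "blip", "red"]))
# → ['aardvark', 'blip', 'red', 'aardvark-blip', 'aardvark-red',
#    'blip-red', 'aardvark-blip-red']
```

At 10 query terms that's 1023 values, and at 20 it's over a million, which is
why the per-doc token count trick above avoids enumerating combinations.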


-- 
Sincerely yours
Mikhail Khludnev
