Re: What's the best way to formulate this query?

Colvin Cowie Mon, 08 Aug 2022 13:26:39 -0700

Thanks for the suggestions Mikhail, cheers

On Tue, 2 Aug 2022 at 07:25, Mikhail Khludnev <m...@apache.org> wrote:


> Here are a few discussions
> https://issues.apache.org/jira/browse/LUCENE-7148
> and
>
> https://lucene.apache.org/core/8_11_0/sandbox/org/apache/lucene/search/CoveringQuery.html
>
>
> On Tue, Aug 2, 2022 at 1:25 PM Mikhail Khludnev <m...@apache.org> wrote:
>
> > Hi, Colvin.
> > It reminds me of percolator match logic. I've heard of such plugins for
> > Elastic and Solr.
> > Think about min_should_match in dismax (mm).
> > If one indexes the number of words in a dedicated field, one can count
> > every term hit via a constant score (^=1), sum the hit scores, then cut off
> > matches with weak coverage via {!frange} (compare the sum of scores to the
> > field holding the number of tokens). It was discussed in comments and on
> > the list years ago. Not sure if we have moved toward it already. I also
> > remember that such logic is built in you-know-where:
> > https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-terms-set-query.html
> >
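The coverage idea above (count one point per query term that hits, then compare the sum with a stored token count) can be sketched outside Solr. This is only an illustration of the logic, not Solr or Lucene code: `coverage_match` is a hypothetical helper, and it assumes each token appears at most once per document, so the stored count is the number of distinct tokens. CoveringQuery and the terms_set query linked in this thread apply the same comparison with a per-document minimum-match value.

```python
def coverage_match(doc_text: str, query_text: str) -> bool:
    """Return True when every token in the document is hit by the query."""
    doc_tokens = set(doc_text.split())
    query_terms = set(query_text.split())

    # Stand-in for the dedicated "number of tokens" field written at index time.
    token_count = len(doc_tokens)

    # Each query term found in the document contributes a constant score of 1.
    hits = sum(1 for term in query_terms if term in doc_tokens)

    # Keep the document only if at least one term hit and nothing is uncovered.
    return hits > 0 and hits == token_count


print(coverage_match("blip red", "blip red potato"))
print(coverage_match("aardvark blip red", "aardvark blip"))
```

The first call covers both tokens of the document, the second leaves "red" uncovered.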
> > On Mon, Aug 1, 2022 at 6:59 PM Colvin Cowie <colvin.cowie....@gmail.com>
> > wrote:
> >
> >> Hello,
> >>
> >> Maybe the answer to this is obvious and I'm missing something, but here
> >> goes:
> >>
> >> Suppose I have a field which contains a string of one or more tokens from
> >> a set. The set has about 50 possible values, and the values themselves are
> >> arbitrary (though they are known ahead of time, and could be ordered
> >> alphabetically if it helped). e.g.
> >> doc1: "red"
> >> doc2: "blip red"
> >> doc3: "aardvark blip red"
> >> doc4: "aardvark potato"
> >>
> >> I want to query the field for all documents that contain at least one of
> >> the tokens specified in the query *and no tokens that aren't in the
> >> query*.
> >> What's the best query for that?
> >>
> >> For example, querying for
> >>
> >>    - "*red*" should *only* match doc1 above
> >>    - "*blip red*" should match doc1 *and* doc2
> >>    - "*blip red potato*" should also match doc1 and doc2.
> >>    - "*aardvark blip*" would not match any of the documents since neither
> >>    term appears on its own above, and it would need "*red*" as well to
> >>    match doc3.
> >>    - "*aardvark blip red potato*" would match all of the documents.
> >>
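Put another way, a document should match when its token set is a non-empty subset of the query's token set. A minimal Python sketch of that rule, checked against the four example documents (the names `matches`, `search` and `DOCS` are made up for illustration and are not part of any Lucene or Solr API):

```python
DOCS = {
    "doc1": "red",
    "doc2": "blip red",
    "doc3": "aardvark blip red",
    "doc4": "aardvark potato",
}


def matches(doc_text: str, query_text: str) -> bool:
    """True when the document has at least one token and all of them are in the query."""
    doc_tokens = set(doc_text.split())
    query_tokens = set(query_text.split())
    # "<=" on sets is the subset test.
    return bool(doc_tokens) and doc_tokens <= query_tokens


def search(query_text: str) -> list[str]:
    """Names of the example documents that satisfy the rule, in sorted order."""
    return sorted(name for name, text in DOCS.items() if matches(text, query_text))


for query in ("red", "blip red", "blip red potato", "aardvark blip"):
    print(query, "->", search(query))
```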
> >>
> >> Options?
> >>
> >>    1. I could formulate the query to include all the required tokens and
> >>    negate all the other tokens from the set, e.g. "*blip red*" would be
> >>    "*+(blip red) -(aardvark potato....)*", and "*red*" would be
> >>    "*+(red) -(aardvark blip potato...)*"... The size of the set is fixed,
> >>    so the number of terms in the query won't change, just whether they are
> >>    included or excluded. But having to specify all the negations seems
> >>    inefficient.
> >>    2. I could change the way the data is indexed so that the field is
> >>    concatenated deterministically and tokenized as a single value, and query
> >>    for combinations of terms. e.g. "*blip red*" would be "*blip red
> >>    blip-red*", but with more than a handful of terms the fan-out becomes
> >>    significant, e.g. "*aardvark blip red*" becomes "*aardvark blip red
> >>    aardvark-blip aardvark-red blip-red aardvark-blip-red*" and so on, with
> >>    (2^N)-1 combinations.
> >>
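For option 1, the query string can be generated mechanically from the fixed set of values. A rough sketch, assuming a four-value vocabulary in place of the real set of about 50; `VOCABULARY` and `build_exclusion_query` are hypothetical names, and the output is just the "+(...) -(...)" text shown in the examples, not a parsed Lucene query:

```python
VOCABULARY = ["aardvark", "blip", "potato", "red"]  # stand-in for the ~50 known values


def build_exclusion_query(query_terms, vocabulary=VOCABULARY):
    """Require any of the query terms and prohibit every other value in the set."""
    wanted = sorted(set(query_terms))
    if not wanted:
        raise ValueError("at least one query term is required")
    unwanted = sorted(set(vocabulary) - set(wanted))

    query = "+(" + " ".join(wanted) + ")"
    if unwanted:
        # Every value not asked for becomes a prohibited clause.
        query += " -(" + " ".join(unwanted) + ")"
    return query


print(build_exclusion_query(["blip", "red"]))
print(build_exclusion_query(["red"]))
```

The total number of terms is always the size of the vocabulary; only the split between required and prohibited changes.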
> >> So option 1 should be fairly constant regardless of the number of terms but
> >> may be wasteful for low numbers of terms, while option 2 generates >1000
> >> combinations for a query with 10 terms. Is that a problem for Lucene in
> >> practice though? For 20 terms it would create >1 million combinations,
> >> which does sound like a problem, but a query with that many terms may not
> >> be needed.
> >>
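The fan-out for option 2 is easy to check by enumerating the combinations. A small sketch, assuming terms are sorted alphabetically and joined with "-" as in the examples above (`combination_terms` is a made-up helper name):

```python
from itertools import combinations


def combination_terms(query_terms):
    """All non-empty combinations of the sorted query terms, joined with '-'."""
    terms = sorted(set(query_terms))
    return [
        "-".join(combo)
        for size in range(1, len(terms) + 1)
        for combo in combinations(terms, size)
    ]


print(" ".join(combination_terms(["aardvark", "blip", "red"])))

# The count is 2^N - 1: every subset of the N terms except the empty one.
for n in (3, 10, 20):
    print(n, "terms ->", 2**n - 1, "combinations")
```

For 10 terms that is 1023 combinations, and for 20 terms it is 1048575.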
> >> I'm leaning towards 1 - but is it a bad solution? Is there a better option
> >> I'm missing?
> >>
> >> On a related note, does the EnumFieldType enable a more efficient query
> >> than other field types, or does it just provide explicit sorting? i.e.
> >> would a multivalued EFT be better for this?
> >>
> >> Thanks,
> >> Colvin
> >>
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>
