Thanks for the suggestions Mikhail, cheers

On Tue, 2 Aug 2022 at 07:25, Mikhail Khludnev <m...@apache.org> wrote:

> Here are a few discussions
> https://issues.apache.org/jira/browse/LUCENE-7148
> and
> https://lucene.apache.org/core/8_11_0/sandbox/org/apache/lucene/search/CoveringQuery.html
>
> On Tue, Aug 2, 2022 at 1:25 PM Mikhail Khludnev <m...@apache.org> wrote:
>
> > Hi, Colvin.
> > It reminds me of percolator match logic. I've heard of such plugins for
> > Elastic&Solr.
> > Think about min_should_match in dismax - mm.
> > If one indexes a number of words in a dedicated field, then count every
> > term hit via constant score ^=1, sum hits score, then cut off matches with
> > a weak coverage via {!frange} (compare sum of scores to a field with a
> > number of tokens). It was discussed in comments/list years ago. Not sure if
> > we moved toward it already. I also remember that such logic built-in
> > you-know-where
> > https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-terms-set-query.html
> > .
> >
> > On Mon, Aug 1, 2022 at 6:59 PM Colvin Cowie <colvin.cowie....@gmail.com>
> > wrote:
> >
> >> Hello,
> >>
> >> Maybe the answer to this is obvious and I'm missing something, but here
> >> goes:
> >>
> >> Suppose I have a field which contains a string of one or more tokens from
> >> a set. The set has about 50 possible values, and the values themselves are
> >> arbitrary (though they are known ahead of time, and could be ordered
> >> alphabetically if it helped). e.g.
> >> doc1: "red"
> >> doc2: "blip red"
> >> doc3: "aardvark blip red"
> >> doc4: "aardvark potato"
> >>
> >> I want to query the field for all documents that contain at least one of
> >> the tokens specified in the query *and no tokens that aren't in the
> >> query*.
> >> What's the best query for that?
> >>
> >> For example, querying for
> >>
> >>    - "*red*" should *only* match doc1 above
> >>    - "*blip red*" should match doc1 *and* doc2
> >>    - "*blip red potato*" should also match doc1 and doc2.
> >>    - "*aardvark blip*" would not match any of the documents since neither
> >>    term appears on its own above, and it would need "*red*" as well to
> >>    match doc3.
> >>    - "*aardvark blip red potato*" would match all of the documents.
> >>
> >>
> >> Options?
> >>
> >>    1. I could formulate the query to include all the required tokens and
> >>    negate all the other tokens from the set, e.g. "*blip red*" would
> >>    be "*+(blip red) -(aardvark potato....)*", and "*red*" would be
> >>    "*+(red) -(aardvark blip potato...)*"... The size of the set is fixed,
> >>    so the number of terms in the query won't change, just whether they are
> >>    included or excluded. But having to specify all the negations seems
> >>    inefficient.
> >>    2. I could change the way the data is indexed so that the field is
> >>    concatenated deterministically and tokenized as a single value, and
> >>    query for combinations of terms. e.g. "*blip red*" would be "*blip red
> >>    blip-red*", but with more than a handful of terms the fan-out becomes
> >>    significant, e.g. "*aardvark blip red*" becomes "*aardvark blip red
> >>    aardvark-blip aardvark-red blip-red aardvark-blip-red*" and so on,
> >>    with (2^N)-1 combinations.
> >>
> >> So option 1 should be fairly constant regardless of the number of terms
> >> but may be wasteful for low numbers of terms, while option 2 generates
> >> more than 1000 combinations for a query with 10 terms. Is that a problem
> >> for Lucene in practice though? For 20 terms it would create >1 million
> >> combinations, which does sound like a problem, but a query with that many
> >> terms may not be needed.
> >>
> >> I'm leaning towards 1 - but is it a bad solution? Is there a better option
> >> I'm missing?
> >>
> >> On a related note, does the EnumFieldType enable a more efficient query
> >> than other field types, or does it just provide explicit sorting? i.e.
> >> would a multivalued EFT be better for this?
> >>
> >> Thanks,
> >> Colvin
> >>
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>
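
For anyone reading the thread later, here is a rough Python sketch of the matching rule from the original question, the option 1 query string, the option 2 fan-out, and the coverage idea (hit count compared with a stored token count). It is only an illustration under assumptions: VOCAB lists just the four example tokens rather than the real set of about 50, all function names are made up for this note, documents are assumed to hold distinct tokens, and the generated string simply follows the "+(...) -(...)" shape quoted above. It has not been run against Solr or Lucene.

```python
# Hypothetical vocabulary: only the tokens used in the examples in this thread.
VOCAB = ["aardvark", "blip", "potato", "red"]

# The four example documents from the original question.
DOCS = {
    "doc1": "red",
    "doc2": "blip red",
    "doc3": "aardvark blip red",
    "doc4": "aardvark potato",
}


def matches(doc_tokens, query_tokens):
    """True when the document is non-empty and all of its tokens are in the query."""
    doc, query = set(doc_tokens), set(query_tokens)
    return bool(doc) and doc <= query


def coverage_matches(doc_tokens, query_tokens):
    """Same rule via coverage: one point per query term hit, compared with
    the document's token count (assumes the document's tokens are distinct)."""
    doc = set(doc_tokens)
    hits = sum(1 for term in set(query_tokens) if term in doc)
    return hits > 0 and hits == len(doc)


def option1_query(query_tokens, vocab=VOCAB):
    """Build the '+(wanted) -(everything else)' string described in option 1."""
    wanted = sorted(set(query_tokens))
    unwanted = [term for term in vocab if term not in wanted]
    query = "+(" + " ".join(wanted) + ")"
    if unwanted:
        query += " -(" + " ".join(unwanted) + ")"
    return query


def option2_combinations(n_terms):
    """Number of non-empty term combinations option 2 would need: (2^N)-1."""
    return 2 ** n_terms - 1


def search(query, matcher=matches):
    """Return the names of the example documents that satisfy the rule."""
    tokens = query.split()
    return sorted(name for name, text in DOCS.items()
                  if matcher(text.split(), tokens))


if __name__ == "__main__":
    for q in ["red", "blip red", "blip red potato", "aardvark blip"]:
        print(q, "->", search(q), "|", option1_query(q.split()))
    print(option2_combinations(10), option2_combinations(20))
```

The subset test and the coverage test agree on the example documents, which is the reason the coverage approach (and something like CoveringQuery) looks like a fit for this requirement; whether it performs better than the explicit negations in option 1 would need measuring on a real index.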