That's an interesting suggestion Radu, thank you

On Mon, 1 Aug 2022 at 10:24, Radu Gheorghe <radu.gheor...@sematext.com>
wrote:

> Hi Colvin,
>
> You wouldn't normally query with more than e.g. 1K terms at once, because
> the query can get expensive.
>
> Here's a crazy idea: map words to numbers, sorted alphabetically. For
> example:
>
> aardvark - 1
> blip - 2
> potato - 3
> red - 4
>
> When you formulate the query, you do the same translation, sort the terms,
> then search for something like:
> - any of the words
> - negate any ranges between them
>
> For example, if I'm searching for "red potato", then the query will be
> something like:
>
> (3 OR 4) -{* TO 3} -{3 TO 4} -{4 TO *}
>
> Note that I added the 3 to 4 range (exclusive), even though it doesn't make
> sense, because the naive implementation wouldn't check if some numbers are
> consecutive and remove ranges that make no sense. That would be an
> optimization.
>
> Best regards,
> Radu
> --
> Elasticsearch/OpenSearch & Solr Consulting, Production Support & Training
> Sematext Cloud - Full Stack Observability
> https://sematext.com/ <http://sematext.com/>
>
>
> On Mon, Aug 1, 2022 at 11:59 AM Colvin Cowie <colvin.cowie....@gmail.com>
> wrote:
>
> > Hello,
> >
> > Maybe the answer to this is obvious and I'm missing something, but here
> > goes:
> >
> > Suppose I have a field which contains a string of one or more tokens
> from a
> > set. The set has about 50 possible values, and the values themselves are
> > arbitrary (though they are known ahead of time, and could be ordered
> > alphabetically if it helped). e.g.
> > doc1: "red"
> > doc2: "blip red"
> > doc3: "aardvark blip red"
> > doc4: "aardvark potato"
> >
> > I want to query the field for all documents that contain at least one of
> > the tokens specified in the query *and no tokens that aren't in the
> query*.
> > What's the best query for that?
> >
> > For example, querying for
> >
> >    - "*red*" should *only* match doc1 above
> >    - "*blip red*" should match doc1 *and* doc2
> >    - "*blip red potato*" should also match doc1 and doc 2.
> >    - "*aardvark blip*" would not match any of the documents since neither
> >    term appears on its own above, and it would need "*red*" as well to
> >    match doc3.
> >    - "*aardvark blip red potato*" would match all of the documents.
> >
> >
> > Options?
> >
> >    1. I could formulate the query to include all the required tokens and
> >    negate all the other tokens from the set, e.g. "*blip red*" would
> > be "*+(blip
> >    red) -(aardvark potato....)*", and "*red*" would be "*+(red)
> -(aardvark
> >    blip potato...)*"... The size of the set is fixed, so the number of
> >    terms in the query won't change, just whether they are included or
> >    excluded. But having to specify all the negations seems inefficient.
> >    2. I could change the way the data is indexed so that the field is
> >    concatenated deterministically and tokenized as a single value, and
> > query
> >    for combinations of terms. e.g. "*blip red*" would be "*blip red
> >    blip-red*", but with more than a handful of terms the fan-out becomes
> >    significant, e.g. "*aardvark* *blip red*" becomes "*aardvark blip red
> >    aardvark-blip aardvark-red blip-red aardvark-blip-red *" and so on,
> with
> >    (2^N)-1 combinations.
> >
> > So option 1 should be fairly constant regardless of the number of terms
> but
> > may be wasteful for low numbers of terms, while option 2 generates > 1000
> > combinations for a query with 10 terms. Is that a problem for Lucene in
> > practice though? For 20 terms it would create >1 million combinations,
> > which does sound like a problem, but a query with that many terms may not
> > be needed.
> >
> > I'm leaning towards 1 - but is it a bad solution? Is there a better
> option
> > I'm missing?
> >
> > On a related note, does the EnumFieldType enable a more efficient query
> > than other field types, or does it just provide explicit sorting? i.e.
> > would a multivalued EFT be better for this?
> >
> > Thanks,
> > Colvin
> >
>

Reply via email to