That's an interesting suggestion, Radu, thank you.

On Mon, 1 Aug 2022 at 10:24, Radu Gheorghe <radu.gheor...@sematext.com> wrote:
> Hi Colvin,
>
> You wouldn't normally query with more than e.g. 1K terms at once, because
> the query can get expensive.
>
> Here's a crazy idea: map words to numbers, sorted alphabetically. For
> example:
>
> aardvark - 1
> blip - 2
> potato - 3
> red - 4
>
> When you formulate the query, you do the same translation, sort the terms,
> then search for something like:
> - any of the words
> - negate any ranges between them
>
> For example, if I'm searching for "red potato", then the query will be
> something like:
>
> (3 OR 4) -{* TO 3} -{3 TO 4} -{4 TO *}
>
> Note that I added the 3 to 4 range (exclusive), even though it doesn't make
> sense, because the naive implementation wouldn't check if some numbers are
> consecutive and remove ranges that make no sense. That would be an
> optimization.
>
> Best regards,
> Radu
> --
> Elasticsearch/OpenSearch & Solr Consulting, Production Support & Training
> Sematext Cloud - Full Stack Observability
> https://sematext.com/ <http://sematext.com/>
>
>
> On Mon, Aug 1, 2022 at 11:59 AM Colvin Cowie <colvin.cowie....@gmail.com>
> wrote:
>
> > Hello,
> >
> > Maybe the answer to this is obvious and I'm missing something, but here
> > goes:
> >
> > Suppose I have a field which contains a string of one or more tokens
> > from a set. The set has about 50 possible values, and the values
> > themselves are arbitrary (though they are known ahead of time, and could
> > be ordered alphabetically if it helped). e.g.
> > doc1: "red"
> > doc2: "blip red"
> > doc3: "aardvark blip red"
> > doc4: "aardvark potato"
> >
> > I want to query the field for all documents that contain at least one of
> > the tokens specified in the query *and no tokens that aren't in the
> > query*. What's the best query for that?
> >
> > For example, querying for
> >
> > - "*red*" should *only* match doc1 above
> > - "*blip red*" should match doc1 *and* doc2
> > - "*blip red potato*" should also match doc1 and doc2.
> > - "*aardvark blip*" would not match any of the documents since neither
> >   term appears on its own above, and it would need "*red*" as well to
> >   match doc3.
> > - "*aardvark blip red potato*" would match all of the documents.
> >
> >
> > Options?
> >
> > 1. I could formulate the query to include all the required tokens and
> >    negate all the other tokens from the set, e.g. "*blip red*" would
> >    be "*+(blip red) -(aardvark potato....)*", and "*red*" would be
> >    "*+(red) -(aardvark blip potato...)*"... The size of the set is
> >    fixed, so the number of terms in the query won't change, just whether
> >    they are included or excluded. But having to specify all the
> >    negations seems inefficient.
> > 2. I could change the way the data is indexed so that the field is
> >    concatenated deterministically and tokenized as a single value, and
> >    query for combinations of terms. e.g. "*blip red*" would be "*blip red
> >    blip-red*", but with more than a handful of terms the fan-out becomes
> >    significant, e.g. "*aardvark* *blip red*" becomes "*aardvark blip red
> >    aardvark-blip aardvark-red blip-red aardvark-blip-red *" and so on,
> >    with (2^N)-1 combinations.
> >
> > So option 1 should be fairly constant regardless of the number of terms
> > but may be wasteful for low numbers of terms, while option 2 generates
> > >1000 combinations for a query with 10 terms. Is that a problem for
> > Lucene in practice though? For 20 terms it would create >1 million
> > combinations, which does sound like a problem, but a query with that many
> > terms may not be needed.
> >
> > I'm leaning towards 1 - but is it a bad solution? Is there a better
> > option I'm missing?
> >
> > On a related note, does the EnumFieldType enable a more efficient query
> > than other field types, or does it just provide explicit sorting? i.e.
> > would a multivalued EFT be better for this?
> >
> > Thanks,
> > Colvin
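
For reference, the two query formulations discussed in the quoted thread
(option 1's explicit negation, and Radu's number/range mapping) can be
sketched in Python roughly as follows. This is only an illustration under
stated assumptions: the helper names (matches, complement_query,
range_query) are hypothetical, the four-word vocabulary is the one from the
examples rather than the real set of about 50 values, and the query strings
omit any field name, as they do in the thread.

```python
# Hypothetical vocabulary, taken from the examples in the thread.
VOCABULARY = ["aardvark", "blip", "potato", "red"]

# Example documents from the original question.
DOCS = {
    "doc1": ["red"],
    "doc2": ["blip", "red"],
    "doc3": ["aardvark", "blip", "red"],
    "doc4": ["aardvark", "potato"],
}


def matches(doc_tokens, query_tokens):
    """Reference semantics: the document has at least one token,
    and every one of its tokens appears in the query."""
    doc, query = set(doc_tokens), set(query_tokens)
    return bool(doc) and doc <= query


def complement_query(query_tokens, vocabulary=VOCABULARY):
    """Option 1: require any query token, negate every other known token."""
    wanted = sorted(set(query_tokens))
    unwanted = sorted(set(vocabulary) - set(wanted))
    query = "+(" + " ".join(wanted) + ")"
    if unwanted:
        query += " -(" + " ".join(unwanted) + ")"
    return query


def range_query(query_tokens, vocabulary=VOCABULARY):
    """Radu's idea: map tokens to alphabetical ranks, then negate the
    exclusive ranges below, between, and above the queried ranks.
    Naive version: ranges between consecutive ranks are not pruned."""
    rank = {token: i for i, token in enumerate(sorted(vocabulary), start=1)}
    ids = sorted({rank[token] for token in query_tokens})
    bounds = ["*"] + [str(i) for i in ids] + ["*"]
    gaps = ["-{%s TO %s}" % (lo, hi) for lo, hi in zip(bounds, bounds[1:])]
    return "(" + " OR ".join(str(i) for i in ids) + ") " + " ".join(gaps)


def matching_docs(query_tokens, docs=DOCS):
    """Names of the example documents that satisfy the reference semantics."""
    return sorted(name for name, toks in docs.items() if matches(toks, query_tokens))


if __name__ == "__main__":
    print(complement_query(["blip", "red"]))
    print(range_query(["red", "potato"]))
    print(matching_docs(["blip", "red", "potato"]))
```

The matches function is not a query at all, just a plain-Python statement
of the intended semantics that either query string could be checked against.
Both builders produce a number of clauses that grows linearly with the
number of terms, in contrast to the (2^N)-1 fan-out of option 2.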