Hi Stamatis,

One thing that you missed regarding semantics is scoring. While (A B C)~2 and ((+A +B) (+A +C) (+B +C)) would match the same documents, they would produce different scores.
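
To make the scoring difference concrete, here is a toy sketch in plain Java. It is not Lucene code: the class and method names are made up for illustration, and it assumes each clause contributes a fixed score and that a boolean query simply sums the scores of its matching clauses. Under those assumptions the two forms agree when exactly k clauses match and diverge as soon as more than k match, because each clause is then counted once per conjunction it appears in.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model, not part of Lucene.
public class MinShouldMatchSketch {

    // Enumerate every k-element subset of {0..n-1}; each subset stands for
    // one conjunction such as (+A +B) in the expanded form.
    static List<int[]> conjunctions(int n, int k) {
        List<int[]> out = new ArrayList<>();
        collect(n, k, 0, new int[k], 0, out);
        return out;
    }

    private static void collect(int n, int k, int start, int[] current,
                                int depth, List<int[]> out) {
        if (depth == k) {
            out.add(current.clone());
            return;
        }
        for (int i = start; i < n; i++) {
            current[depth] = i;
            collect(n, k, i + 1, current, depth + 1, out);
        }
    }

    // Toy score for (c1 ... cn)~k: sum of the matching clause scores,
    // or 0 when fewer than k clauses match (the document is not a hit).
    static double minShouldMatchScore(double[] scores, boolean[] matches, int k) {
        int matched = 0;
        double sum = 0;
        for (int i = 0; i < scores.length; i++) {
            if (matches[i]) {
                matched++;
                sum += scores[i];
            }
        }
        return matched >= k ? sum : 0;
    }

    // Toy score for the expanded form: every conjunction whose clauses all
    // match contributes the sum of its clause scores.
    static double expandedScore(double[] scores, boolean[] matches, int k) {
        double total = 0;
        for (int[] conjunction : conjunctions(scores.length, k)) {
            boolean allMatch = true;
            double sum = 0;
            for (int i : conjunction) {
                allMatch &= matches[i];
                sum += scores[i];
            }
            if (allMatch) {
                total += sum;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        double[] scores = {1.0, 2.0, 4.0};          // assumed scores for A, B, C
        boolean[] allThree = {true, true, true};    // a document matching A, B and C
        System.out.println(minShouldMatchScore(scores, allThree, 2)); // 7.0
        System.out.println(expandedScore(scores, allThree, 2));       // 14.0
        System.out.println(conjunctions(5, 3).size());                // 10
    }
}
```

The last line also shows how quickly the expanded form grows: matching 3 out of 5 clauses already needs 10 conjunctions.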

Moreover, many users come to this query because it is exactly what they need: matching k out of n clauses. In the example you gave it's pretty simple because there are only 3 clauses, but try to see what the generated query looks like when matching 3 out of 5 clauses: it's already very complex (10 conjunctions of 3 clauses each).

It would be nice if we could rewrite the expanded form into the variant that sets a minimum number of matching should clauses, which should be more efficient. My worry is that it would be quite expensive to do, maybe to the point that it would hurt more than help on average. I'd be very happy to be proven wrong though: if we can cheaply rewrite the expanded form, this would be a good addition.

On Mon, Mar 30, 2020 at 6:06 PM Stamatis Zampetakis <zabe...@gmail.com> wrote:

> Hi all,
>
> My question concerns the method setMinimumNumberShouldMatch in the
> BooleanQuery class.
>
> Let's assume that we have 3 queries (optional clauses), namely A, B, C, and
> we build a BooleanQuery specifying that at least 2 should match.
>
> In terms of semantics, what I understand so far is that
>
> (A B C)~2 is equivalent to ((+A +B) (+A +C) (+B +C)).
>
> In other words, a single BooleanQuery with a min should match parameter
> could be rewritten as a purely disjunctive BooleanQuery composed of 3
> sub-queries.
>
> In terms of performance, it seems that the two queries behave differently,
> so the minMatch property is not only syntactic sugar, and apparently there
> is no rewriting between the two.
>
> Coming from the SQL world, it is a bit hard for me to justify the addition
> of a new operator that looks like syntactic sugar and at the same time is
> more performant than the more primitive equivalents. I looked a bit in [1]
> to understand the motivation for adding this API, but without much success.
>
> Summing everything up in three questions:
> 1. Did I get the semantics of this extra property right, or are there
> things that I am missing?
> (If my understanding is correct)
> 2. What's the reason for introducing the minMatch property in the first
> place? (Avoid creating huge queries?)
> 3. Should the performance of the two queries shown above differ?
>
> Thanks in advance!
>
> Best,
> Stamatis
>
> [1] https://issues.apache.org/jira/browse/LUCENE-395
>

--
Adrien