Thanks Adrien, indeed I missed the difference in the score since in my context I do not need them most of the time.
I will try to check if there is something that can be done with respect to the rewriting and if something promising comes up I will let you know. Best, Stamatis On Mon, Mar 30, 2020, 7:20 PM Adrien Grand <jpou...@gmail.com> wrote: > Hi Stamatis, > > One thing that you missed regarding semantics is scoring. While (A B C)~2 > and ((+A +B) (+A +C) (+B +C)) would match the same documents, they would > produce different scores. > > Moreover, many users come to this query because it is exactly what they > need: matching k out of n clauses. In the example you gave it's pretty > simple because there are only 3 clauses, but try to see what the generated > query looks like when matching 3 out of 5 clauses, it's already very > complex. > > It would be nice if we could rewrite the expanded form into the variant > that sets a minimum number of matching should clauses, which should be more > efficient. My worry is that it would be quite expensive to do, maybe to the > point that it would more hurt than help on average. I'd be very happy to be > proven wrong though, if we can cheaply rewrite the expanded form, this > would be a good addition. > > On Mon, Mar 30, 2020 at 6:06 PM Stamatis Zampetakis <zabe...@gmail.com> > wrote: > > > Hi all, > > > > My question concerns the method setMinimumNumberShouldMatch in > BooleaQuery > > class. > > > > Lets assume that we have 3 queries (optional clauses), namely A, B, C and > > we build a BooleanQuery specifying that at least 2 should match. > > > > In terms of semantics what I understand so far is that > > > > (A B C)~2 is equivalent to ((+A +B) (+A +C) (+B +C)). > > > > In other words a single BooleaQuery with a min should match parameter > could > > be rewritten as pure disjunctive BooleanQuery comprised from 3 > sub-queries. > > > > In terms of performance it seems that the two queries present different > > behavior so the minMatch property is not only syntactic sugar and > > apparently there is no rewriting between the two. > > > > Coming from the SQL world it is a bit hard for me to justify the addition > > of a new operator that looks like syntactic sugar and at the same time is > > more performant than the more primitive equivalents. I looked a bit in > [1] > > to understand motivation for adding this API but without much success. > > > > Summing up everything to three questions: > > 1. Did I get right the semantics of this extra property or there are > things > > that I am missing? > > (If my understanding is correct) > > 2. What's the reason for introducing the minMatch property in the first > > place? (Avoid creating huge queries?) > > 3. Should the performance of the two queries shown above differ? > > > > Thanks in advance! > > > > Best, > > Stamatis > > > > [1] https://issues.apache.org/jira/browse/LUCENE-395 > > > > > -- > Adrien >