On Tuesday 25 September 2007 03:05, Martin Bayly wrote: > > We have an application that performs searches against a Lucene based index > and also against a Windows Desktop Search based index. > > For simple queries we'd like to offer our users a consistent interface that > allows them to build basic Lucene style queries using the 'MUST HAVE' (+), > 'MUST NOT HAVE' (-) and 'SHOULD HAVE' style of operators as this is probably > more intuitive for non 'Boolean Logic' literate users. We would not allow > them to use any grouping (parenthesis). > > Clearly we can pass this directly to Lucene, but for the Windows Desktop > Search we need to translate the Lucene style query into a more traditional > Boolean query. So this is the opposite of the much discussed Boolean Query > to Lucene Query conversion. > > I'm wondering if anyone has ever done this or whether there is a concept > mismatch in there somewhere that will make it difficult to do? > > My thought was that you could take the standard Lucene operators and simply > group them together as follows: > > e.g. (assuming the Lucene default OR operator) > > Lucene: +a +b -c -d e f > > would translate to: > > (a AND b NOT c NOT d) OR (a AND b NOT c NOT d AND (e OR f)) > > If I put this back into Lucene (actually Lucene.NET but hopefully its the > same) I get back: > > (+a +b -c -d)(+a +b -c -d +(e f)) > > which I think is equivalent but not as concise! But I have not tested this > against a big index to see if it's equivalent and I have a suspicion that > Lucene might score the two versions of the Lucene representation > differently. But that's probably not an issue provided the Boolean > representation is semantically equivalent to the first Lucene > representation. > > Anyone ever tried this before or have any comments on whether my 'logic' is > flawed!
Under the hood, the Scorer for a BooleanQuery, BooleanScorer2, does the conversion from + and - to boolean operators in slightly different, but more concise way. It basically maps the boolean query syntax to four operators: AND, OR, ANDNOT, ANDoptional. The mapping is only basical because, as a Scorer, it needs to map to other Scorers, and these are: ConjunctionScorer, DisjunctionSumScorer, ReqExclScorer and ReqOptSumScorer, respectively. For ANDoptional the first subquery is required, and the second one is optional. This is a bit like ANDNOT in which the first query is required, and the second one is prohibited. The addition of ANDoptional/ReqOptSumScorer allows the conciseness, while keeping equivalent semantics. So I think your logic is not flawed, it's just that the traditional set of boolean operators is somehow incomplete for optional subqueries. Mapping + and - to these four operators is not really straightforward, among others because of the coordination factor, and because of the many different possible situations. You're invited to have a look at the source code of BooleanScorer2. There are also test cases for the equivalence of the semantics, see the TestBoolean* classes. Regards, Paul Elschot P.S. When documents may be scored out of order, for some disjunctions (OR), BooleanScorer is used for performance. --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]