Re: Translating Lucene Query Syntax to Traditional Boolean Syntax

Paul Elschot Tue, 25 Sep 2007 00:40:10 -0700

On Tuesday 25 September 2007 03:05, Martin Bayly wrote:
> 
> We have an application that performs searches against a Lucene based index
> and also against a Windows Desktop Search based index.
> 
> For simple queries we'd like to offer our users a consistent interface that
> allows them to build basic Lucene style queries using the 'MUST HAVE' (+),
> 'MUST NOT HAVE' (-) and 'SHOULD HAVE' style of operators as this is probably
> more intuitive for non 'Boolean Logic' literate users.  We would not allow
> them to use any grouping (parenthesis).
> 
> Clearly we can pass this directly to Lucene, but for the Windows Desktop
> Search we need to translate the Lucene style query into a more traditional
> Boolean query.  So this is the opposite of the much discussed Boolean Query
> to Lucene Query conversion.
> 
> I'm wondering if anyone has ever done this or whether there is a concept
> mismatch in there somewhere that will make it difficult to do?
> 
> My thought was that you could take the standard Lucene operators and simply
> group them together as follows:
> 
> e.g. (assuming the Lucene default OR operator)
> 
> Lucene: +a +b -c -d e f
> 
> would translate to:
> 
> (a AND b NOT c NOT d) OR (a AND b NOT c NOT d AND (e OR f))
> 
> If I put this back into Lucene (actually Lucene.NET but hopefully its the
> same) I get back:
> 
> (+a +b -c -d)(+a +b -c -d +(e f))
> 
> which I think is equivalent but not as concise!  But I have not tested this
> against a big index to see if it's equivalent and I have a suspicion that
> Lucene might score the two versions of the Lucene representation
> differently.  But that's probably not an issue provided the Boolean
> representation is semantically equivalent to the first Lucene
> representation.
> 
> Anyone ever tried this before or have any comments on whether my 'logic' is
> flawed!



Under the hood, the Scorer for a BooleanQuery, BooleanScorer2, does the 
conversion from + and - to boolean operators in slightly different, but more 
concise way.

It basically maps the boolean query syntax to four operators: AND, OR,
ANDNOT, ANDoptional.
The mapping is only basical because, as a Scorer, it needs to map to other
Scorers, and these are: ConjunctionScorer, DisjunctionSumScorer,
ReqExclScorer and ReqOptSumScorer, respectively.

For ANDoptional the first subquery is required, and the second one is 
optional. This is a bit like ANDNOT in which the first query is required,
and the second one is prohibited.

The addition of ANDoptional/ReqOptSumScorer allows the conciseness,
while keeping equivalent semantics.
So I think your logic is not flawed, it's just that the traditional set  of 
boolean operators is somehow incomplete for optional subqueries.

Mapping + and - to these four operators is not really straightforward,
among others because of the coordination factor, and because of
the many different possible situations.

You're invited to have a look at the source code of BooleanScorer2.
There are also test cases for the equivalence of the semantics, see the
TestBoolean* classes.


Regards,
Paul Elschot

P.S. When documents may be scored out of order, for some disjunctions
(OR), BooleanScorer is used for performance.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Translating Lucene Query Syntax to Traditional Boolean Syntax

Reply via email to