Hi all,
I'm currently benchmarking Lucene to get an understanding of what
optimisations are available for long queries, and wanted to check what
the recommended approach is.
Unsurprisingly, a naive approach to long queries (just keep adding SHOULD
clauses to a big BooleanQuery) scales close to linearly in the number of
terms, which beyond a certain point isn't good enough.
The obvious solution is to prune the query in order to reduce the number
of documents which need scoring, and this is easy to do, but has the
downside that none of the pruned terms are used for scoring.
In Xapian there's a handy query operator called OP_AND_MAYBE, where only
terms on the left-hand-side are used to select documents, with terms on
the right-hand-side used for scoring only. This performs much better for
long queries if less discriminative terms are moved onto the
right-hand-side.
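To make the intended semantics concrete, here is a minimal sketch in plain Java of the behaviour described above. It uses a toy in-memory index and calls neither Xapian nor Lucene; the class name, the method names and the IDF-like weight are all made up for illustration and are not either library's scoring formula.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy model of "AND_MAYBE": not Xapian or Lucene code.
public class AndMaybeSketch {

    // Toy inverted index: term -> ids of the documents containing it.
    public static Map<String, Set<Integer>> buildIndex(List<String> docs) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : docs.get(id).split(" ")) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    // Assumed stand-in weight: rarer terms contribute more.
    public static double weight(Map<String, Set<Integer>> index, String term, int numDocs) {
        int df = index.getOrDefault(term, Collections.<Integer>emptySet()).size();
        return df == 0 ? 0.0 : Math.log(1.0 + (double) numDocs / df);
    }

    // Left-hand-side terms select and score; right-hand-side terms only score.
    public static Map<Integer, Double> andMaybe(Map<String, Set<Integer>> index, int numDocs,
                                                List<String> select, List<String> scoreOnly) {
        Map<Integer, Double> scores = new TreeMap<>();
        // Selection: any document matching a left-hand-side term is a candidate.
        for (String term : select) {
            double w = weight(index, term, numDocs);
            for (int id : index.getOrDefault(term, Collections.<Integer>emptySet())) {
                scores.merge(id, w, Double::sum);
            }
        }
        // Scoring only: visit the candidates, never the full postings list,
        // so these terms cannot add documents to the result.
        for (String term : scoreOnly) {
            Set<Integer> postings = index.getOrDefault(term, Collections.<Integer>emptySet());
            double w = weight(index, term, numDocs);
            for (int id : new ArrayList<>(scores.keySet())) {
                if (postings.contains(id)) {
                    scores.merge(id, w, Double::sum);
                }
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "lucene query scorer", "lucene index", "xapian query", "query index scorer");
        Map<String, Set<Integer>> index = buildIndex(docs);
        // "lucene" selects documents; "query" and "scorer" only adjust their scores.
        System.out.println(andMaybe(index, docs.size(),
                Arrays.asList("lucene"), Arrays.asList("query", "scorer")));
    }
}
```

In this toy example the work done for each right-hand-side term is bounded by the number of candidate documents rather than by the length of that term's postings list, which is where the saving for common terms would come from.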
I tried to replicate this approach in Lucene using the following query
(in QueryParser syntax):
    +(some mandatory terms) and some other terms for scoring only
The presence of a MUST clause in the outer BooleanQuery forces the
remaining SHOULD clauses to be purely optional and not expand the set of
documents scored, so this has the right semantics. However the
performance benefit isn't there -- in a test with 200 query terms in
total, it quickly becomes slower than a plain flat BooleanQuery once the
number of terms in the mandatory part of the query exceeds 5 or so.
Interestingly, it's much faster (~40ms) when there's only one
mandatory term than when there are two terms in the mandatory clause
(~2500ms), which leads me to suspect an obvious optimisation is being
missed.
Anyone have any ideas on this, pointers to other relevant query types or
optimisations available in Lucene 4, or on which parts of the
Query/Weight/Scorer code we'd need to change to speed up this kind of thing?
Cheers
-Matt