[ 
https://issues.apache.org/jira/browse/LUCENE-6894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001984#comment-15001984
 ] 

Stefan Pohl commented on LUCENE-6894:
-------------------------------------

There are very likely situations where you can still decrease query runtime 
even further with a different order of clauses than the one based on current 
worst-case estimates, and I agree that the naming 'cost()' doesn't really 
reflect the conservative estimates. However, any other non-worst-case estimate 
might err very badly and make queries that are currently reasonably fast 
extremely slow.

It comes down to trading in worst-case behavior to gain average/throughput, but 
usually people care more about the slowest/hardest queries. However, maybe we 
can have worst-case and other estimates too and choose to use the latter only 
in cases where even making the wrong decision won't be too bad, so that you're 
speculative on the fast queries to gain throughput, but conservative on 
potentially slow queries.

> Improve DISI.cost() by assuming independence for match probabilities
> --------------------------------------------------------------------
>
>                 Key: LUCENE-6894
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6894
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Paul Elschot
>            Priority: Minor
>         Attachments: LUCENE-6894.patch
>
>
> The DocIdSetIterator.cost() method returns an estimation of the number of 
> matching docs. Currently conjunctions use the minimum cost, and disjunctions 
> use the sum of the costs, and both are too high.
> The probability of a match is estimated by dividing available cost() by the 
> number of docs in a segment.
> The conjunction probability is then the product of the inputs, and the 
> disjunction probability follows from De Morgan's rule:
> "not (A and B)" is the same as "(not A) or (not B)"
> with the probability for "not" computed as 1 minus the input probability.
> The independence that is assumed is normally not there. However, the cost() 
> results are only used to order the input DISIs/Scorers for optimization, and 
> for that I expect this assumption to work nicely.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to