[jira] [Comment Edited] (LUCENE-6894) Improve DISI.cost() by assuming independence for match probabilities

Paul Elschot (JIRA) Thu, 12 Nov 2015 14:27:23 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-6894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15003039#comment-15003039 ]


Paul Elschot edited comment on LUCENE-6894 at 11/12/15 10:27 PM:
-----------------------------------------------------------------

Another reason why I started this is that the result of cost() is also used as 
weights for matchCost() at LUCENE-6276, and I'd prefer those weights to be as 
accurate as reasonably possible.

I think we can keep this approach (assuming independence for conjunctions and 
disjunctions) in reserve as a possible alternative until the current 
implementation gives a bad result.

For proximity queries (Phrases, Spans) the patch also reduces the conjunction 
cost() according to the allowed slop.
Would it be worthwhile to open a separate issue for that?



was (Author: [email protected]):
Another reason why I started this is that the result of cost() is also used as 
weights for matchCost() at LUCENE-6276, and I'd prefer those weights to be as 
accurate as reasonably possible.

I think we can keep this alternative (assuming independence for conjunctions 
and disjunctions) as a possible alternative until the current implementation 
gives a bad result.

For the proximity queries (Phrases, Spans) this reduces the conjunction cost() 
using the allowed slop.
Would it be worthwhile to open a separate issue for that?


> Improve DISI.cost() by assuming independence for match probabilities
> --------------------------------------------------------------------
>
>                 Key: LUCENE-6894
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6894
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Paul Elschot
>            Priority: Minor
>         Attachments: LUCENE-6894.patch
>
>
> The DocIdSetIterator.cost() method returns an estimate of the number of 
> matching docs. Currently conjunctions use the minimum of the input costs and 
> disjunctions use the sum of the input costs, and both estimates are too high.
> The probability of a match is estimated by dividing the available cost() by 
> the number of docs in a segment.
> The conjunction probability is then the product of the input probabilities, 
> and the disjunction probability follows from De Morgan's rule:
> "not (A and B)" is the same as "(not A) or (not B)"
> with the probability for "not" computed as 1 minus the input probability.
> Equivalently, "A or B" is "not ((not A) and (not B))", so the disjunction 
> probability is 1 minus the product of the inputs' "not" probabilities.
> The assumed independence normally does not hold. However, the cost() 
> results are only used to order the input DISIs/Scorers for optimization, and 
> for that I expect this assumption to work nicely.
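
The estimate described above can be sketched roughly as follows. This is only an illustration of the arithmetic, not the code in LUCENE-6894.patch: the class name IndependentCostSketch, the method names, and the maxDoc parameter (the number of docs in the segment) are all hypothetical, and clamping the probability to 1.0 and rounding the result are assumptions of this sketch.

```java
// Hypothetical sketch; none of these names come from the Lucene code base.
final class IndependentCostSketch {

    // Estimated match probability: cost() divided by the docs in the segment.
    // Clamped to 1.0 because cost() may overestimate (an assumption here).
    static double matchProbability(long cost, int maxDoc) {
        if (maxDoc <= 0) {
            return 0.0;
        }
        return Math.min(1.0, (double) cost / maxDoc);
    }

    // Conjunction: product of the input probabilities, scaled back to docs.
    static long conjunctionCost(long[] costs, int maxDoc) {
        double p = 1.0;
        for (long c : costs) {
            p *= matchProbability(c, maxDoc);
        }
        return Math.round(p * maxDoc);
    }

    // Disjunction via De Morgan: 1 minus the product of the "not" probabilities.
    static long disjunctionCost(long[] costs, int maxDoc) {
        double noMatch = 1.0;
        for (long c : costs) {
            noMatch *= 1.0 - matchProbability(c, maxDoc);
        }
        return Math.round((1.0 - noMatch) * maxDoc);
    }
}
```

For two inputs with costs 100 and 500 in a segment of 1000 docs, this sketch gives 50 for the conjunction (against the minimum, 100) and 550 for the disjunction (against the sum, 600). Since the results are only meant to order the inputs, the relative sizes matter more than the absolute values.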



