[jira] [Commented] (LUCENE-6894) Improve DISI.cost() by assuming independence for match probabilities

Paul Elschot (JIRA) Sun, 15 Nov 2015 13:59:36 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-6894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15006052#comment-15006052
 ]


Paul Elschot commented on LUCENE-6894:
--------------------------------------

Here is the benchmark output, for future reference:
{code}
                    Task    QPS baseline      StdDev   QPS my_modified_version      StdDev                Pct diff
                HighTerm          178.20      (1.8%)                    174.50      (5.6%)   -2.1% (  -9% -    5%)
                 MedTerm          641.36      (1.6%)                    630.32      (4.9%)   -1.7% (  -8% -    4%)
              OrHighHigh           57.02      (5.5%)                     56.32      (6.6%)   -1.2% ( -12% -   11%)
               OrHighMed          107.80      (5.2%)                    106.89      (6.2%)   -0.8% ( -11% -   11%)
             AndHighHigh          100.02      (2.3%)                     99.34      (0.7%)   -0.7% (  -3% -    2%)
                 LowTerm         2477.28      (3.0%)                   2463.27      (5.5%)   -0.6% (  -8% -    8%)
              AndHighMed          627.58      (1.5%)                    625.22      (1.2%)   -0.4% (  -3% -    2%)
              HighPhrase           81.21      (4.2%)                     80.98      (4.3%)   -0.3% (  -8% -    8%)
               OrHighLow          136.70      (3.1%)                    136.35      (2.1%)   -0.3% (  -5% -    5%)
               LowPhrase          181.55      (2.2%)                    181.09      (2.0%)   -0.3% (  -4% -    4%)
         MedSloppyPhrase           56.03      (2.9%)                     55.93      (3.1%)   -0.2% (  -5% -    5%)
             MedSpanNear           52.77      (1.7%)                     52.68      (2.6%)   -0.2% (  -4% -    4%)
         LowSloppyPhrase          106.15      (2.9%)                    106.01      (3.1%)   -0.1% (  -6% -    6%)
               MedPhrase           39.38      (3.8%)                     39.36      (3.3%)   -0.1% (  -6% -    7%)
                  Fuzzy1          137.14      (2.1%)                    137.06      (1.5%)   -0.1% (  -3% -    3%)
                  Fuzzy2           79.28      (1.9%)                     79.25      (1.5%)   -0.0% (  -3% -    3%)
             LowSpanNear           94.38      (1.7%)                     94.35      (2.8%)   -0.0% (  -4% -    4%)
            OrNotHighMed          444.12      (1.7%)                    444.36      (1.2%)    0.1% (  -2% -    2%)
              AndHighLow         1878.59      (2.0%)                   1880.20      (1.9%)    0.1% (  -3% -    4%)
                 Respell          106.47      (1.9%)                    106.62      (1.7%)    0.1% (  -3% -    3%)
            OrNotHighLow         1831.85      (1.7%)                   1834.68      (1.3%)    0.2% (  -2% -    3%)
           OrNotHighHigh           69.75      (1.6%)                     69.91      (1.4%)    0.2% (  -2% -    3%)
            HighSpanNear           36.38      (2.8%)                     36.47      (3.8%)    0.3% (  -6% -    7%)
        HighSloppyPhrase           45.58      (3.6%)                     45.70      (3.5%)    0.3% (  -6% -    7%)
            OrHighNotLow           65.78      (7.0%)                     66.03      (8.4%)    0.4% ( -14% -   16%)
                 Prefix3          448.85      (3.5%)                    450.67      (3.8%)    0.4% (  -6% -    8%)
                Wildcard          114.35      (4.8%)                    115.02      (4.6%)    0.6% (  -8% -   10%)
                  IntNRQ           23.48      (7.4%)                     23.71      (7.7%)    1.0% ( -13% -   17%)
                PKLookup          360.70      (1.7%)                    364.91      (3.1%)    1.2% (  -3% -    6%)
            OrHighNotMed          178.99      (7.2%)                    181.91      (8.2%)    1.6% ( -12% -   18%)
           OrHighNotHigh           39.78      (7.1%)                     40.63      (7.5%)    2.1% ( -11% -   18%)

{code}

> Improve DISI.cost() by assuming independence for match probabilities
> --------------------------------------------------------------------
>
>                 Key: LUCENE-6894
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6894
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Paul Elschot
>            Priority: Minor
>         Attachments: LUCENE-6894.patch, LUCENE-6894.patch
>
>
> The DocIdSetIterator.cost() method returns an estimate of the number of 
> matching docs. Currently conjunctions use the minimum cost and disjunctions 
> use the sum of the costs, and both estimates are too high.
> The probability of a match is estimated by dividing the available cost() by 
> the number of docs in a segment.
> The conjunction probability is then the product of the input probabilities, 
> and the disjunction probability follows from De Morgan's rule:
> "not (A and B)" is the same as "(not A) or (not B)"
> with the probability for "not" computed as 1 minus the input probability.
> The independence that is assumed normally does not hold. However, the cost() 
> results are only used to order the input DISIs/Scorers for optimization, and 
> for that I expect this assumption to work nicely.
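
To make the estimate described in the issue concrete, here is a minimal sketch of the arithmetic in Java. It is not the code from the attached patch: the class name IndependentCostSketch, its method names, and the clamping and rounding choices are assumptions made for illustration only. The disjunction uses the dual form of the rule quoted above, that is "A or B" is the same as "not ((not A) and (not B))".

```java
// Hypothetical illustration only; not taken from LUCENE-6894.patch.
final class IndependentCostSketch {

    // Estimated probability that a doc in the segment matches, given an
    // input's cost() and the number of docs in the segment (maxDoc).
    // Clamped to 1.0 because cost() may overestimate (an assumption here).
    static double matchProbability(long cost, int maxDoc) {
        if (maxDoc <= 0) {
            return 0.0;
        }
        return Math.min(1.0, (double) cost / maxDoc);
    }

    // Conjunction: product of the input probabilities, scaled back to docs.
    static long conjunctionCost(long[] costs, int maxDoc) {
        double p = 1.0;
        for (long cost : costs) {
            p *= matchProbability(cost, maxDoc);
        }
        return Math.round(p * maxDoc);
    }

    // Disjunction: 1 minus the probability that no input matches,
    // where "no match" for one input is 1 minus its match probability.
    static long disjunctionCost(long[] costs, int maxDoc) {
        double noMatch = 1.0;
        for (long cost : costs) {
            noMatch *= 1.0 - matchProbability(cost, maxDoc);
        }
        return Math.round((1.0 - noMatch) * maxDoc);
    }
}
```

For a segment of 1000 docs and two inputs with costs 100 and 200, this sketch gives 20 for the conjunction (the minimum cost would be 100) and 280 for the disjunction (the sum would be 300).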



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

