[jira] [Updated] (LUCENE-6184) BooleanScorer should better deal with sparse clauses

Adrien Grand (JIRA) Thu, 15 Jan 2015 07:17:10 -0800

     [ 
https://issues.apache.org/jira/browse/LUCENE-6184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Adrien Grand updated LUCENE-6184:
---------------------------------
    Attachment: LUCENE-6184.patch

Here is a patch:
 - BulkScorer now returns a hint on the next matching doc after {{max}}
 - BooleanScorer uses this information in order to only score windows of 
documents where at least one clause matches (by putting the bulk scorers into a 
priority queue)
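
To illustrate the idea in the two points above, here is a minimal sketch. It is not the code from the patch: {{SparseClause}}, {{scoreWindows}} and {{naiveWindowCount}} are hypothetical names, and the clause class only mimics a bulk scorer that returns a hint on the next matching doc. Clauses are kept in a priority queue ordered by their next match, so only windows of 2048 docs that contain at least one match get scored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class WindowedDisjunctionSketch {
    static final int WINDOW = 2048; // window size from the issue description
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical stand-in for one clause's bulk scorer (not Lucene's API). */
    static final class SparseClause {
        private final int[] docs; // matching doc IDs, sorted ascending
        private int pos = 0;

        SparseClause(int... sortedDocs) {
            this.docs = sortedDocs;
        }

        /** Next doc this clause has not collected yet, or NO_MORE_DOCS. */
        int next() {
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }

        /** Collects matches below max, then returns a hint: the next match at or after max. */
        int score(BitSet matches, int max) {
            while (pos < docs.length && docs[pos] < max) {
                matches.set(docs[pos++]);
            }
            return next();
        }
    }

    /** Scores only windows that contain a match; returns how many windows were scored. */
    static int scoreWindows(List<SparseClause> clauses, BitSet matches) {
        PriorityQueue<SparseClause> queue =
            new PriorityQueue<>(Comparator.comparingInt(SparseClause::next));
        for (SparseClause clause : clauses) {
            if (clause.next() != NO_MORE_DOCS) {
                queue.add(clause);
            }
        }
        int windowsScored = 0;
        while (!queue.isEmpty()) {
            // Jump straight to the window holding the smallest pending match.
            int windowMin = queue.peek().next() / WINDOW * WINDOW;
            int windowMax = windowMin + WINDOW;
            List<SparseClause> inWindow = new ArrayList<>();
            while (!queue.isEmpty() && queue.peek().next() < windowMax) {
                inWindow.add(queue.poll());
            }
            for (SparseClause clause : inWindow) {
                // Re-queue the clause under its hint unless it is exhausted.
                if (clause.score(matches, windowMax) != NO_MORE_DOCS) {
                    queue.add(clause);
                }
            }
            windowsScored++;
        }
        return windowsScored;
    }

    /** Number of windows visited when every window is scored unconditionally. */
    static int naiveWindowCount(int maxDoc) {
        return (maxDoc + WINDOW - 1) / WINDOW;
    }

    public static void main(String[] args) {
        BitSet matches = new BitSet();
        int scored = scoreWindows(Arrays.asList(
            new SparseClause(5, 3000, 9_000_000),
            new SparseClause(7, 9_000_001)), matches);
        System.out.println("windows scored: " + scored);
        System.out.println("naive windows:  " + naiveWindowCount(10_000_000));
        System.out.println("matching docs:  " + matches.cardinality());
    }
}
```

With the two sample clauses in {{main}}, this scores 3 windows, whereas visiting every window of a 10M-doc index would mean 4883 windows.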

This helps boolean queries with dense clauses, since it made it possible to 
remove the {{hasMatches}} optimization. That optimization avoided iterating 
over the bit set when there were no matches, but had the drawback of making 
OrCollector.collect heavier.

And this helps boolean queries with very sparse clauses since they now only 
collect windows where they have matches.

Here are the results of the luceneutil benchmark on the 10M Wikipedia corpus. I 
added some tasks to test sparse clauses: "VeryLow" is for term queries that have 
a doc freq between 400 and 500, and "VeryLowVeryLow" is a disjunction of 2 such 
terms:

{code}
                    Task  QPS baseline    StdDev   QPS patch      StdDev   Pct diff
        HighSloppyPhrase       32.70      (4.3%)       32.39      (4.0%)   -1.0% (  -8% -    7%)
                 Prefix3      162.73      (5.8%)      161.32      (6.6%)   -0.9% ( -12% -   12%)
                 LowTerm      803.22      (6.2%)      797.47      (6.2%)   -0.7% ( -12% -   12%)
                  IntNRQ       13.84      (6.9%)       13.75      (7.3%)   -0.7% ( -13% -   14%)
            OrHighNotLow       60.36      (2.7%)       59.96      (3.9%)   -0.7% (  -7% -    6%)
         LowSloppyPhrase       17.94      (3.0%)       17.82      (2.8%)   -0.7% (  -6% -    5%)
                 VeryLow     6095.14      (5.8%)     6057.73      (5.0%)   -0.6% ( -10% -   10%)
               LowPhrase      276.59      (2.2%)      274.97      (1.6%)   -0.6% (  -4% -    3%)
            OrHighNotMed       43.56      (2.6%)       43.32      (3.3%)   -0.6% (  -6% -    5%)
            OrNotHighLow      924.37      (2.5%)      919.21      (2.4%)   -0.6% (  -5% -    4%)
              AndHighLow      703.38      (2.9%)      699.62      (3.6%)   -0.5% (  -6% -    6%)
                Wildcard       93.74      (3.1%)       93.29      (3.0%)   -0.5% (  -6% -    5%)
         MedSloppyPhrase       79.24      (2.8%)       78.91      (2.3%)   -0.4% (  -5% -    4%)
            OrNotHighMed      207.14      (2.0%)      206.31      (2.2%)   -0.4% (  -4% -    3%)
            HighSpanNear       12.56      (0.9%)       12.53      (1.1%)   -0.2% (  -2% -    1%)
              HighPhrase       13.58      (2.3%)       13.55      (2.1%)   -0.2% (  -4% -    4%)
           OrHighNotHigh       33.29      (1.6%)       33.24      (2.0%)   -0.2% (  -3% -    3%)
           OrNotHighHigh       56.10      (1.6%)       56.00      (1.8%)   -0.2% (  -3% -    3%)
                HighTerm       91.52      (2.6%)       91.37      (2.7%)   -0.2% (  -5% -    5%)
                 Respell       71.63      (5.5%)       71.52      (5.3%)   -0.1% ( -10% -   11%)
             LowSpanNear       18.17      (1.0%)       18.16      (0.8%)   -0.1% (  -1% -    1%)
                 MedTerm      146.69      (2.5%)      146.56      (3.0%)   -0.1% (  -5% -    5%)
              AndHighMed      274.22      (2.6%)      274.00      (2.3%)   -0.1% (  -4% -    4%)
             MedSpanNear       31.01      (0.9%)       31.00      (1.1%)   -0.0% (  -1% -    1%)
             AndHighHigh       77.34      (1.8%)       77.32      (1.7%)   -0.0% (  -3% -    3%)
               MedPhrase       19.10      (6.2%)       19.10      (6.2%)    0.0% ( -11% -   13%)
                  Fuzzy2       26.84      (6.8%)       26.88      (7.6%)    0.1% ( -13% -   15%)
                PKLookup      272.91      (3.1%)      274.16      (2.7%)    0.5% (  -5% -    6%)
               OrHighMed       59.25     (11.8%)       62.90      (6.5%)    6.2% ( -10% -   27%)
               OrHighLow       64.54     (11.9%)       68.73      (6.5%)    6.5% ( -10% -   28%)
              OrHighHigh       42.89     (12.2%)       45.77      (6.9%)    6.7% ( -11% -   29%)
                  Fuzzy1       95.20      (4.2%)      101.65      (5.9%)    6.8% (  -3% -   17%)
          VeryLowVeryLow     1936.31      (3.2%)     2263.44      (3.3%)   16.9% (  10% -   24%)
{code}

> BooleanScorer should better deal with sparse clauses
> ----------------------------------------------------
>
>                 Key: LUCENE-6184
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6184
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: Trunk, 5.1
>
>         Attachments: LUCENE-6184.patch
>
>
> The way that BooleanScorer works looks like this:
> {code}
> for each (window of 2048 docs) {
>   for each (optional scorer) {
>     scorer.score(window)
>   }
> }
> {code}
> This is not efficient for very sparse clauses (doc freq much lower than 
> maxDoc/2048) since we keep on scoring windows of documents that do not match 
> anything. BooleanScorer2 currently performs better in those cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

