[PR] Bump the window size of disjunction from 2,048 to 4,096. [lucene]

via GitHub Wed, 24 Jul 2024 00:36:09 -0700


jpountz opened a new pull request, #13605:
URL: https://github.com/apache/lucene/pull/13605


   It has been pointed out multiple times that one difference between Tantivy
and Lucene is the window size: Tantivy uses windows of 4,096 docs, while Lucene
uses windows half that size (2,048 docs), and this might explain part of the
performance difference. luceneutil suggests that bumping the window size to
4,096 does indeed improve performance for counting queries, but not for top-k
queries. I still suggest bumping the window size across the board to keep
our disjunction scorers consistent.
   
   ```
                          Task  QPS baseline    StdDev  QPS my_modified_version    StdDev  Pct diff                 p-value
                   CountPhrase          3.27   (11.6%)                     3.14    (8.0%)     -4.1% ( -21% -   17%)   0.189
             HighTermMonthSort       3521.28    (3.5%)                  3481.74    (2.8%)     -1.1% (  -7% -    5%)   0.262
                      PKLookup        289.42    (1.3%)                   286.47    (2.2%)     -1.0% (  -4% -    2%)   0.075
                    TermDTSort        352.01    (6.5%)                   348.89    (5.6%)     -0.9% ( -12% -   11%)   0.642
                        Phrase         11.85    (5.3%)                    11.76    (5.0%)     -0.8% ( -10% -    9%)   0.634
                     OrHighLow        772.82    (2.4%)                   767.24    (2.1%)     -0.7% (  -5% -    3%)   0.313
               CountAndHighMed        120.78    (2.3%)                   120.10    (2.5%)     -0.6% (  -5% -    4%)   0.449
         HighTermDayOfYearSort        821.48    (3.5%)                   818.62    (2.7%)     -0.3% (  -6% -    6%)   0.724
             HighTermTitleSort        148.84    (2.9%)                   148.33    (2.8%)     -0.3% (  -5% -    5%)   0.700
                   AndHighHigh         62.36    (1.7%)                    62.17    (1.8%)     -0.3% (  -3% -    3%)   0.584
              CountAndHighHigh         41.41    (2.5%)                    41.34    (2.6%)     -0.2% (  -5% -    5%)   0.836
                        Fuzzy1         96.24    (1.0%)                    96.09    (1.2%)     -0.2% (  -2% -    2%)   0.667
                    AndHighLow        827.59    (2.7%)                   826.89    (2.4%)     -0.1% (  -5% -    5%)   0.918
                    AndHighMed         93.35    (1.6%)                    93.29    (1.7%)     -0.1% (  -3% -    3%)   0.903
          HighTermTitleBDVSort         16.30    (4.2%)                    16.29    (6.7%)     -0.0% ( -10% -   11%)   0.984
                     OrHighMed        153.42    (2.6%)                   153.41    (2.2%)     -0.0% (  -4% -    4%)   0.994
                       Respell         46.72    (1.3%)                    46.72    (1.4%)      0.0% (  -2% -    2%)   0.975
                     And3Terms        155.73    (2.2%)                   155.95    (1.4%)      0.1% (  -3% -    3%)   0.805
                        Fuzzy2         58.66    (0.9%)                    58.77    (1.1%)      0.2% (  -1% -    2%)   0.566
                    OrHighHigh         75.70    (2.6%)                    75.90    (2.3%)      0.3% (  -4% -    5%)   0.733
                     CountTerm       9110.00    (4.3%)                  9142.10    (3.2%)      0.4% (  -6% -    8%)   0.768
                  AndStopWords         29.47    (2.6%)                    29.57    (1.3%)      0.4% (  -3% -    4%)   0.579
           And2Terms2StopWords        150.30    (2.1%)                   150.86    (1.1%)      0.4% (  -2% -    3%)   0.487
                    OrHighRare        237.33    (5.7%)                   238.26    (6.2%)      0.4% ( -10% -   13%)   0.837
                       MedTerm        553.55    (6.0%)                   555.97    (7.7%)      0.4% ( -12% -   15%)   0.841
                      Wildcard         34.08    (3.2%)                    34.25    (3.4%)      0.5% (  -5% -    7%)   0.630
                  OrNotHighLow        761.70    (3.2%)                   766.33    (2.6%)      0.6% (  -5% -    6%)   0.511
            Or2Terms2StopWords        156.10    (3.2%)                   157.14    (1.8%)      0.7% (  -4% -    5%)   0.416
                      Or3Terms        156.59    (3.0%)                   157.70    (1.9%)      0.7% (  -4% -    5%)   0.374
                      HighTerm        440.27    (5.6%)                   443.89    (7.5%)      0.8% ( -11% -   14%)   0.695
                       LowTerm        892.27    (5.2%)                   900.48    (6.8%)      0.9% ( -10% -   13%)   0.632
                   OrStopWords         31.88    (4.7%)                    32.29    (2.6%)      1.3% (  -5% -    9%)   0.276
                       Prefix3        214.22    (3.4%)                   217.48    (2.8%)      1.5% (  -4% -    8%)   0.124
                 OrHighNotHigh        247.52    (4.8%)                   254.52    (5.1%)      2.8% (  -6% -   13%)   0.071
                        IntNRQ        144.53   (17.2%)                   148.66   (17.9%)      2.9% ( -27% -   45%)   0.607
                  OrNotHighMed        330.23    (6.5%)                   340.12    (5.4%)      3.0% (  -8% -   15%)   0.114
                  OrHighNotMed        285.11    (5.2%)                   293.82    (6.2%)      3.1% (  -7% -   15%)   0.092
                  OrHighNotLow        429.94    (5.4%)                   443.15    (6.8%)      3.1% (  -8% -   16%)   0.113
                 OrNotHighHigh        189.30    (5.9%)                   195.25    (5.4%)      3.1% (  -7% -   15%)   0.079
                CountOrHighMed         99.90   (22.5%)                   121.78   (20.0%)     21.9% ( -16% -   83%)   0.001
               CountOrHighHigh         53.76   (35.1%)                    70.24   (32.5%)     30.6% ( -27% -  151%)   0.004
   ```
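
To make the notion of a "window" concrete, here is a minimal sketch of counting the matches of a disjunction one window at a time. It is an illustration only, not Lucene's actual scorer code: the class name `WindowedDisjunctionSketch`, the method `countMatches`, and the representation of each clause as a sorted `int[]` of doc IDs are all assumptions made for this example. Each clause marks its matching docs in a bitset that covers the current window, and the window is then counted with `Long.bitCount`. The count is the same for any window size; a larger window only means fewer passes over the clauses for the same doc ID range.

```java
import java.util.Arrays;

/** Hypothetical illustration of window-based disjunction counting; not Lucene code. */
class WindowedDisjunctionSketch {

  /**
   * Counts the distinct doc IDs matched by at least one clause.
   * Assumes each clause is a sorted array of doc IDs in [0, maxDoc)
   * and that windowSize is a positive multiple of 64.
   */
  static int countMatches(int[][] clauses, int maxDoc, int windowSize) {
    long[] window = new long[windowSize / 64]; // one bit per doc in the window
    int[] cursors = new int[clauses.length];   // next unread position per clause
    int count = 0;
    for (int base = 0; base < maxDoc; base += windowSize) {
      int end = Math.min(base + windowSize, maxDoc); // exclusive upper bound
      Arrays.fill(window, 0L);
      // Every clause marks its matches that fall inside [base, end).
      for (int c = 0; c < clauses.length; c++) {
        int[] docs = clauses[c];
        int i = cursors[c];
        while (i < docs.length && docs[i] < end) {
          int offset = docs[i] - base;
          window[offset >>> 6] |= 1L << offset; // long shifts use the low 6 bits
          i++;
        }
        cursors[c] = i;
      }
      // A doc matched by several clauses sets the same bit, so it is counted once.
      for (long word : window) {
        count += Long.bitCount(word);
      }
    }
    return count;
  }

  public static void main(String[] args) {
    int[][] clauses = {{1, 5, 2047, 2048, 5000}, {5, 2048, 4095, 4096}};
    System.out.println(countMatches(clauses, 8192, 2048));
    System.out.println(countMatches(clauses, 8192, 4096));
  }
}
```

Both calls in `main` return the same count; only the number of windows visited differs (4 with 2,048-doc windows, 2 with 4,096-doc windows for a `maxDoc` of 8,192).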
   
   


