[PR] Inline skip data into postings lists [lucene]

via GitHub Thu, 18 Jul 2024 07:06:20 -0700


jpountz opened a new pull request, #13585:
URL: https://github.com/apache/lucene/pull/13585


   This updates the postings format in order to inline skip data into postings. 
This format is generally similar to the current `Lucene99PostingsFormat`, e.g. 
it shares the same block encoding logic, but it has a few differences:
    - Skip data is inlined into postings to make the access pattern more 
sequential.
    - There are only 2 levels of skip data: on every block (128 docs) and every 
32 blocks (4096 docs).
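
To make the two skip levels concrete, here is a minimal Python model of a postings stream with skip entries inlined ahead of the data they cover. It is only a sketch: the stream layout, the `encode` and `advance` helpers, and the use of a plain list of integers are all assumptions made for illustration, not Lucene's on-disk format or API. Only the two intervals (128 docs per block, 32 blocks per higher-level entry) come from the description above.

```python
from typing import List, Optional

BLOCK_SIZE = 128                             # docs per block (from the PR text)
BLOCKS_PER_GROUP = 32                        # blocks per higher-level skip entry
GROUP_SIZE = BLOCK_SIZE * BLOCKS_PER_GROUP   # 4096 docs


def encode(doc_ids: List[int]) -> List[int]:
    """Flatten sorted doc IDs into one stream with skip entries inlined.

    Hypothetical layout (an assumption, not Lucene's format):
      group := [last_doc_in_group, group_body_len] block*
      block := [last_doc_in_block, block_len] doc*
    """
    stream: List[int] = []
    for g in range(0, len(doc_ids), GROUP_SIZE):
        group = doc_ids[g:g + GROUP_SIZE]
        body: List[int] = []
        for b in range(0, len(group), BLOCK_SIZE):
            block = group[b:b + BLOCK_SIZE]
            body += [block[-1], len(block)] + block
        stream += [group[-1], len(body)] + body
    return stream


def advance(stream: List[int], target: int) -> Optional[int]:
    """Return the first doc ID >= target, reading the stream front to back."""
    pos = 0
    while pos < len(stream):
        group_last, body_len = stream[pos], stream[pos + 1]
        pos += 2
        if group_last < target:
            pos += body_len        # higher level: jump over up to 4096 docs
            continue
        end = pos + body_len
        while pos < end:
            block_last, block_len = stream[pos], stream[pos + 1]
            pos += 2
            if block_last < target:
                pos += block_len   # block level: jump over up to 128 docs
                continue
            # block_last >= target, so this block must contain the answer.
            return next(d for d in stream[pos:pos + block_len] if d >= target)
    return None                    # target is past the last doc


docs = list(range(0, 30000, 3))    # 10000 docs: two full groups plus a partial one
stream = encode(docs)
print(advance(stream, 10000))
```

In this model the reader only ever moves forward through one stream, which is the "more sequential" access pattern. The cost is also visible: a far-away target needs one hop per 4096 docs, since there is no higher level to jump further.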
   
   In general, I found that inlining skip data may slightly slow down queries 
that don't need skip data at all (e.g. `CountOrXXX` tasks, which never advance 
or consult impacts) and slightly speed up queries that advance by small 
intervals. Because the highest level only allows skipping 4096 docs at once, 
advancing by large intervals is slower, but the data suggests that this doesn't 
significantly hurt performance. Phrase queries and term queries sorted by field 
are slower for reasons that I don't understand yet.
   
   These results were produced on wikibigall without inter-segment concurrency.
   
   ```
                    Task  QPS baseline   StdDev  QPS my_modified_version   StdDev Pct diff                  p-value
       HighTermTitleSort        152.82   (1.3%)                   105.67   (0.9%)   -30.9% ( -32% -  -29%)   0.000
                  Phrase         11.67   (5.2%)                    10.13   (4.2%)   -13.2% ( -21% -   -4%)   0.000
         CountOrHighHigh         56.79  (33.3%)                    49.41  (21.1%)   -13.0% ( -50% -   62%)   0.141
       HighTermMonthSort       3598.70   (3.2%)                  3372.04   (2.7%)    -6.3% ( -11% -    0%)   0.000
          CountOrHighMed        104.44  (21.2%)                    99.90  (18.1%)    -4.3% ( -36% -   44%)   0.486
                Wildcard         54.26   (3.0%)                    52.23   (2.6%)    -3.7% (  -9% -    1%)   0.000
              TermDTSort        349.67   (6.0%)                   339.57   (4.3%)    -2.9% ( -12% -    7%)   0.081
                  IntNRQ        113.09  (21.2%)                   110.12  (21.6%)    -2.6% ( -37% -   51%)   0.699
               CountTerm       9104.21   (4.1%)                  8870.31   (6.0%)    -2.6% ( -12% -    7%)   0.115
                 Prefix3        296.80   (1.9%)                   290.04   (2.0%)    -2.3% (  -6% -    1%)   0.000
                HighTerm        383.13   (5.2%)                   377.50   (7.5%)    -1.5% ( -13% -   11%)   0.472
                PKLookup        286.07   (1.5%)                   281.91   (2.1%)    -1.5% (  -4% -    2%)   0.012
   HighTermDayOfYearSort        758.57   (2.6%)                   748.44   (2.9%)    -1.3% (  -6% -    4%)   0.121
    HighTermTitleBDVSort         13.27   (4.9%)                    13.13   (6.2%)    -1.1% ( -11% -   10%)   0.546
                  Fuzzy1         98.52   (1.7%)                    97.67   (2.1%)    -0.9% (  -4% -    3%)   0.154
             AndHighHigh         62.93   (1.9%)                    62.46   (1.5%)    -0.7% (  -4% -    2%)   0.164
                  Fuzzy2         62.42   (1.5%)                    61.96   (1.9%)    -0.7% (  -4% -    2%)   0.184
                 Respell         49.68   (1.3%)                    49.39   (1.5%)    -0.6% (  -3% -    2%)   0.171
      Or2Terms2StopWords        157.28   (1.7%)                   157.04   (1.7%)    -0.2% (  -3% -    3%)   0.777
              OrHighHigh         72.02   (1.7%)                    72.21   (1.8%)     0.3% (  -3% -    3%)   0.642
            AndStopWords         29.81   (2.2%)                    29.94   (1.7%)     0.4% (  -3% -    4%)   0.495
     And2Terms2StopWords        151.81   (1.5%)                   152.86   (1.8%)     0.7% (  -2% -    4%)   0.183
            OrHighNotLow        384.08   (5.0%)                   388.68   (6.9%)     1.2% ( -10% -   13%)   0.531
           OrHighNotHigh        210.18   (6.1%)                   213.18   (7.3%)     1.4% ( -11% -   15%)   0.502
            OrHighNotMed        324.28   (5.3%)                   329.41   (6.8%)     1.6% (  -9% -   14%)   0.413
                 MedTerm        567.00   (5.4%)                   578.90   (8.1%)     2.1% ( -10% -   16%)   0.333
             CountPhrase          3.24  (10.3%)                     3.31  (13.2%)     2.2% ( -19% -   28%)   0.551
                 LowTerm        854.03   (4.9%)                   873.32   (7.2%)     2.3% (  -9% -   15%)   0.248
              AndHighMed        197.59   (1.5%)                   203.05   (2.2%)     2.8% (   0% -    6%)   0.000
           OrNotHighHigh        178.76   (6.5%)                   184.38   (7.5%)     3.1% ( -10% -   18%)   0.156
             OrStopWords         32.36   (2.8%)                    33.56   (1.7%)     3.7% (   0% -    8%)   0.000
                Or3Terms        158.54   (1.6%)                   164.51   (2.1%)     3.8% (   0% -    7%)   0.000
               OrHighMed        231.23   (1.8%)                   241.40   (2.9%)     4.4% (   0% -    9%)   0.000
               And3Terms        157.12   (1.3%)                   164.32   (1.5%)     4.6% (   1% -    7%)   0.000
               OrHighLow        732.71   (1.6%)                   786.67   (3.1%)     7.4% (   2% -   12%)   0.000
            OrNotHighMed        282.64   (6.5%)                   306.83   (8.5%)     8.6% (  -6% -   25%)   0.000
              OrHighRare        237.87   (7.8%)                   259.37   (4.6%)     9.0% (  -3% -   23%)   0.000
            OrNotHighLow        833.05   (2.4%)                   946.10   (3.8%)    13.6% (   7% -   20%)   0.000
        CountAndHighHigh         41.24   (2.0%)                    46.91   (2.7%)    13.8% (   8% -   18%)   0.000
              AndHighLow        748.77   (1.7%)                   870.25   (3.1%)    16.2% (  11% -   21%)   0.000
         CountAndHighMed        120.32   (2.0%)                   140.26   (3.5%)    16.6% (  10% -   22%)   0.000
   ```
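
For reading the table: the "Pct diff" figures are consistent with the relative change in mean QPS from the baseline to the modified version. The small Python check below recomputes two rows under that assumption. The `pct_diff` helper is a hypothetical name introduced here, not part of the benchmark tooling, and the parenthesized range and p-value columns are not reproduced.

```python
def pct_diff(baseline_qps: float, candidate_qps: float) -> float:
    """Relative change in QPS, in percent (negative means slower)."""
    return (candidate_qps - baseline_qps) / baseline_qps * 100.0


# Values copied from the HighTermTitleSort and AndHighLow rows above.
high_term_title_sort = pct_diff(152.82, 105.67)
and_high_low = pct_diff(748.77, 870.25)
print(f"{high_term_title_sort:.1f}% {and_high_low:.1f}%")
```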


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

