Re: [I] Try encoding very frequent terms using a dense bitmap [lucene]

via GitHub Sun, 03 Mar 2024 07:44:20 -0800


msokolov commented on issue #13147:
URL: https://github.com/apache/lucene/issues/13147#issuecomment-1975201779


   I ran luceneutil over wikimediumall. The index size was slightly reduced: 
   
   ```
   65200   ../indices/baseline/facets
   18923720        ../indices/baseline/index
   18988924        ../indices/baseline
   65204   ../indices/candidate/facets
   18774956        ../indices/candidate/index
   18840164        ../indices/candidate
   ```
   In a microbenchmark where I indexed random doc-only postings, I saw a ~28%
   reduction in index size.
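
As a rough check on the listing above, the relative reduction can be computed directly from the two `index` directory sizes. This is only an illustrative sketch: the helper name `pct_reduction` is made up here, and the sizes are taken as-is in whatever units `du` reported. It suggests the wikimediumall index shrank by a bit under 1%, far less than the ~28% seen in the doc-only microbenchmark.

```python
# Sizes copied from the du listing above (units as reported by du).
baseline_index = 18_923_720
candidate_index = 18_774_956


def pct_reduction(before: int, after: int) -> float:
    """Return how much smaller `after` is than `before`, in percent."""
    return 100.0 * (before - after) / before


print(f"index dir: {pct_reduction(baseline_index, candidate_index):.2f}% smaller")
```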
   
   Query performance does seem to have registered some actual change:
   
   ```
                          Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
                  OrHighNotLow   124.19   (6.0%)   111.98   (6.7%)   -9.8% ( -21% -    3%) 0.000
                   LowSpanNear     1.50   (1.1%)     1.42   (1.2%)   -4.8% (  -7% -   -2%) 0.000
             HighTermTitleSort    86.63   (3.0%)    82.70   (2.2%)   -4.5% (  -9% -    0%) 0.000
           MedIntervalsOrdered     3.25   (4.3%)     3.11   (4.3%)   -4.1% ( -12% -    4%) 0.003
                    OrHighHigh    23.47   (6.7%)    22.61   (3.3%)   -3.7% ( -12% -    6%) 0.029
           LowIntervalsOrdered     4.20   (4.1%)     4.05   (4.1%)   -3.5% ( -11% -    4%) 0.007
                   AndHighHigh    25.46   (8.5%)    24.57   (4.9%)   -3.5% ( -15% -   10%) 0.114
   BrowseRandomLabelTaxoFacets     2.05  (14.8%)     1.98  (11.0%)   -3.4% ( -25% -   26%) 0.405
          HighIntervalsOrdered     2.09   (5.3%)     2.02   (5.4%)   -3.1% ( -13% -    7%) 0.063
                  HighSpanNear     4.25   (1.9%)     4.13   (2.0%)   -2.8% (  -6% -    1%) 0.000
                     OrHighMed    43.34   (3.1%)    42.18   (2.1%)   -2.7% (  -7% -    2%) 0.001
          BrowseDateTaxoFacets     2.78   (7.6%)     2.70   (6.6%)   -2.7% ( -15% -   12%) 0.234
     BrowseDayOfYearTaxoFacets     2.81   (7.2%)     2.74   (6.2%)   -2.5% ( -14% -   11%) 0.236
                       Prefix3   126.88   (2.3%)   123.78   (3.5%)   -2.4% (  -8% -    3%) 0.009
                   MedSpanNear    11.93   (0.9%)    11.65   (1.1%)   -2.3% (  -4% -    0%) 0.000
                  OrHighNotMed   141.45   (5.1%)   138.33   (7.0%)   -2.2% ( -13% -   10%) 0.254
                    AndHighMed    36.62   (5.6%)    35.82   (3.1%)   -2.2% ( -10% -    6%) 0.124
                     MedPhrase    67.69   (2.9%)    66.22   (2.6%)   -2.2% (  -7% -    3%) 0.013
              HighSloppyPhrase    10.38   (1.6%)    10.20   (1.5%)   -1.8% (  -4% -    1%) 0.000
                        IntNRQ     8.57  (14.4%)     8.42  (16.1%)   -1.8% ( -28% -   33%) 0.713
                      HighTerm   271.19   (4.0%)   266.87   (5.1%)   -1.6% ( -10% -    7%) 0.271
               MedSloppyPhrase     8.12   (1.9%)     8.00   (2.5%)   -1.6% (  -5% -    2%) 0.028
                    HighPhrase    39.43   (3.8%)    38.94   (3.1%)   -1.2% (  -7% -    5%) 0.257
                       MedTerm   235.50   (3.4%)   232.58   (4.7%)   -1.2% (  -9% -    7%) 0.339
                     LowPhrase    46.81   (2.8%)    46.27   (2.3%)   -1.2% (  -6% -    4%) 0.157
                 OrHighNotHigh   147.42   (4.7%)   145.78   (6.2%)   -1.1% ( -11% -   10%) 0.525
                    TermDTSort    88.33   (2.8%)    87.38   (1.8%)   -1.1% (  -5% -    3%) 0.151
         HighTermDayOfYearSort   152.37   (2.1%)   150.79   (1.8%)   -1.0% (  -4% -    2%) 0.093
                       LowTerm   254.01   (1.9%)   251.72   (2.6%)   -0.9% (  -5% -    3%) 0.207
               LowSloppyPhrase    24.52   (0.9%)    24.32   (1.4%)   -0.8% (  -3% -    1%) 0.029
                 OrNotHighHigh   199.37   (3.8%)   197.74   (4.9%)   -0.8% (  -9% -    8%) 0.557
             HighTermMonthSort  1581.75   (2.6%)  1569.14   (2.1%)   -0.8% (  -5% -    4%) 0.292
                  OrNotHighMed   134.43   (2.7%)   133.51   (3.3%)   -0.7% (  -6% -    5%) 0.471
                     OrHighLow   279.41   (2.1%)   277.84   (2.2%)   -0.6% (  -4% -    3%) 0.412
                        Fuzzy1    64.73   (1.5%)    64.48   (0.7%)   -0.4% (  -2% -    1%) 0.302
        OrHighMedDayTaxoFacets     3.84   (6.3%)     3.83   (5.4%)   -0.4% ( -11% -   12%) 0.845
       AndHighMedDayTaxoFacets    31.84   (1.2%)    31.74   (1.5%)   -0.3% (  -2% -    2%) 0.444
                        Fuzzy2    36.90   (1.3%)    36.80   (0.8%)   -0.3% (  -2% -    1%) 0.383
   BrowseRandomLabelSSDVFacets     1.57   (5.5%)     1.57   (3.8%)   -0.2% (  -9% -    9%) 0.906
                      PKLookup   140.43   (1.7%)   140.30   (2.1%)   -0.1% (  -3% -    3%) 0.876
                    AndHighLow   279.44   (2.2%)   279.34   (2.3%)   -0.0% (  -4% -    4%) 0.958
                  OrNotHighLow   345.34   (1.7%)   345.21   (1.9%)   -0.0% (  -3% -    3%) 0.948
                       Respell    33.36   (1.5%)    33.38   (1.3%)    0.1% (  -2% -    2%) 0.881
          MedTermDayTaxoFacets    10.12   (2.4%)    10.13   (2.4%)    0.1% (  -4% -    4%) 0.912
     BrowseDayOfYearSSDVFacets     2.32   (5.4%)     2.33   (3.3%)    0.1% (  -8% -    9%) 0.953
          HighTermTitleBDVSort     4.74   (3.3%)     4.74   (4.0%)    0.1% (  -6% -    7%) 0.902
                      Wildcard   136.61   (2.5%)   136.82   (2.2%)    0.2% (  -4% -    4%) 0.831
          BrowseDateSSDVFacets     0.68  (13.1%)     0.68  (13.0%)    0.4% ( -22% -   30%) 0.928
         BrowseMonthTaxoFacets     2.84   (3.8%)     2.87   (1.3%)    1.1% (  -3% -    6%) 0.207
         BrowseMonthSSDVFacets     2.38   (5.1%)     2.41   (4.0%)    1.3% (  -7% -   11%) 0.362
      AndHighHighDayTaxoFacets     3.23   (3.6%)     3.34   (2.9%)    3.5% (  -2% -   10%) 0.001
   ```
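
One way to read a table like this is to keep only the rows whose p-value falls below a chosen threshold and look at the sign of the change. The sketch below does that for four rows copied from the table. The 0.05 cutoff and the helper name `significant_changes` are assumptions made for illustration, not something luceneutil or the comment prescribes, and the percentages are recomputed from the rounded QPS values, so they can differ from the table in the last digit.

```python
# (task, baseline QPS, candidate QPS, p-value), copied from the table above.
rows = [
    ("OrHighNotLow", 124.19, 111.98, 0.000),
    ("AndHighHigh", 25.46, 24.57, 0.114),
    ("IntNRQ", 8.57, 8.42, 0.713),
    ("AndHighHighDayTaxoFacets", 3.23, 3.34, 0.001),
]

ALPHA = 0.05  # assumed significance threshold, not stated in the comment


def significant_changes(rows, alpha=ALPHA):
    """Return (task, pct_diff) for rows whose p-value is below alpha."""
    out = []
    for task, base, cand, p_value in rows:
        if p_value < alpha:
            out.append((task, 100.0 * (cand - base) / base))
    return out


for task, pct in significant_changes(rows):
    print(f"{task:>26} {pct:+.1f}%")
```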
   
   So this looks positive. I can try tuning the decision parameter that
   controls which encoding to use, to see what impact that has. What I wonder
   is whether chasing this is worth the added complexity, but I'm pretty
   encouraged that the overhead of the conditionals isn't overwhelming the
   "within-block skipping" this affords.

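For readers unfamiliar with the idea, here is a minimal sketch of a per-block choice between a dense bitmap and delta encoding, and of how a bitmap lets `advance` jump to a target inside the block without decoding the entries before it. It is not the patch under discussion and not Lucene's postings format: the names `encode_block` and `advance`, the block layout, and the threshold `MAX_BITS_PER_DOC` (standing in for the "decision parameter") are all hypothetical.

```python
MAX_BITS_PER_DOC = 4  # hypothetical decision parameter: bitmap bits allowed per doc


def encode_block(doc_ids, base):
    """Encode one block of sorted doc IDs (all > base) as a bitmap or as deltas."""
    span = doc_ids[-1] - base  # number of bits a bitmap would need
    if span <= MAX_BITS_PER_DOC * len(doc_ids):
        bits = 0
        for doc in doc_ids:
            bits |= 1 << (doc - base - 1)  # bit i stands for doc base + 1 + i
        return ("bitmap", base, bits)
    deltas = [doc_ids[0] - base] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return ("deltas", base, deltas)


def advance(block, target):
    """Return the first doc ID >= target in the block, or None if there is none."""
    kind, base, payload = block
    if kind == "bitmap":
        shift = max(target - base - 1, 0)
        rest = payload >> shift  # drop every bit below the target in one step
        if rest == 0:
            return None
        lowest = (rest & -rest).bit_length() - 1  # index of the lowest set bit
        return base + 1 + shift + lowest
    doc = base
    for delta in payload:  # deltas have to be summed one by one
        doc += delta
        if doc >= target:
            return doc
    return None


dense = encode_block([11, 12, 14, 15, 18], base=10)
sparse = encode_block([100, 5000, 90000], base=10)
print(dense[0], advance(dense, 13))
print(sparse[0], advance(sparse, 5000))
```

The extra branch on the block kind in `advance` is the sort of conditional overhead the comment refers to. Raising or lowering the threshold changes how many blocks take the bitmap path.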

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

