msokolov commented on issue #13147:
URL: https://github.com/apache/lucene/issues/13147#issuecomment-1975201779
I ran luceneutil over wikimediumall. The index size was slightly reduced:
```
65200 ../indices/baseline/facets
18923720 ../indices/baseline/index
18988924 ../indices/baseline
65204 ../indices/candidate/facets
18774956 ../indices/candidate/index
18840164 ../indices/candidate
```
In a microbenchmark where I indexed random doc-only postings I saw ~28%
index size reduction.
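For scale, the relative reduction on the wikimediumall index can be worked out from the sizes listed above. This is a small sketch only; it assumes the figures are `du` output in 1 KiB blocks (the unit is not stated), and the helper name `pct_reduction` is made up for illustration.

```python
# Sizes copied from the listing above. The unit (du's 1 KiB blocks) is an
# assumption; it cancels out of the percentage either way.
baseline_total = 18988924
candidate_total = 18840164
baseline_index = 18923720
candidate_index = 18774956


def pct_reduction(before: int, after: int) -> float:
    """Relative size reduction as a percentage of the original size."""
    return 100.0 * (before - after) / before


total_pct = pct_reduction(baseline_total, candidate_total)
index_pct = pct_reduction(baseline_index, candidate_index)
print(f"total: {total_pct:.2f}%  index only: {index_pct:.2f}%")
```

Both figures come out just under 1%, well below the ~28% seen in the doc-only microbenchmark.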
Query performance does seem to have registered some actual change, mostly small slowdowns:
```
Task                        QPS baseline   StdDev QPS my_modified_version   StdDev Pct diff                p-value
OrHighNotLow                      124.19   (6.0%)                  111.98   (6.7%)    -9.8% ( -21% -   3%) 0.000
LowSpanNear                         1.50   (1.1%)                    1.42   (1.2%)    -4.8% (  -7% -  -2%) 0.000
HighTermTitleSort                  86.63   (3.0%)                   82.70   (2.2%)    -4.5% (  -9% -   0%) 0.000
MedIntervalsOrdered                 3.25   (4.3%)                    3.11   (4.3%)    -4.1% ( -12% -   4%) 0.003
OrHighHigh                         23.47   (6.7%)                   22.61   (3.3%)    -3.7% ( -12% -   6%) 0.029
LowIntervalsOrdered                 4.20   (4.1%)                    4.05   (4.1%)    -3.5% ( -11% -   4%) 0.007
AndHighHigh                        25.46   (8.5%)                   24.57   (4.9%)    -3.5% ( -15% -  10%) 0.114
BrowseRandomLabelTaxoFacets         2.05  (14.8%)                    1.98  (11.0%)    -3.4% ( -25% -  26%) 0.405
HighIntervalsOrdered                2.09   (5.3%)                    2.02   (5.4%)    -3.1% ( -13% -   7%) 0.063
HighSpanNear                        4.25   (1.9%)                    4.13   (2.0%)    -2.8% (  -6% -   1%) 0.000
OrHighMed                          43.34   (3.1%)                   42.18   (2.1%)    -2.7% (  -7% -   2%) 0.001
BrowseDateTaxoFacets                2.78   (7.6%)                    2.70   (6.6%)    -2.7% ( -15% -  12%) 0.234
BrowseDayOfYearTaxoFacets           2.81   (7.2%)                    2.74   (6.2%)    -2.5% ( -14% -  11%) 0.236
Prefix3                           126.88   (2.3%)                  123.78   (3.5%)    -2.4% (  -8% -   3%) 0.009
MedSpanNear                        11.93   (0.9%)                   11.65   (1.1%)    -2.3% (  -4% -   0%) 0.000
OrHighNotMed                      141.45   (5.1%)                  138.33   (7.0%)    -2.2% ( -13% -  10%) 0.254
AndHighMed                         36.62   (5.6%)                   35.82   (3.1%)    -2.2% ( -10% -   6%) 0.124
MedPhrase                          67.69   (2.9%)                   66.22   (2.6%)    -2.2% (  -7% -   3%) 0.013
HighSloppyPhrase                   10.38   (1.6%)                   10.20   (1.5%)    -1.8% (  -4% -   1%) 0.000
IntNRQ                              8.57  (14.4%)                    8.42  (16.1%)    -1.8% ( -28% -  33%) 0.713
HighTerm                          271.19   (4.0%)                  266.87   (5.1%)    -1.6% ( -10% -   7%) 0.271
MedSloppyPhrase                     8.12   (1.9%)                    8.00   (2.5%)    -1.6% (  -5% -   2%) 0.028
HighPhrase                         39.43   (3.8%)                   38.94   (3.1%)    -1.2% (  -7% -   5%) 0.257
MedTerm                           235.50   (3.4%)                  232.58   (4.7%)    -1.2% (  -9% -   7%) 0.339
LowPhrase                          46.81   (2.8%)                   46.27   (2.3%)    -1.2% (  -6% -   4%) 0.157
OrHighNotHigh                     147.42   (4.7%)                  145.78   (6.2%)    -1.1% ( -11% -  10%) 0.525
TermDTSort                         88.33   (2.8%)                   87.38   (1.8%)    -1.1% (  -5% -   3%) 0.151
HighTermDayOfYearSort             152.37   (2.1%)                  150.79   (1.8%)    -1.0% (  -4% -   2%) 0.093
LowTerm                           254.01   (1.9%)                  251.72   (2.6%)    -0.9% (  -5% -   3%) 0.207
LowSloppyPhrase                    24.52   (0.9%)                   24.32   (1.4%)    -0.8% (  -3% -   1%) 0.029
OrNotHighHigh                     199.37   (3.8%)                  197.74   (4.9%)    -0.8% (  -9% -   8%) 0.557
HighTermMonthSort                1581.75   (2.6%)                 1569.14   (2.1%)    -0.8% (  -5% -   4%) 0.292
OrNotHighMed                      134.43   (2.7%)                  133.51   (3.3%)    -0.7% (  -6% -   5%) 0.471
OrHighLow                         279.41   (2.1%)                  277.84   (2.2%)    -0.6% (  -4% -   3%) 0.412
Fuzzy1                             64.73   (1.5%)                   64.48   (0.7%)    -0.4% (  -2% -   1%) 0.302
OrHighMedDayTaxoFacets              3.84   (6.3%)                    3.83   (5.4%)    -0.4% ( -11% -  12%) 0.845
AndHighMedDayTaxoFacets            31.84   (1.2%)                   31.74   (1.5%)    -0.3% (  -2% -   2%) 0.444
Fuzzy2                             36.90   (1.3%)                   36.80   (0.8%)    -0.3% (  -2% -   1%) 0.383
BrowseRandomLabelSSDVFacets         1.57   (5.5%)                    1.57   (3.8%)    -0.2% (  -9% -   9%) 0.906
PKLookup                          140.43   (1.7%)                  140.30   (2.1%)    -0.1% (  -3% -   3%) 0.876
AndHighLow                        279.44   (2.2%)                  279.34   (2.3%)    -0.0% (  -4% -   4%) 0.958
OrNotHighLow                      345.34   (1.7%)                  345.21   (1.9%)    -0.0% (  -3% -   3%) 0.948
Respell                            33.36   (1.5%)                   33.38   (1.3%)     0.1% (  -2% -   2%) 0.881
MedTermDayTaxoFacets               10.12   (2.4%)                   10.13   (2.4%)     0.1% (  -4% -   4%) 0.912
BrowseDayOfYearSSDVFacets           2.32   (5.4%)                    2.33   (3.3%)     0.1% (  -8% -   9%) 0.953
HighTermTitleBDVSort                4.74   (3.3%)                    4.74   (4.0%)     0.1% (  -6% -   7%) 0.902
Wildcard                          136.61   (2.5%)                  136.82   (2.2%)     0.2% (  -4% -   4%) 0.831
BrowseDateSSDVFacets                0.68  (13.1%)                    0.68  (13.0%)     0.4% ( -22% -  30%) 0.928
BrowseMonthTaxoFacets               2.84   (3.8%)                    2.87   (1.3%)     1.1% (  -3% -   6%) 0.207
BrowseMonthSSDVFacets               2.38   (5.1%)                    2.41   (4.0%)     1.3% (  -7% -  11%) 0.362
AndHighHighDayTaxoFacets            3.23   (3.6%)                    3.34   (2.9%)     3.5% (  -2% -  10%) 0.001
```
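One way to read a table like this is to keep only the rows whose p-value falls below a chosen cutoff. The following is a rough sketch, assuming each result row sits on a single line and using 0.05 as the cutoff (an assumption, not something luceneutil prescribes). The regex `ROW` and the helper `significant_rows` are hypothetical names, not part of luceneutil.

```python
import re

# One result row, joined onto a single line: task, baseline QPS (stddev),
# candidate QPS (stddev), pct diff, (range), p-value.
ROW = re.compile(
    r"^\s*(?P<task>\S+)\s+[\d.]+\s+\([\d.]+%\)\s+[\d.]+\s+\([\d.]+%\)"
    r"\s+(?P<pct>-?[\d.]+)%\s+\(.*\)\s+(?P<p>[\d.]+)\s*$"
)

ALPHA = 0.05  # assumed significance cutoff


def significant_rows(lines, alpha=ALPHA):
    """Return (task, pct_diff) for rows whose p-value is below alpha."""
    out = []
    for line in lines:
        m = ROW.match(line)
        if m and float(m["p"]) < alpha:
            out.append((m["task"], float(m["pct"])))
    return out


# A few rows copied from the table above.
sample = [
    "OrHighNotLow 124.19 (6.0%) 111.98 (6.7%) -9.8% ( -21% - 3%) 0.000",
    "AndHighHigh 25.46 (8.5%) 24.57 (4.9%) -3.5% ( -15% - 10%) 0.114",
    "PKLookup 140.43 (1.7%) 140.30 (2.1%) -0.1% ( -3% - 3%) 0.876",
    "AndHighHighDayTaxoFacets 3.23 (3.6%) 3.34 (2.9%) 3.5% ( -2% - 10%) 0.001",
]

for task, pct in significant_rows(sample):
    print(f"{task}: {pct:+.1f}%")
```

On these four sample rows the filter keeps OrHighNotLow and AndHighHighDayTaxoFacets and drops the two rows with p-values above the cutoff.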
So this looks positive. I can try tuning the decision parameter controlling
which encoding to use to see what impact that may have. I guess what I wonder
is whether this is worth the added complexity, but I'm pretty
encouraged that the overhead of the conditionals isn't overwhelming the
"within-block skipping" this affords.