I have run the luceneutil benchmark with more iterations and a higher
repeat count, but the results are still very noisy, which I blame on
running the benchmarks on a laptop.
The results always show some of the facets tasks speeding up while
others slow down slightly. One task that clearly shows a slowdown is
HighTermTitleBDVSort, which I expected because we are now reading those
bytes on heap using a BytesRefBuilder. The only ways to prevent this
slowdown would be to let the off-heap BytesRef thingy implement the
Comparable interface efficiently or, alternatively, to expose the doc
values max length so implementations can do the same as we are doing
today.
Task                        QPS baseline   StdDev QPS my_modified_version   StdDev Pct diff                   p-value
HighTermTitleBDVSort                4.30   (2.8%)                    4.19   (2.0%)    -2.5%  ( -7% - 2%)      0.000
BrowseMonthTaxoFacets              10.25  (26.1%)                   10.06  (26.4%)    -1.8%  ( -43% - 68%)    0.623
TermDTSort                        216.72   (7.3%)                  213.19   (6.7%)    -1.6%  ( -14% - 13%)    0.100
OrNotHighMed                      464.01   (3.5%)                  459.77   (3.5%)    -0.9%  ( -7% - 6%)      0.066
Wildcard                          305.50   (4.1%)                  303.08   (3.9%)    -0.8%  ( -8% - 7%)      0.164
Respell                            60.92   (2.5%)                   60.51   (2.4%)    -0.7%  ( -5% - 4%)      0.055
AndHighLow                       1406.33   (3.1%)                 1397.09   (3.3%)    -0.7%  ( -6% - 5%)      0.151
OrHighNotHigh                     435.53   (4.8%)                  432.71   (5.6%)    -0.6%  ( -10% - 10%)    0.383
LowTerm                           931.83   (3.5%)                  925.97   (3.2%)    -0.6%  ( -7% - 6%)      0.185
HighTermDayOfYearSort             410.08   (4.9%)                  407.77   (4.9%)    -0.6%  ( -9% - 9%)      0.416
AndHighHighDayTaxoFacets            6.45   (2.1%)                    6.42   (2.4%)    -0.6%  ( -4% - 3%)      0.081
MedPhrase                         117.99   (2.7%)                  117.34   (2.7%)    -0.6%  ( -5% - 4%)      0.142
LowPhrase                          24.97   (2.9%)                   24.84   (3.1%)    -0.5%  ( -6% - 5%)      0.210
AndHighMedDayTaxoFacets            18.40   (2.0%)                   18.31   (2.3%)    -0.5%  ( -4% - 3%)      0.095
HighTerm                          793.15   (4.4%)                  789.25   (5.2%)    -0.5%  ( -9% - 9%)      0.469
OrHighMedDayTaxoFacets              4.64   (3.3%)                    4.62   (3.7%)    -0.5%  ( -7% - 6%)      0.334
OrHighMed                         188.68   (2.6%)                  187.79   (3.1%)    -0.5%  ( -6% - 5%)      0.244
OrNotHighLow                     1277.67   (2.5%)                 1271.99   (2.9%)    -0.4%  ( -5% - 5%)      0.245
HighPhrase                         59.28   (3.6%)                   59.03   (3.4%)    -0.4%  ( -7% - 6%)      0.402
Fuzzy2                             82.34   (2.2%)                   82.02   (2.2%)    -0.4%  ( -4% - 4%)      0.205
OrNotHighHigh                     688.74   (3.8%)                  686.08   (4.4%)    -0.4%  ( -8% - 8%)      0.507
MedTermDayTaxoFacets               10.69   (3.2%)                   10.65   (2.9%)    -0.4%  ( -6% - 5%)      0.382
OrHighNotMed                      722.62   (4.8%)                  720.03   (5.7%)    -0.4%  ( -10% - 10%)    0.630
MedTerm                           979.84   (3.5%)                  976.55   (4.0%)    -0.3%  ( -7% - 7%)      0.528
OrHighLow                         780.09   (2.7%)                  777.60   (2.8%)    -0.3%  ( -5% - 5%)      0.413
OrHighNotLow                      639.07   (5.1%)                  637.34   (6.0%)    -0.3%  ( -10% - 11%)    0.728
Prefix3                           371.35   (2.3%)                  370.43   (1.9%)    -0.2%  ( -4% - 4%)      0.409
AndHighMed                        127.87   (2.4%)                  127.63   (2.8%)    -0.2%  ( -5% - 5%)      0.613
HighIntervalsOrdered                4.88   (5.7%)                    4.88   (5.8%)    -0.1%  ( -11% - 12%)    0.882
HighTermMonthSort                2844.77   (2.7%)                 2841.57   (2.8%)    -0.1%  ( -5% - 5%)      0.770
OrHighHigh                         56.75   (1.8%)                   56.70   (2.1%)    -0.1%  ( -3% - 3%)      0.793
MedIntervalsOrdered                 7.42   (3.7%)                    7.41   (3.8%)    -0.1%  ( -7% - 7%)      0.894
LowIntervalsOrdered                22.00   (3.6%)                   21.99   (3.6%)    -0.0%  ( -7% - 7%)      0.930
MedSloppyPhrase                    34.10   (2.3%)                   34.11   (2.3%)     0.0%  ( -4% - 4%)      0.931
range                            7621.67   (4.4%)                 7625.28   (3.8%)     0.0%  ( -7% - 8%)      0.935
HighTermTitleSort                 208.38   (2.7%)                  208.52   (2.6%)     0.1%  ( -5% - 5%)      0.858
LowSpanNear                         9.95   (1.4%)                    9.96   (1.5%)     0.1%  ( -2% - 3%)      0.668
Fuzzy1                             70.10   (2.4%)                   70.23   (2.6%)     0.2%  ( -4% - 5%)      0.616
PKLookup                          262.66   (3.9%)                  263.16   (4.3%)     0.2%  ( -7% - 8%)      0.743
HighSloppyPhrase                    7.09   (3.7%)                    7.11   (4.2%)     0.2%  ( -7% - 8%)      0.732
AndHighHigh                        49.37   (2.2%)                   49.47   (2.5%)     0.2%  ( -4% - 4%)      0.538
LowSloppyPhrase                   186.55   (6.7%)                  186.96   (6.7%)     0.2%  ( -12% - 14%)    0.813
MedSpanNear                        33.62   (2.6%)                   33.71   (2.8%)     0.2%  ( -5% - 5%)      0.523
IntNRQ                             25.87   (5.7%)                   25.94   (5.0%)     0.3%  ( -9% - 11%)     0.713
HighSpanNear                       11.58   (3.1%)                   11.61   (3.2%)     0.3%  ( -5% - 6%)      0.528
BrowseDayOfYearTaxoFacets           6.98   (8.0%)                    7.08  (12.6%)     1.4%  ( -17% - 23%)    0.333
BrowseDateTaxoFacets                6.90   (8.2%)                    7.01  (12.8%)     1.5%  ( -17% - 24%)    0.321
BrowseRandomLabelSSDVFacets         5.04   (8.6%)                    5.15  (11.5%)     2.1%  ( -16% - 24%)    0.148
BrowseDateSSDVFacets                1.37  (15.9%)                    1.40  (16.0%)     2.1%  ( -25% - 40%)    0.341
BrowseDayOfYearSSDVFacets           6.24  (10.4%)                    6.39  (12.9%)     2.4%  ( -18% - 28%)    0.148
BrowseRandomLabelTaxoFacets         5.65   (8.3%)                    5.80  (22.2%)     2.5%  ( -25% - 35%)    0.286
BrowseMonthSSDVFacets               6.24  (10.7%)                    6.40  (13.4%)     2.6%  ( -19% - 29%)    0.136
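To make the first option above (an off-heap ref that implements Comparable) more concrete, here is a minimal Java sketch. It is not the code from the PR: OffHeapRef, OffHeapRefSketch, store and copyToHeap are hypothetical names, and a direct java.nio.ByteBuffer stands in for Lucene's RandomAccessInput so that the example runs without Lucene on the classpath. The comparison reads bytes in place and uses unsigned byte order, which is the order BytesRef comparisons use; copyToHeap shows the on-heap copy that the sort path pays for today.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class OffHeapRefSketch {

  /** Hypothetical off-heap counterpart of BytesRef: a (buffer, offset, length) view. */
  static final class OffHeapRef implements Comparable<OffHeapRef> {
    private final ByteBuffer buffer; // assumption: stand-in for a RandomAccessInput
    private final int offset;
    private final int length;

    OffHeapRef(ByteBuffer buffer, int offset, int length) {
      this.buffer = buffer;
      this.offset = offset;
      this.length = length;
    }

    int length() {
      return length;
    }

    /** Reads a single byte in place; nothing is copied to the heap. */
    byte byteAt(int index) {
      return buffer.get(offset + index);
    }

    /** Unsigned lexicographic comparison, read byte by byte from the buffer. */
    @Override
    public int compareTo(OffHeapRef other) {
      int common = Math.min(length, other.length);
      for (int i = 0; i < common; i++) {
        int a = byteAt(i) & 0xFF;
        int b = other.byteAt(i) & 0xFF;
        if (a != b) {
          return a - b;
        }
      }
      return length - other.length; // a strict prefix sorts first
    }

    /** The on-heap alternative: materialize the whole value before using it. */
    byte[] copyToHeap() {
      byte[] dst = new byte[length];
      ByteBuffer view = buffer.duplicate(); // independent position, shared content
      view.position(offset);
      view.get(dst, 0, length);
      return dst;
    }
  }

  /** Packs the UTF-8 bytes of each value into one direct buffer, one ref per value. */
  static OffHeapRef[] store(String... values) {
    byte[][] encoded = new byte[values.length][];
    int total = 0;
    for (int i = 0; i < values.length; i++) {
      encoded[i] = values[i].getBytes(StandardCharsets.UTF_8);
      total += encoded[i].length;
    }
    ByteBuffer buffer = ByteBuffer.allocateDirect(total);
    OffHeapRef[] refs = new OffHeapRef[values.length];
    int offset = 0;
    for (int i = 0; i < values.length; i++) {
      buffer.put(encoded[i]);
      refs[i] = new OffHeapRef(buffer, offset, encoded[i].length);
      offset += encoded[i].length;
    }
    return refs;
  }

  public static void main(String[] args) {
    OffHeapRef[] refs = store("apple", "apricot", "app");
    System.out.println(refs[0].compareTo(refs[1]) < 0); // true: "apple" < "apricot"
    System.out.println(refs[2].compareTo(refs[0]) < 0); // true: "app" is a prefix of "apple"
    System.out.println(new String(refs[1].copyToHeap(), StandardCharsets.UTF_8)); // apricot
  }
}
```

Whether a byte-at-a-time compareTo over a RandomAccessInput can match a comparison over an on-heap byte[] is exactly the open question behind the HighTermTitleBDVSort numbers, so the sketch only shows the shape of the API, not its performance.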
On Sat, Dec 7, 2024 at 9:08 PM Adrien Grand <[email protected]> wrote:
>
> FWIW I have also seen some users store sparse vectors or bloom filters in
> binary doc values. In both cases, the serialized size may be non-negligible
> while not all bytes are needed. This change would likely help.
>
> It would be good if the binary sort and faceting tasks did not show a big
> slowdown, as these should be the worst-case scenario for this change, since
> all bytes need to be read?
>
> @Ignacio Your luceneutil results show a couple significant speedups and small
> slowdowns but the p-values are high, which suggests that results are very
> noisy. I wonder if the benchmark had enough iterations or taskRepeatCount.
>
> On Thu, Dec 5, 2024 at 17:38, Ignacio Vera <[email protected]> wrote:
>>
>> @Chris: Agreeing on an off-heap BytesRef thingy would be a great step forward.
>>
>> @Mike: Yes, there are other use cases. One that is close to my heart
>> is the geo use case, where in many cases you don't need to read all the
>> bytes, and geometries can be big. In Lucene there are some interesting
>> usages in the facets module, which I have already implemented in the PR.
>> Running the wikimedium benchmark on it (I think), it shows an
>> improvement on the facets runs as well as some regressions:
>>
>> BrowseRandomLabelSSDVFacets      5.21  (12.5%)     5.01   (8.8%)   -3.8%  ( -22% - 20%)   0.264
>> OrHighMedDayTaxoFacets           4.78   (5.5%)     4.68   (5.0%)   -2.1%  ( -11% - 8%)    0.211
>> HighTermTitleBDVSort             9.11   (1.5%)     8.96   (1.4%)   -1.7%  ( -4% - 1%)     0.000
>> BrowseDayOfYearSSDVFacets        6.64  (15.7%)     6.54  (15.1%)   -1.5%  ( -27% - 34%)   0.765
>> BrowseMonthSSDVFacets            6.50  (11.8%)     6.42  (11.6%)   -1.3%  ( -22% - 25%)   0.729
>> HighTerm                       565.97   (6.5%)   562.79   (7.0%)   -0.6%  ( -13% - 13%)   0.793
>> HighTermMonthSort             2350.90   (4.5%)  2338.02   (4.0%)   -0.5%  ( -8% - 8%)     0.684
>> AndHighHigh                     55.14   (3.3%)    54.97   (4.1%)   -0.3%  ( -7% - 7%)     0.803
>> OrHighNotMed                   406.21   (7.2%)   405.14   (6.6%)   -0.3%  ( -13% - 14%)   0.904
>> OrNotHighHigh                  425.43   (3.5%)   424.78   (3.3%)   -0.2%  ( -6% - 6%)     0.886
>> MedTermDayTaxoFacets            30.90   (2.0%)    30.86   (1.6%)   -0.1%  ( -3% - 3%)     0.834
>> MedSloppyPhrase                 32.77   (3.7%)    32.73   (3.7%)   -0.1%  ( -7% - 7%)     0.921
>> AndHighMed                     184.44   (3.4%)   184.37   (3.6%)   -0.0%  ( -6% - 7%)     0.969
>> LowSloppyPhrase                 52.47   (1.7%)    52.47   (1.6%)    0.0%  ( -3% - 3%)     0.996
>> OrHighNotHigh                  619.12   (5.0%)   619.46   (4.6%)    0.1%  ( -9% - 10%)    0.971
>> OrHighNotLow                   567.45   (6.5%)   568.21   (6.2%)    0.1%  ( -11% - 13%)   0.947
>> PKLookup                       275.72   (2.3%)   276.13   (3.3%)    0.1%  ( -5% - 5%)     0.872
>> LowIntervalsOrdered              6.15   (2.1%)     6.16   (2.5%)    0.1%  ( -4% - 4%)     0.836
>> IntNRQ                          76.94   (5.8%)    77.11   (4.5%)    0.2%  ( -9% - 11%)    0.895
>> HighSloppyPhrase                 2.32   (2.7%)     2.33   (2.3%)    0.3%  ( -4% - 5%)     0.685
>> LowTerm                        629.84   (3.5%)   632.74   (3.3%)    0.5%  ( -6% - 7%)     0.670
>> LowSpanNear                     99.79   (2.7%)   100.30   (3.2%)    0.5%  ( -5% - 6%)     0.589
>> MedTerm                        889.05   (4.2%)   893.75   (4.6%)    0.5%  ( -7% - 9%)     0.703
>> OrNotHighMed                   361.55   (3.2%)   363.50   (3.1%)    0.5%  ( -5% - 7%)     0.591
>> Prefix3                        134.42   (4.3%)   135.18   (3.8%)    0.6%  ( -7% - 8%)     0.656
>> HighTermTitleSort              188.27   (2.1%)   189.35   (2.5%)    0.6%  ( -3% - 5%)     0.423
>> HighIntervalsOrdered             7.97   (4.8%)     8.02   (6.0%)    0.6%  ( -9% - 11%)    0.736
>> Wildcard                        57.19   (2.9%)    57.53   (3.1%)    0.6%  ( -5% - 6%)     0.525
>> OrHighLow                      551.58   (2.9%)   555.28   (2.5%)    0.7%  ( -4% - 6%)     0.436
>> MedIntervalsOrdered             29.22   (4.9%)    29.41   (6.1%)    0.7%  ( -9% - 12%)    0.697
>> MedPhrase                       30.11   (2.1%)    30.32   (1.5%)    0.7%  ( -2% - 4%)     0.241
>> OrHighHigh                      54.77   (6.7%)    55.15   (5.2%)    0.7%  ( -10% - 13%)   0.714
>> Fuzzy1                         108.14   (2.8%)   108.90   (2.4%)    0.7%  ( -4% - 6%)     0.403
>> OrHighMed                      182.51   (5.4%)   183.80   (3.4%)    0.7%  ( -7% - 10%)    0.622
>> AndHighMedDayTaxoFacets         30.18   (3.1%)    30.40   (2.3%)    0.7%  ( -4% - 6%)     0.403
>> HighTermDayOfYearSort          462.68   (3.7%)   466.03   (3.6%)    0.7%  ( -6% - 8%)     0.532
>> AndHighLow                    1225.05   (5.2%)  1233.95   (4.5%)    0.7%  ( -8% - 10%)    0.636
>> MedSpanNear                     13.85   (2.2%)    13.95   (2.0%)    0.7%  ( -3% - 5%)     0.264
>> LowPhrase                      204.19   (2.6%)   205.88   (1.9%)    0.8%  ( -3% - 5%)     0.247
>> HighPhrase                     105.85   (3.1%)   106.80   (2.6%)    0.9%  ( -4% - 6%)     0.322
>> Fuzzy2                          22.92   (2.6%)    23.13   (2.1%)    0.9%  ( -3% - 5%)     0.233
>> TermDTSort                     295.84   (7.3%)   298.66   (6.6%)    1.0%  ( -12% - 16%)   0.665
>> Respell                         78.37   (2.3%)    79.15   (1.8%)    1.0%  ( -2% - 5%)     0.125
>> AndHighHighDayTaxoFacets         2.70   (4.8%)     2.72   (2.5%)    1.0%  ( -6% - 8%)     0.407
>> OrNotHighLow                  1134.11   (3.2%)  1146.96   (3.8%)    1.1%  ( -5% - 8%)     0.310
>> HighSpanNear                     3.88   (7.1%)     3.95   (4.9%)    1.7%  ( -9% - 14%)    0.376
>> range                         5910.33   (9.7%)  6049.55   (8.0%)    2.4%  ( -14% - 22%)   0.403
>> BrowseDateSSDVFacets             1.19  (14.3%)     1.24  (19.0%)    4.1%  ( -25% - 43%)   0.446
>> BrowseDateTaxoFacets             6.67   (4.6%)     7.08  (24.2%)    6.1%  ( -21% - 36%)   0.264
>> BrowseDayOfYearTaxoFacets        6.74   (4.9%)     7.17  (23.8%)    6.4%  ( -21% - 36%)   0.237
>> BrowseRandomLabelTaxoFacets      5.39   (3.7%)     6.02  (52.8%)   11.7%  ( -43% - 70%)   0.322
>> BrowseMonthTaxoFacets            8.20  (35.8%)     9.48  (37.2%)   15.6%  ( -42% - 138%)  0.177
>>
>>
>>
>> On Thu, Dec 5, 2024 at 2:07 PM Michael Sokolov <[email protected]> wrote:
>> >
>> > That makes sense to me too in the abstract. At Amazon we also have
>> > interesting BDV fields we have to decode on the fly, so this looks
>> > attractive for that reason (not just faceting).
>> >
>> > I would say though that it would be easier to evaluate the fitness for
>> > purpose (faceting) if we had some examples of BinaryDocValues used for
>> > faceting (or otherwise being decoded on the fly) in the Lucene code
>> > base -- do we have that? I'd be concerned if we're not able to fully
>> > test the new functionality to see what the impact of any changes might
>> > be.
>> >
>> > On Thu, Dec 5, 2024 at 6:45 AM Chris Hegarty
>> > <[email protected]> wrote:
>> > >
>> > > Hi Ignacio,
>> > >
>> > > I completely agree with the idea of having a BytesRef-like thing that
>> > > can be off-heap. For a while now I’ve been thinking about how we could
>> > > evolve BytesRef so as to not expose its on-heap representation. Having a
>> > > separate primitive is probably a better way to go.
>> > >
>> > > -Chris.
>> > >
>> > > > On 5 Dec 2024, at 10:42, Ignacio Vera <[email protected]> wrote:
>> > > >
>> > > > Hello,
>> > > >
>> > > > I have been working with the idea of reading binary doc values
>> > > > off-heap for a while. The idea behind it is that binary doc values are
>> > > > often used for faceting where structured data is encoded at write time
>> > > > and decoded at read time. It feels wasteful to have to read the data
>> > > > on-heap before decoding it when we can read the data directly from the
>> > > > off-heap buffer.
>> > > >
>> > > > The current proposal is to evolve the current API from an on-heap data
>> > > > structure (BytesRef) to an off-heap data structure (currently named
>> > > > RandomAccessInputRef). Because we are currently reading the data into
>> > > > the buffer using a RandomAccessInput with an offset and a length, it
>> > > > feels very natural to create an off-heap equivalent to BytesRef that
>> > > > is backed by a RandomAccessInput.
>> > > >
>> > > > I am hoping to move this idea forward. Since this is a change to a
>> > > > public API, I am asking for feedback and would love to hear other
>> > > > opinions.
>> > > >
>> > > > Thank you!
>> > > >
>> > > > ---------------------------------------------------------------------
>> > > > To unsubscribe, e-mail: [email protected]
>> > > > For additional commands, e-mail: [email protected]
>> > > >
>> > >
>> > >
>> > >
>> >
>> >
>>
>>