@Chris: Agreeing on an off-heap BytesRef thingy would be a great step forward.
@Mike: Yes, there are other use cases. One that is close to my heart
is the geo use case, where in many cases you don't need to read all the
bytes, and geometries can be big. In Lucene there are some interesting
usages in the facets module, which I have already implemented in the PR.
I ran the benchmark on it (wikimedium, I think). It shows improvements
on some of the facets runs as well as some regressions:
Columns: task, baseline QPS (StdDev), candidate QPS (StdDev), pct diff (range), p-value

BrowseRandomLabelSSDVFacets     5.21 (12.5%)     5.01  (8.8%)   -3.8%  (-22% -  20%)  0.264
OrHighMedDayTaxoFacets          4.78  (5.5%)     4.68  (5.0%)   -2.1%  (-11% -   8%)  0.211
HighTermTitleBDVSort            9.11  (1.5%)     8.96  (1.4%)   -1.7%  ( -4% -   1%)  0.000
BrowseDayOfYearSSDVFacets       6.64 (15.7%)     6.54 (15.1%)   -1.5%  (-27% -  34%)  0.765
BrowseMonthSSDVFacets           6.50 (11.8%)     6.42 (11.6%)   -1.3%  (-22% -  25%)  0.729
HighTerm                      565.97  (6.5%)   562.79  (7.0%)   -0.6%  (-13% -  13%)  0.793
HighTermMonthSort            2350.90  (4.5%)  2338.02  (4.0%)   -0.5%  ( -8% -   8%)  0.684
AndHighHigh                    55.14  (3.3%)    54.97  (4.1%)   -0.3%  ( -7% -   7%)  0.803
OrHighNotMed                  406.21  (7.2%)   405.14  (6.6%)   -0.3%  (-13% -  14%)  0.904
OrNotHighHigh                 425.43  (3.5%)   424.78  (3.3%)   -0.2%  ( -6% -   6%)  0.886
MedTermDayTaxoFacets           30.90  (2.0%)    30.86  (1.6%)   -0.1%  ( -3% -   3%)  0.834
MedSloppyPhrase                32.77  (3.7%)    32.73  (3.7%)   -0.1%  ( -7% -   7%)  0.921
AndHighMed                    184.44  (3.4%)   184.37  (3.6%)   -0.0%  ( -6% -   7%)  0.969
LowSloppyPhrase                52.47  (1.7%)    52.47  (1.6%)    0.0%  ( -3% -   3%)  0.996
OrHighNotHigh                 619.12  (5.0%)   619.46  (4.6%)    0.1%  ( -9% -  10%)  0.971
OrHighNotLow                  567.45  (6.5%)   568.21  (6.2%)    0.1%  (-11% -  13%)  0.947
PKLookup                      275.72  (2.3%)   276.13  (3.3%)    0.1%  ( -5% -   5%)  0.872
LowIntervalsOrdered             6.15  (2.1%)     6.16  (2.5%)    0.1%  ( -4% -   4%)  0.836
IntNRQ                         76.94  (5.8%)    77.11  (4.5%)    0.2%  ( -9% -  11%)  0.895
HighSloppyPhrase                2.32  (2.7%)     2.33  (2.3%)    0.3%  ( -4% -   5%)  0.685
LowTerm                       629.84  (3.5%)   632.74  (3.3%)    0.5%  ( -6% -   7%)  0.670
LowSpanNear                    99.79  (2.7%)   100.30  (3.2%)    0.5%  ( -5% -   6%)  0.589
MedTerm                       889.05  (4.2%)   893.75  (4.6%)    0.5%  ( -7% -   9%)  0.703
OrNotHighMed                  361.55  (3.2%)   363.50  (3.1%)    0.5%  ( -5% -   7%)  0.591
Prefix3                       134.42  (4.3%)   135.18  (3.8%)    0.6%  ( -7% -   8%)  0.656
HighTermTitleSort             188.27  (2.1%)   189.35  (2.5%)    0.6%  ( -3% -   5%)  0.423
HighIntervalsOrdered            7.97  (4.8%)     8.02  (6.0%)    0.6%  ( -9% -  11%)  0.736
Wildcard                       57.19  (2.9%)    57.53  (3.1%)    0.6%  ( -5% -   6%)  0.525
OrHighLow                     551.58  (2.9%)   555.28  (2.5%)    0.7%  ( -4% -   6%)  0.436
MedIntervalsOrdered            29.22  (4.9%)    29.41  (6.1%)    0.7%  ( -9% -  12%)  0.697
MedPhrase                      30.11  (2.1%)    30.32  (1.5%)    0.7%  ( -2% -   4%)  0.241
OrHighHigh                     54.77  (6.7%)    55.15  (5.2%)    0.7%  (-10% -  13%)  0.714
Fuzzy1                        108.14  (2.8%)   108.90  (2.4%)    0.7%  ( -4% -   6%)  0.403
OrHighMed                     182.51  (5.4%)   183.80  (3.4%)    0.7%  ( -7% -  10%)  0.622
AndHighMedDayTaxoFacets        30.18  (3.1%)    30.40  (2.3%)    0.7%  ( -4% -   6%)  0.403
HighTermDayOfYearSort         462.68  (3.7%)   466.03  (3.6%)    0.7%  ( -6% -   8%)  0.532
AndHighLow                   1225.05  (5.2%)  1233.95  (4.5%)    0.7%  ( -8% -  10%)  0.636
MedSpanNear                    13.85  (2.2%)    13.95  (2.0%)    0.7%  ( -3% -   5%)  0.264
LowPhrase                     204.19  (2.6%)   205.88  (1.9%)    0.8%  ( -3% -   5%)  0.247
HighPhrase                    105.85  (3.1%)   106.80  (2.6%)    0.9%  ( -4% -   6%)  0.322
Fuzzy2                         22.92  (2.6%)    23.13  (2.1%)    0.9%  ( -3% -   5%)  0.233
TermDTSort                    295.84  (7.3%)   298.66  (6.6%)    1.0%  (-12% -  16%)  0.665
Respell                        78.37  (2.3%)    79.15  (1.8%)    1.0%  ( -2% -   5%)  0.125
AndHighHighDayTaxoFacets        2.70  (4.8%)     2.72  (2.5%)    1.0%  ( -6% -   8%)  0.407
OrNotHighLow                 1134.11  (3.2%)  1146.96  (3.8%)    1.1%  ( -5% -   8%)  0.310
HighSpanNear                    3.88  (7.1%)     3.95  (4.9%)    1.7%  ( -9% -  14%)  0.376
range                        5910.33  (9.7%)  6049.55  (8.0%)    2.4%  (-14% -  22%)  0.403
BrowseDateSSDVFacets            1.19 (14.3%)     1.24 (19.0%)    4.1%  (-25% -  43%)  0.446
BrowseDateTaxoFacets            6.67  (4.6%)     7.08 (24.2%)    6.1%  (-21% -  36%)  0.264
BrowseDayOfYearTaxoFacets       6.74  (4.9%)     7.17 (23.8%)    6.4%  (-21% -  36%)  0.237
BrowseRandomLabelTaxoFacets     5.39  (3.7%)     6.02 (52.8%)   11.7%  (-43% -  70%)  0.322
BrowseMonthTaxoFacets           8.20 (35.8%)     9.48 (37.2%)   15.6%  (-42% - 138%)  0.177
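
To make the idea concrete for the discussion, here is a minimal,
self-contained sketch of what an off-heap counterpart to BytesRef could
look like. It is not the code from the PR: OffHeapRefSketch, OffHeapRef,
the nested RandomAccess interface and wrap() are hypothetical names, and
a direct ByteBuffer stands in for Lucene's RandomAccessInput so that the
example runs without Lucene on the classpath. The point it illustrates is
that a consumer can read only the bytes it needs, relative to an offset
and length, without first copying the whole value on-heap.

```java
import java.nio.ByteBuffer;
import java.util.Objects;

public class OffHeapRefSketch {

  /** Simplified stand-in for positional reads (assumption: no IOException, int-sized positions). */
  public interface RandomAccess {
    byte readByte(long pos);
    int readInt(long pos);
  }

  /** Hypothetical off-heap analogue of BytesRef: an (offset, length) window over a RandomAccess. */
  public static final class OffHeapRef {
    private final RandomAccess in;
    private final long offset;
    private final int length;

    public OffHeapRef(RandomAccess in, long offset, int length) {
      this.in = Objects.requireNonNull(in);
      this.offset = offset;
      this.length = length;
    }

    public int length() {
      return length;
    }

    /** Reads one byte relative to the start of the window, without copying the value on-heap. */
    public byte byteAt(int index) {
      Objects.checkIndex(index, length);
      return in.readByte(offset + index);
    }

    /** Reads a 4-byte int relative to the start of the window. */
    public int intAt(int index) {
      Objects.checkFromIndexSize(index, Integer.BYTES, length);
      return in.readInt(offset + index);
    }

    /** Copies the whole window on-heap, for callers that still need a byte[]. */
    public byte[] toBytes() {
      byte[] copy = new byte[length];
      for (int i = 0; i < length; i++) {
        copy[i] = in.readByte(offset + i);
      }
      return copy;
    }
  }

  /** Adapts a ByteBuffer (for example a direct, off-heap one) to the stand-in interface. */
  public static RandomAccess wrap(ByteBuffer buffer) {
    return new RandomAccess() {
      @Override
      public byte readByte(long pos) {
        return buffer.get(Math.toIntExact(pos));
      }

      @Override
      public int readInt(long pos) {
        return buffer.getInt(Math.toIntExact(pos));
      }
    };
  }

  public static void main(String[] args) {
    ByteBuffer buffer = ByteBuffer.allocateDirect(16);
    buffer.putInt(4, 42);    // a header field at the start of "our" value
    buffer.put(8, (byte) 9); // first payload byte after the header

    // The value occupies bytes [4, 16) of the buffer; only the header is read here.
    OffHeapRef value = new OffHeapRef(wrap(buffer), 4, 12);
    System.out.println(value.intAt(0));
    System.out.println(value.byteAt(4));
  }
}
```

In the geo case mentioned above, a reader could inspect a small header
via intAt()/byteAt() and skip the rest of a large geometry, while
toBytes() remains available as the on-heap fallback.
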
On Thu, Dec 5, 2024 at 2:07 PM Michael Sokolov <[email protected]> wrote:
>
> That makes sense to me too in the abstract. At Amazon we also have
> interesting BDV fields we have to decode on the fly, so this looks
> attractive for that reason (not just faceting).
>
> I would say though that it would be easier to evaluate the fitness for
> purpose (faceting) if we had some examples of BinaryDocValues used for
> faceting (or otherwise being decoded on the fly) in the Lucene code
> base -- do we have that? I'd be concerned if we're not able to fully
> test the new functionality to see what the impact of any changes might
> be.
>
> On Thu, Dec 5, 2024 at 6:45 AM Chris Hegarty
> <[email protected]> wrote:
> >
> > Hi Ignacio,
> >
> > I completely agree with the idea of having a BytesRef-like thing that can
> > be off-heap. For a while now I’ve been thinking about how we could evolve
> > BytesRef so as to not expose its on-heap representation. Having a separate
> > primitive is probably a better way to go.
> >
> > -Chris.
> >
> > > On 5 Dec 2024, at 10:42, Ignacio Vera <[email protected]> wrote:
> > >
> > > Hello,
> > >
> > > I have been working with the idea of reading binary doc values
> > > off-heap for a while. The idea behind it is that binary doc values are
> > > > often used for faceting where structured data is encoded at write time
> > > and decoded at read time. It feels wasteful to have to read the data
> > > on-heap before decoding it when we can read the data directly from the
> > > off-heap buffer.
> > >
> > > The current proposal is to evolve the current API from an on-heap data
> > > structure (BytesRef) to an off-heap data structure (currently named
> > > RandomAccessInputRef). Because we are currently reading the data into
> > > the buffer using a RandomAccessInput with an offset and a length, it
> > > feels very natural to create an off-heap equivalent to BytesRef that
> > > is backed by a RandomAccessInput.
> > >
> > > > I am hoping to move this idea forward. Since this is a change to a
> > > > public API, I am asking for feedback and would love to hear other
> > > > opinions.
> > >
> > > Thank you!
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]
> > >