Re: Off-heap binary doc values

Ignacio Vera Thu, 05 Dec 2024 08:38:47 -0800

@Chris: Agreeing on an off-heap BytesRef thingy would be a great step forward.


@Mike: Yes, there are other use cases. One that is close to my heart
is the geo use case, where in many cases you don't need to read all the
bytes and geometries can be big. In Lucene there are some interesting
usages in the facets module, which I have already implemented in the PR.
I ran the wikimedium benchmark on it and I think it shows an
improvement on the facets runs, as well as some regressions:

                            Task    QPS base      StdDev    QPS cand      StdDev  Pct diff         (range) p-value
     BrowseRandomLabelSSDVFacets        5.21     (12.5%)        5.01      (8.8%)   -3.8% ( -22% -   20%) 0.264
          OrHighMedDayTaxoFacets        4.78      (5.5%)        4.68      (5.0%)   -2.1% ( -11% -    8%) 0.211
            HighTermTitleBDVSort        9.11      (1.5%)        8.96      (1.4%)   -1.7% (  -4% -    1%) 0.000
       BrowseDayOfYearSSDVFacets        6.64     (15.7%)        6.54     (15.1%)   -1.5% ( -27% -   34%) 0.765
           BrowseMonthSSDVFacets        6.50     (11.8%)        6.42     (11.6%)   -1.3% ( -22% -   25%) 0.729
                        HighTerm      565.97      (6.5%)      562.79      (7.0%)   -0.6% ( -13% -   13%) 0.793
               HighTermMonthSort     2350.90      (4.5%)     2338.02      (4.0%)   -0.5% (  -8% -    8%) 0.684
                     AndHighHigh       55.14      (3.3%)       54.97      (4.1%)   -0.3% (  -7% -    7%) 0.803
                    OrHighNotMed      406.21      (7.2%)      405.14      (6.6%)   -0.3% ( -13% -   14%) 0.904
                   OrNotHighHigh      425.43      (3.5%)      424.78      (3.3%)   -0.2% (  -6% -    6%) 0.886
            MedTermDayTaxoFacets       30.90      (2.0%)       30.86      (1.6%)   -0.1% (  -3% -    3%) 0.834
                 MedSloppyPhrase       32.77      (3.7%)       32.73      (3.7%)   -0.1% (  -7% -    7%) 0.921
                      AndHighMed      184.44      (3.4%)      184.37      (3.6%)   -0.0% (  -6% -    7%) 0.969
                 LowSloppyPhrase       52.47      (1.7%)       52.47      (1.6%)    0.0% (  -3% -    3%) 0.996
                   OrHighNotHigh      619.12      (5.0%)      619.46      (4.6%)    0.1% (  -9% -   10%) 0.971
                    OrHighNotLow      567.45      (6.5%)      568.21      (6.2%)    0.1% ( -11% -   13%) 0.947
                        PKLookup      275.72      (2.3%)      276.13      (3.3%)    0.1% (  -5% -    5%) 0.872
             LowIntervalsOrdered        6.15      (2.1%)        6.16      (2.5%)    0.1% (  -4% -    4%) 0.836
                          IntNRQ       76.94      (5.8%)       77.11      (4.5%)    0.2% (  -9% -   11%) 0.895
                HighSloppyPhrase        2.32      (2.7%)        2.33      (2.3%)    0.3% (  -4% -    5%) 0.685
                         LowTerm      629.84      (3.5%)      632.74      (3.3%)    0.5% (  -6% -    7%) 0.670
                     LowSpanNear       99.79      (2.7%)      100.30      (3.2%)    0.5% (  -5% -    6%) 0.589
                         MedTerm      889.05      (4.2%)      893.75      (4.6%)    0.5% (  -7% -    9%) 0.703
                    OrNotHighMed      361.55      (3.2%)      363.50      (3.1%)    0.5% (  -5% -    7%) 0.591
                         Prefix3      134.42      (4.3%)      135.18      (3.8%)    0.6% (  -7% -    8%) 0.656
               HighTermTitleSort      188.27      (2.1%)      189.35      (2.5%)    0.6% (  -3% -    5%) 0.423
            HighIntervalsOrdered        7.97      (4.8%)        8.02      (6.0%)    0.6% (  -9% -   11%) 0.736
                        Wildcard       57.19      (2.9%)       57.53      (3.1%)    0.6% (  -5% -    6%) 0.525
                       OrHighLow      551.58      (2.9%)      555.28      (2.5%)    0.7% (  -4% -    6%) 0.436
             MedIntervalsOrdered       29.22      (4.9%)       29.41      (6.1%)    0.7% (  -9% -   12%) 0.697
                       MedPhrase       30.11      (2.1%)       30.32      (1.5%)    0.7% (  -2% -    4%) 0.241
                      OrHighHigh       54.77      (6.7%)       55.15      (5.2%)    0.7% ( -10% -   13%) 0.714
                          Fuzzy1      108.14      (2.8%)      108.90      (2.4%)    0.7% (  -4% -    6%) 0.403
                       OrHighMed      182.51      (5.4%)      183.80      (3.4%)    0.7% (  -7% -   10%) 0.622
         AndHighMedDayTaxoFacets       30.18      (3.1%)       30.40      (2.3%)    0.7% (  -4% -    6%) 0.403
           HighTermDayOfYearSort      462.68      (3.7%)      466.03      (3.6%)    0.7% (  -6% -    8%) 0.532
                      AndHighLow     1225.05      (5.2%)     1233.95      (4.5%)    0.7% (  -8% -   10%) 0.636
                     MedSpanNear       13.85      (2.2%)       13.95      (2.0%)    0.7% (  -3% -    5%) 0.264
                       LowPhrase      204.19      (2.6%)      205.88      (1.9%)    0.8% (  -3% -    5%) 0.247
                      HighPhrase      105.85      (3.1%)      106.80      (2.6%)    0.9% (  -4% -    6%) 0.322
                          Fuzzy2       22.92      (2.6%)       23.13      (2.1%)    0.9% (  -3% -    5%) 0.233
                      TermDTSort      295.84      (7.3%)      298.66      (6.6%)    1.0% ( -12% -   16%) 0.665
                         Respell       78.37      (2.3%)       79.15      (1.8%)    1.0% (  -2% -    5%) 0.125
        AndHighHighDayTaxoFacets        2.70      (4.8%)        2.72      (2.5%)    1.0% (  -6% -    8%) 0.407
                    OrNotHighLow     1134.11      (3.2%)     1146.96      (3.8%)    1.1% (  -5% -    8%) 0.310
                    HighSpanNear        3.88      (7.1%)        3.95      (4.9%)    1.7% (  -9% -   14%) 0.376
                           range     5910.33      (9.7%)     6049.55      (8.0%)    2.4% ( -14% -   22%) 0.403
            BrowseDateSSDVFacets        1.19     (14.3%)        1.24     (19.0%)    4.1% ( -25% -   43%) 0.446
            BrowseDateTaxoFacets        6.67      (4.6%)        7.08     (24.2%)    6.1% ( -21% -   36%) 0.264
       BrowseDayOfYearTaxoFacets        6.74      (4.9%)        7.17     (23.8%)    6.4% ( -21% -   36%) 0.237
     BrowseRandomLabelTaxoFacets        5.39      (3.7%)        6.02     (52.8%)   11.7% ( -43% -   70%) 0.322
           BrowseMonthTaxoFacets        8.20     (35.8%)        9.48     (37.2%)   15.6% ( -42% -  138%) 0.177
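
To make the "you don't need to read all the bytes" point concrete, here is
a rough sketch of what an off-heap BytesRef-like reference could look like.
This is not the code in the PR and not Lucene's API: the names
OffHeapRefSketch, RandomAccess, DirectBufferAccess and OffHeapRef are made
up for illustration, and a direct ByteBuffer stands in for a
RandomAccessInput so the example runs on the JDK alone.

```java
import java.nio.ByteBuffer;
import java.util.Objects;

/** Hypothetical sketch only; none of these types exist in Lucene. */
final class OffHeapRefSketch {

  /** Assumed stand-in for a random-access view over off-heap data. */
  interface RandomAccess {
    byte readByte(long pos);
    int readInt(long pos);
  }

  /** Implements the view on top of a direct (off-heap) ByteBuffer. */
  static final class DirectBufferAccess implements RandomAccess {
    private final ByteBuffer buffer;

    DirectBufferAccess(ByteBuffer buffer) {
      this.buffer = buffer;
    }

    @Override
    public byte readByte(long pos) {
      return buffer.get(Math.toIntExact(pos)); // absolute read, no copy
    }

    @Override
    public int readInt(long pos) {
      return buffer.getInt(Math.toIntExact(pos)); // absolute read, no copy
    }
  }

  /** Off-heap analogue of BytesRef: a source plus offset and length. */
  static final class OffHeapRef {
    final RandomAccess in;
    final long offset;
    final int length;

    OffHeapRef(RandomAccess in, long offset, int length) {
      this.in = in;
      this.offset = offset;
      this.length = length;
    }

    /** Reads one byte at a position relative to the start of the value. */
    byte byteAt(int index) {
      Objects.checkIndex(index, length);
      return in.readByte(offset + index);
    }

    /** Reads four bytes as an int at a position relative to the start. */
    int intAt(int index) {
      Objects.checkFromIndexSize(index, Integer.BYTES, length);
      return in.readInt(offset + index);
    }
  }

  public static void main(String[] args) {
    ByteBuffer direct = ByteBuffer.allocateDirect(1024);
    // Fake 512-byte value at offset 100: a 4-byte header, then a payload.
    direct.putInt(100, 42);
    direct.put(104, (byte) 7);

    OffHeapRef ref = new OffHeapRef(new DirectBufferAccess(direct), 100, 512);
    // Only 5 of the 512 bytes are touched; nothing is copied to a byte[].
    System.out.println(ref.intAt(0));  // prints 42
    System.out.println(ref.byteAt(4)); // prints 7
  }
}
```

A consumer that only needs the header of a large value (for example a
bounding box in front of a big geometry) would read those few bytes through
the reference and never materialize the rest on-heap.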



On Thu, Dec 5, 2024 at 2:07 PM Michael Sokolov <[email protected]> wrote:
>
> That makes sense to me too in the abstract. At Amazon we also have
> interesting BDV fields we have to decode on the fly, so this looks
> attractive for that reason (not just faceting).
>
> I would say though that it would be easier to evaluate the fitness for
> purpose (faceting) if we had some examples of BinaryDocValues used for
> faceting (or otherwise being decoded on the fly) in the Lucene code
> base -- do we have that?  I'd be concerned if we're not able to fully
> test the new functionality to see what the impact of any changes might
> be.
>
> On Thu, Dec 5, 2024 at 6:45 AM Chris Hegarty
> <[email protected]> wrote:
> >
> > Hi Ignacio,
> >
> > I completely agree with the idea of having a BytesRef-like thing that can 
> > be off-heap. For a while now I’ve been thinking about how we could evolve 
> > BytesRef so as to not expose its on-heap representation. Having a separate 
> > primitive is probably a better way to go.
> >
> > -Chris.
> >
> > > On 5 Dec 2024, at 10:42, Ignacio Vera <[email protected]> wrote:
> > >
> > > Hello,
> > >
> > > I have been working with the idea of reading binary doc values
> > > off-heap for a while. The idea behind it is that binary doc values are
> > > often used for faceting where structured data is encoded at write time
> > > and decoded at read time. It feels wasteful to have to read the data
> > > on-heap before decoding it when we can read the data directly from the
> > > off-heap buffer.
> > >
> > > The current proposal is to evolve the current API from an on-heap data
> > > structure (BytesRef) to an off-heap data structure (currently named
> > > RandomAccessInputRef). Because we are currently reading the data into
> > > the buffer using a RandomAccessInput with an offset and a length, it
> > > feels very natural to create an off-heap equivalent to BytesRef that
> > > is backed by a RandomAccessInput.
> > >
> > > I am hoping to move this idea forward. Since this is a change to a
> > > public API, I am asking for feedback and would love to hear other
> > > opinions.
> > >
> > > Thank you!
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]
> > >
> >
>
