Re: [DISCUSS][C++] Raw pointer string views

Benjamin Kietzman Mon, 02 Oct 2023 12:24:14 -0700

@Antoine
> By construction, a standard cannot continuously chase the performance
> state of art, it has to weigh the benefits of performance improvements
> against the increased cost for the ecosystem.


> We have extension types which could reasonably be used for non-standard
> data types, especially the kind that are motivated by leading-edge
> performance research and innovation and come with unusual constraints
> (such as requiring trusting and dereferencing raw pointers embedded in
> data buffers). There could even be an argument for making some of them
> canonical extension types if there's enough anteriority in favor.

I agree that we should avoid the standard becoming unfocused and à la carte.
However, I would argue that adding raw pointer views as a C ABI extension
type carries little of that danger. The addition is entirely bounded, since
it is binary (with or without raw pointers), and it is semantically identical
to the index/offset representation; only the way a view is dereferenced
differs.

This last point makes it more nuanced than the cases covered by canonical
extension types. If raw pointer views were a canonical extension type using
index/offset views as their storage, they could not be naively validated by
considering the storage alone, since a raw pointer cannot be validly
reinterpreted as an index/offset pair. In light of this, and in the context
of the C ABI, I'd advocate for a dedicated data type descriptor [1] for raw
pointer views to make the distinction more obvious, but otherwise I think it
is reasonable to consider them an extension type.
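
To make the "same semantics, different dereferencing" point concrete, here is
a minimal sketch of the two 16-byte headers. All names (`IndexOffsetView`,
`RawPointerView`, `IsValid`, `Get`, `MakeRawPointerView`, `Buffers`) are
hypothetical and chosen for illustration; this is not the layout code from
the Arrow C++ library, and `std::vector<std::string>` merely stands in for
the variadic character buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical names for illustration; this is not Arrow's C++ API.
constexpr int32_t kInlineSize = 12;
using Buffers = std::vector<std::string>;  // stand-in for the character buffers

// Both headers occupy 16 bytes and share the same inline form.
union IndexOffsetView {
  struct { int32_t size; char data[kInlineSize]; } inlined;
  struct { int32_t size; char prefix[4]; int32_t buffer_index; int32_t offset; } ref;
};
union RawPointerView {
  struct { int32_t size; char data[kInlineSize]; } inlined;
  struct { int32_t size; char prefix[4]; const char* data; } ref;
};
static_assert(sizeof(IndexOffsetView) == 16 && sizeof(RawPointerView) == 16, "");

// An index/offset view can be bounds-checked against the buffers it refers to.
bool IsValid(const IndexOffsetView& v, const Buffers& buffers) {
  if (v.inlined.size < 0) return false;
  if (v.inlined.size <= kInlineSize) return true;
  if (v.ref.buffer_index < 0 || v.ref.offset < 0) return false;
  if (static_cast<std::size_t>(v.ref.buffer_index) >= buffers.size()) return false;
  const std::string& buf = buffers[v.ref.buffer_index];
  return static_cast<std::size_t>(v.ref.offset) + v.ref.size <= buf.size();
}

// Dereferencing an index/offset view needs the buffers.
std::string_view Get(const IndexOffsetView& v, const Buffers& buffers) {
  assert(IsValid(v, buffers));
  const auto size = static_cast<std::size_t>(v.inlined.size);
  if (v.inlined.size <= kInlineSize) return {v.inlined.data, size};
  return {buffers[v.ref.buffer_index].data() + v.ref.offset, size};
}

// Dereferencing a raw pointer view needs nothing else, but the embedded
// pointer has to be trusted: there is no buffer to check it against.
std::string_view Get(const RawPointerView& v) {
  const auto size = static_cast<std::size_t>(v.inlined.size);
  if (v.inlined.size <= kInlineSize) return {v.inlined.data, size};
  return {v.ref.data, size};
}

// Builds a raw pointer view over `size` characters starting at `chars`.
RawPointerView MakeRawPointerView(const char* chars, int32_t size) {
  RawPointerView v{};
  if (size <= kInlineSize) {
    v.inlined.size = size;
    std::memcpy(v.inlined.data, chars, static_cast<std::size_t>(size));
  } else {
    v.ref.size = size;
    std::memcpy(v.ref.prefix, chars, 4);
    v.ref.data = chars;
  }
  return v;
}
```

In this sketch the last 8 bytes of a non-inline header hold either an
index/offset pair or a pointer. Running `IsValid` over the bytes of a
`RawPointerView` would therefore interpret pointer bits as an index and an
offset, which is why the storage alone cannot be validated.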

I am currently working on a draft PR that adds this type to the C ABI spec,
as a proposal for everyone's consideration.

@Raphael
> Do you have any benchmarks comparing kernels with native pointer array
> support, compared to those that must first convert to the offset
> representation? I think this would help ground this discussion empirically.

Conversion cost can end up being a bottleneck for fast operations, much more
so than the overhead of accessing either representation (hence the advantage
of operating in place on whatever representation we have). For a simple
comparison, I benchmarked checking random views in each representation for
alphanumeric characters. Raw pointer views have a roughly 6% advantage.

```
Run on (16 X 5300 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB (x1)
-----------------------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------
IsAlnum<kIndexOffsetViews>             24086566 ns     24084367 ns           29 bytes_per_second=664.276M/s items_per_second=43.5376M/s
IsAlnum<kRawPointerViews>              22613218 ns     22612044 ns           31 bytes_per_second=707.529M/s items_per_second=46.3725M/s
```
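
For readers who want a picture of what "operating in place on whatever
representation we have" can look like, here is a rough sketch of such a
kernel. It is not the benchmarked code: `IsAlnumKernel` and the `get_chars`
accessor are hypothetical names, and the accessor is assumed to turn one view
header (in either representation) into a `std::string_view`.

```cpp
#include <cctype>
#include <string_view>
#include <vector>

// True if every character is a letter or digit (per the current C locale).
// An empty string is treated as alphanumeric in this sketch.
inline bool IsAlnum(std::string_view s) {
  for (unsigned char c : s) {
    if (!std::isalnum(c)) return false;
  }
  return true;
}

// Applies IsAlnum to each view without converting between representations.
// Only `get_chars` (a hypothetical accessor) knows how a view is dereferenced.
template <typename View, typename GetChars>
std::vector<bool> IsAlnumKernel(const std::vector<View>& views, GetChars&& get_chars) {
  std::vector<bool> out;
  out.reserve(views.size());
  for (const View& v : views) {
    out.push_back(IsAlnum(get_chars(v)));
  }
  return out;
}
```

With this shape, the only per-representation cost is inside the accessor: a
buffer lookup plus an offset for index/offset views, or a direct pointer read
for raw pointer views.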

Below is a table of benchmarks for conversion between representations.
To summarize those:

- Conversion cost is strongly dependent on string lengths.
  - I would tentatively focus on `kUsuallyInlineable` as most representative
- Conversion from views to dense strings is slowest since we must copy each
  view's characters into a single character buffer.
  - comparable to parsing 64 bit floats
  - preallocation is not perfect, so the character buffer must be checked for
    resizing inside the hot loop
- Conversion from dense strings to views is 2-4x faster
  - comparable to parsing 64 bit integers
  - we only need to allocate once before conversion
  - we still need to access the character buffer in order to copy inline
    contents or cached prefixes into the headers
- Conversion from index/offset to raw pointer views is fairly quick
  - comparable to integer->integer conversion with safe overflow and a high
    null percentage
- Conversion from raw pointer to index/offset views is about 2/3 as fast for
  `kUsuallyInlineable`, and slower still when views are seldom or never
  inlineable

```
--------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                           Time            CPU Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------------------
ConvertViews<(from type), (to type), (length category)>
--------------------------------------------------------------------------------------------------------------------------------------
ConvertViews<kStrings, kIndexOffsetViews, kAlwaysInlineable>                 42877057 ns    42875885 ns         16 items_per_second=32.6081M/s
                                          kUsuallyInlineable>                34079672 ns    34075604 ns         21 items_per_second=30.772M/s
                                          kShortButNeverInlineable>          16044043 ns    16043702 ns         43 items_per_second=34.8573M/s
                                          kLongAndSeldomInlineable>           1717984 ns     1717955 ns        376 items_per_second=38.1477M/s
                                          kLongAndNeverInlineable>            1707074 ns     1706973 ns        413 items_per_second=38.3931M/s

ConvertViews<kIndexOffsetViews, kStrings, kAlwaysInlineable>                 85538939 ns    85532072 ns          8 items_per_second=16.3459M/s
                                          kUsuallyInlineable>                66432452 ns    66417147 ns         10 items_per_second=15.7877M/s
                                          kShortButNeverInlineable>          36025089 ns    36021631 ns         19 items_per_second=15.5251M/s
                                          kLongAndSeldomInlineable>           8791312 ns     8789937 ns         80 items_per_second=7.4558M/s
                                          kLongAndNeverInlineable>            6272905 ns     6272238 ns        112 items_per_second=10.4486M/s

ConvertViews<kRawPointerViews, kIndexOffsetViews, kAlwaysInlineable>         15400749 ns    15400729 ns         45 items_per_second=90.7815M/s
                                                  kUsuallyInlineable>        21527529 ns    21527622 ns         33 items_per_second=48.7084M/s
                                                  kShortButNeverInlineable>  25101062 ns    25099755 ns         28 items_per_second=22.2807M/s
                                                  kLongAndSeldomInlineable>   2665299 ns     2665111 ns        262 items_per_second=24.5903M/s
                                                  kLongAndNeverInlineable>    2694563 ns     2694485 ns        260 items_per_second=24.3223M/s

ConvertViews<kIndexOffsetViews, kRawPointerViews, kAlwaysInlineable>         15359965 ns    15358626 ns         46 items_per_second=91.0303M/s
                                                  kUsuallyInlineable>        13967232 ns    13967093 ns         50 items_per_second=75.0748M/s
                                                  kShortButNeverInlineable>   7861021 ns     7860546 ns         89 items_per_second=71.1452M/s
                                                  kLongAndSeldomInlineable>    729323 ns      729272 ns        969 items_per_second=89.865M/s
                                                  kLongAndNeverInlineable>     709887 ns      709827 ns        965 items_per_second=92.3267M/s
```
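
As a rough illustration of the two view-to-view conversions (not the code
that produced the numbers above), here is a minimal sketch. The layouts and
the names `ToRawPointer`, `ToIndexOffset` and `Buffers` are hypothetical, and
the linear scan used to locate the owning buffer is an assumption made for
brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical layouts and names for illustration; not Arrow's C++ API.
constexpr int32_t kInlineSize = 12;
using Buffers = std::vector<std::string>;  // stand-in for the character buffers

union IndexOffsetView {
  struct { int32_t size; char data[kInlineSize]; } inlined;
  struct { int32_t size; char prefix[4]; int32_t buffer_index; int32_t offset; } ref;
};
union RawPointerView {
  struct { int32_t size; char data[kInlineSize]; } inlined;
  struct { int32_t size; char prefix[4]; const char* data; } ref;
};

// Index/offset -> raw pointer: one buffer lookup and an add per non-inline view.
RawPointerView ToRawPointer(const IndexOffsetView& in, const Buffers& buffers) {
  RawPointerView out{};
  if (in.inlined.size <= kInlineSize) {
    out.inlined.size = in.inlined.size;
    std::memcpy(out.inlined.data, in.inlined.data, kInlineSize);
  } else {
    assert(static_cast<std::size_t>(in.ref.buffer_index) < buffers.size());
    out.ref.size = in.ref.size;
    std::memcpy(out.ref.prefix, in.ref.prefix, 4);
    out.ref.data = buffers[in.ref.buffer_index].data() + in.ref.offset;
  }
  return out;
}

// Raw pointer -> index/offset: the buffer containing the pointed-to characters
// has to be found first. Returns nullopt if no buffer contains them.
std::optional<IndexOffsetView> ToIndexOffset(const RawPointerView& in,
                                             const Buffers& buffers) {
  IndexOffsetView out{};
  if (in.inlined.size <= kInlineSize) {
    out.inlined.size = in.inlined.size;
    std::memcpy(out.inlined.data, in.inlined.data, kInlineSize);
    return out;
  }
  const std::less_equal<const char*> le{};  // total order, even across buffers
  const char* begin = in.ref.data;
  const char* end = begin + in.ref.size;
  for (std::size_t i = 0; i < buffers.size(); ++i) {
    const char* buf_begin = buffers[i].data();
    const char* buf_end = buf_begin + buffers[i].size();
    if (le(buf_begin, begin) && le(end, buf_end)) {
      out.ref.size = in.ref.size;
      std::memcpy(out.ref.prefix, in.ref.prefix, 4);
      out.ref.buffer_index = static_cast<int32_t>(i);
      out.ref.offset = static_cast<int32_t>(begin - buf_begin);
      return out;
    }
  }
  return std::nullopt;
}
```

In the sketch, inline views are copied verbatim in both directions, while
non-inline views need a lookup in one direction and a search in the other.
That asymmetry is one plausible contributor to the gap between the last two
benchmark groups; the output above does not by itself establish the cause.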

Sincerely,
Ben Kietzman

[1]
https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings

On Mon, Oct 2, 2023 at 9:22 AM Andrew Lamb <al...@influxdata.com> wrote:

> > I don't think "we have to adjust the Arrow format so that existing
> > internal representations become Arrow-compliant without any
> > (re-)implementation effort" is a reasonable design principle.
>
> I agree with this statement from Antoine -- given the Arrow community has
> standardized an addition to the format with StringView, I think it would
> help to get some input from those at DuckDB and Velox on their perspective
>
> Andrew
>
>
>
>
> On Mon, Oct 2, 2023 at 9:17 AM Raphael Taylor-Davies
> <r.taylordav...@googlemail.com.invalid> wrote:
>
> > Oh I'm with you on it being a precedent we want to be very careful about
> > setting, but if there isn't a meaningful performance difference, we may
> > be able to sidestep that discussion entirely.
> >
> > On 02/10/2023 14:11, Antoine Pitrou wrote:
> > >
> > > Even if performance were significant better, I don't think it's a good
> > > enough reason to add these representations to Arrow. By construction,
> > > a standard cannot continuously chase the performance state of art, it
> > > has to weigh the benefits of performance improvements against the
> > > increased cost for the ecosystem (for example the cost of adapting to
> > > frequent standard changes and a growing standard size).
> > >
> > > We have extension types which could reasonably be used for
> > > non-standard data types, especially the kind that are motivated by
> > > leading-edge performance research and innovation and come with unusual
> > > constraints (such as requiring trusting and dereferencing raw pointers
> > > embedded in data buffers). There could even be an argument for making
> > > some of them canonical extension types if there's enough anteriority
> > > in favor.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > Le 02/10/2023 à 15:00, Raphael Taylor-Davies a écrit :
> > >> I think what would really help would be some concrete numbers, do we
> > >> have any numbers comparing the performance of the offset and pointer
> > >> based representations? If there isn't a significant performance
> > >> difference between them, would the systems that currently use a
> > >> pointer-based approach be willing to meet us in the middle and switch to
> > >> an offset based encoding? This to me feels like it would be the best
> > >> outcome for the ecosystem as a whole.
> > >>
> > >> Kind Regards,
> > >>
> > >> Raphael
> > >>
> > >> On 02/10/2023 13:50, Antoine Pitrou wrote:
> > >>>
> > >>> Le 01/10/2023 à 16:21, Micah Kornfield a écrit :
> > >>>>>
> > >>>>> I would also assert that another way to reduce this risk is to add
> > >>>>> some prose to the relevant sections of the columnar format
> > >>>>> specification doc to clearly explain that a raw pointers variant of
> > >>>>> the layout, while not part of the official spec, may be
> > >>>>> implemented in
> > >>>>> some Arrow libraries.
> > >>>>
> > >>>> I've lost a little context but on all the concerns of adding raw
> > >>>> pointers
> > >>>> as an official option to the spec.  But I see making raw-pointer
> > >>>> variants
> > >>>> the best path forward.
> > >>>>
> > >>>> Things captured from this thread or seem obvious at least to me:
> > >>>> 1.  Divergence of IPC spec from in-memory/C-ABI spec?
> > >>>> 2.  More parts of the spec to cover.
> > >>>> 3.  In-compatibility with some languages
> > >>>> 4.  Validation (in my mind different use-cases require different
> > >>>> levels of
> > >>>> validation, so this is a little bit less of a concern in my mind).
> > >>>>
> > >>>> I think the broader issue is how we think about compatibility with
> > >>>> other
> > >>>> systems.  For instance, what happens if Velox and DuckDb start adding
> > >>>> new divergent memory layouts?  Are we expecting to add them to the spec?
> > >>>
> > >>> This is a slippery slope. The more Arrow has a policy of integrating
> > >>> existing practices simply because they exist, the more the Arrow
> > >>> format will become _à la carte_, with different implementations
> > >>> choosing to implement whatever they want to spend their engineering
> > >>> effort on (you can see this occur, in part, on the Parquet format with
> > >>> its many different encodings, compression algorithms and a 96-bit
> > >>> timestamp type).
> > >>>
> > >>> We _have_ to think carefully about the middle- and long-term future of
> > >>> the format when adopting new features.
> > >>>
> > >>> In this instance, we are doing a large part of the effort by adopting
> > >>> a string view format with variadic buffers, inlined prefixes and
> > >>> offset-based views into those buffers. But some implementations with
> > >>> historically different internal representations will have to share
> > >>> part of the effort to align with the newly standardized format.
> > >>>
> > >>> I don't think "we have to adjust the Arrow format so that existing
> > >>> internal representations become Arrow-compliant without any
> > >>> (re-)implementation effort" is a reasonable design principle.
> > >>>
> > >>> Regards
> > >>>
> > >>> Antoine.
> >
>
