Re: [DISCUSS][C++] Raw pointer string views

Raphael Taylor-Davies Mon, 02 Oct 2023 13:08:06 -0700

Thank you for this, my major takeaways are:

- The performance characteristics of the two view representations arebroadly equivalent, with an extremely minor edge to the pointerrepresentation- Both view types represent a significant performance win overconverting to the non-view representations- Conversion between the two view representations is relatively fast,especially for small strings


Given this might one path forward be for

- Arrow to only support the offset-based representation as currentlystandardised- Velox/DuckDB add support for both offset and pointer basedrepresentations internally- Velox/DuckDB eventually switch over to solely using an offset-basedrepresentation

This would give an immediate win over the current state of play, with apath towards full zero-copy arrow compatibility in the future, whilstavoiding bifurcating the arrow specification/implementations?


Kind Regards,

Raphael

On 02/10/2023 20:23, Benjamin Kietzman wrote:

@Antoine

By construction, a standard cannot continuously chase the performance

state of

art, it has to weigh the benefits of performance improvements against the
increased cost for the ecosystem.
We have extension types which could reasonably be used for non-standard
data types, especially the kind that are motivated by leading-edge
performance research and innovation and come with unusual constraints
(such as requiring trusting and dereferencing raw pointers embedded in
data buffers). There could even be an argument for making some of them
canonical extension types if there's enough anteriority in favor.

I agree that the standard becoming unfocused and ala carte is to be avoided.
However I would argue that the addition of raw pointer views as a C ABI
extension type represents little of that danger. The addition is entirely
bounded as it is binary (with or without raw pointers), and is semantically
identical to the index/offset representation with the caveat of differing
dereferencing.

This last makes it more nuanced than the cases covered by canonical
extension
types. If raw pointer views were a canonical extension using index/offset
views
as their storage, they could not be naively validated by considering the
storage
alone since raw pointers couldn't be validly reinterpreting as an
index/offset pair.
In light of this and in the context of the C ABI I'd advocate for a
dedicated
data type descriptor [1] for raw pointer views to make the distinction more
obvious,
but otherwise I think it is reasonable to consider them an extension type.

I am currently working on a draft PR adding this type to the C ABI spec by
way of
a proposal for everyone's consideration.

@Raphael

Do you have any benchmarks comparing kernels with native pointer array

support,

compared to those that must first convert to the offset representation? I

think

this would help ground this discussion empirically.

Conversion cost can end up being a bottleneck for rapid operations, much
more so
than the overhead of accessing either representation (hence the advantage of
operating in place on whatever representation we have). For a simple
comparison,
I benchmarked checking random views in each representation for alphanumeric
characters. Raw pointer views have a 6% advantage.

```
Run on (16 X 5300 MHz CPU s)
CPU Caches:
   L1 Data 32 KiB (x8)
   L1 Instruction 32 KiB (x8)
   L2 Unified 256 KiB (x8)
   L3 Unified 16384 KiB (x1)
-----------------------------------------------------------------------------------------------
Benchmark                                     Time             CPU
Iterations UserCounters...
-----------------------------------------------------------------------------------------------
IsAlnum<kIndexOffsetViews>             24086566 ns     24084367 ns
   29 bytes_per_second=664.276M/s items_per_second=43.5376M/s
IsAlnum<kRawPointerViews>              22613218 ns     22612044 ns
   31 bytes_per_second=707.529M/s items_per_second=46.3725M/s
```

Below is a table of benchmarks for conversion between representations.
To summarize those:

- Conversion cost is strongly dependent on string lengths.
   - I would tentatively focus on `kUsuallyInlineable` as most representative
- Conversion from views to dense strings is slowest since we must copy each
   view's characters into a single character buffer.
   - comparable to parsing 64 bit floats
   - preallocation is not perfect, so the character buffer must be check for
     resizing inside the hot loop
- Conversion from dense strings to views is 2-4x faster
   - comparable to parsing 64 bit integers
   - we only need to allocate once before conversion
   - we still need to access the character buffer in order to copy inline
contents
     or cached prefixes into the headers
- Conversion from index/offset to raw pointer views is fairly quick
   - comparable to integer->integer conversion with safe overflow and high
null percentage
- Conversion from raw pointer to index/offset views is 2/3 as fast

```
--------------------------------------------------------------------------------------------------------------------------------------
Benchmark
          Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------------------
ConvertViews<(from type), (to type), (length category)>
--------------------------------------------------------------------------------------------------------------------------------------
ConvertViews<kStrings, kIndexOffsetViews, kAlwaysInlineable>
    42877057 ns     42875885 ns           16 items_per_second=32.6081M/s
                                           kUsuallyInlineable>
   34079672 ns     34075604 ns           21 items_per_second=30.772M/s
                                           kShortButNeverInlineable>
   16044043 ns     16043702 ns           43 items_per_second=34.8573M/s
                                           kLongAndSeldomInlineable>
    1717984 ns      1717955 ns          376 items_per_second=38.1477M/s
                                           kLongAndNeverInlineable>
     1707074 ns      1706973 ns          413 items_per_second=38.3931M/s

ConvertViews<kIndexOffsetViews, kStrings, kAlwaysInlineable>
    85538939 ns     85532072 ns            8 items_per_second=16.3459M/s
                                           kUsuallyInlineable>
   66432452 ns     66417147 ns           10 items_per_second=15.7877M/s
                                           kShortButNeverInlineable>
   36025089 ns     36021631 ns           19 items_per_second=15.5251M/s
                                           kLongAndSeldomInlineable>
    8791312 ns      8789937 ns           80 items_per_second=7.4558M/s
                                           kLongAndNeverInlineable>
     6272905 ns      6272238 ns          112 items_per_second=10.4486M/s

ConvertViews<kRawPointerViews, kIndexOffsetViews, kAlwaysInlineable>
    15400749 ns     15400729 ns           45 items_per_second=90.7815M/s
                                                   kUsuallyInlineable>
   21527529 ns     21527622 ns           33 items_per_second=48.7084M/s
                                                   kShortButNeverInlineable>
   25101062 ns     25099755 ns           28 items_per_second=22.2807M/s
                                                   kLongAndSeldomInlineable>
    2665299 ns      2665111 ns          262 items_per_second=24.5903M/s
                                                   kLongAndNeverInlineable>
     2694563 ns      2694485 ns          260 items_per_second=24.3223M/s

ConvertViews<kIndexOffsetViews, kRawPointerViews, kAlwaysInlineable>
    15359965 ns     15358626 ns           46 items_per_second=91.0303M/s
                                                   kUsuallyInlineable>
   13967232 ns     13967093 ns           50 items_per_second=75.0748M/s
                                                   kShortButNeverInlineable>
    7861021 ns      7860546 ns           89 items_per_second=71.1452M/s
                                                   kLongAndSeldomInlineable>
     729323 ns       729272 ns          969 items_per_second=89.865M/s
                                                   kLongAndNeverInlineable>
      709887 ns       709827 ns          965 items_per_second=92.3267M/s
```

Sincerely,
Ben Kietzman

[1]
https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings

On Mon, Oct 2, 2023 at 9:22 AM Andrew Lamb <[email protected]> wrote:

I don't think "we have to adjust the Arrow format so that existing
internal representations become Arrow-compliant without any
(re-)implementation effort" is a reasonable design principle.

I agree with this statement from Antoine -- given the Arrow community has
standardized an addition to the format with StringView, I think it would
help to get some input from those at DuckDB and Velox on their perspective

Andrew




On Mon, Oct 2, 2023 at 9:17 AM Raphael Taylor-Davies
<[email protected]> wrote:

Oh I'm with you on it being a precedent we want to be very careful about
setting, but if there isn't a meaningful performance difference, we may
be able to sidestep that discussion entirely.

On 02/10/2023 14:11, Antoine Pitrou wrote:

Even if performance were significant better, I don't think it's a good
enough reason to add these representations to Arrow. By construction,
a standard cannot continuously chase the performance state of art, it
has to weigh the benefits of performance improvements against the
increased cost for the ecosystem (for example the cost of adapting to
frequent standard changes and a growing standard size).

We have extension types which could reasonably be used for
non-standard data types, especially the kind that are motivated by
leading-edge performance research and innovation and come with unusual
constraints (such as requiring trusting and dereferencing raw pointers
embedded in data buffers). There could even be an argument for making
some of them canonical extension types if there's enough anteriority
in favor.

Regards

Antoine.


Le 02/10/2023 à 15:00, Raphael Taylor-Davies a écrit :

I think what would really help would be some concrete numbers, do we
have any numbers comparing the performance of the offset and pointer
based representations? If there isn't a significant performance
difference between them, would the systems that currently use a
pointer-based approach be willing to meet us in the middle and switch

to

an offset based encoding? This to me feels like it would be the best
outcome for the ecosystem as a whole.

Kind Regards,

Raphael

On 02/10/2023 13:50, Antoine Pitrou wrote:

Le 01/10/2023 à 16:21, Micah Kornfield a écrit :

I would also assert that another way to reduce this risk is to add
some prose to the relevant sections of the columnar format
specification doc to clearly explain that a raw pointers variant of
the layout, while not part of the official spec, may be
implemented in
some Arrow libraries.

I've lost a little context but on all the concerns of adding raw
pointers
as an official option to the spec.  But I see making raw-pointer
variants
the best path forward.

Things captured from this thread or seem obvious at least to me:
1.  Divergence of IPC spec from in-memory/C-ABI spec?
2.  More parts of the spec to cover.
3.  In-compatibility with some languages
4.  Validation (in my mind different use-cases require different
levels of
validation, so this is a little bit less of a concern in my mind).

I think the broader issue is how we think about compatibility with
other
systems.  For instance, what happens if Velox and DuckDb start

adding

new
divergent memory layouts?  Are we expecting to add them to the spec?

This is a slippery slope. The more Arrow has a policy of integrating
existing practices simply because they exist, the more the Arrow
format will become _à la carte_, with different implementations
choosing to implement whatever they want to spend their engineering
effort on (you can see this occur, in part, on the Parquet format

with

its many different encodings, compression algorithms and a 96-bit
timestamp type).

We _have_ to think carefully about the middle- and long-term future

of

the format when adopting new features.

In this instance, we are doing a large part of the effort by adopting
a string view format with variadic buffers, inlined prefixes and
offset-based views into those buffers. But some implementations with
historically different internal representations will have to share
part of the effort to align with the newly standardized format.

I don't think "we have to adjust the Arrow format so that existing
internal representations become Arrow-compliant without any
(re-)implementation effort" is a reasonable design principle.

Regards

Antoine.

Re: [DISCUSS][C++] Raw pointer string views

Reply via email to