@Antoine
> By construction, a standard cannot continuously chase the performance state of
> art, it has to weigh the benefits of performance improvements against the
> increased cost for the ecosystem.
> We have extension types which could reasonably be used for non-standard
> data types, especially the kind that are motivated by leading-edge
> performance research and innovation and come with unusual constraints
> (such as requiring trusting and dereferencing raw pointers embedded in
> data buffers). There could even be an argument for making some of them
> canonical extension types if there's enough anteriority in favor.

I agree that the standard becoming unfocused and à la carte is to be avoided.
However, I would argue that adding raw pointer views as a C ABI extension type
poses little of that danger. The addition is strictly bounded, since it is
binary (with or without raw pointers), and it is semantically identical to the
index/offset representation, differing only in how a view is dereferenced.

This last point makes it more nuanced than the cases covered by canonical
extension types. If raw pointer views were a canonical extension type using
index/offset views as their storage, they could not be naively validated by
considering the storage alone, since a raw pointer cannot be validly
reinterpreted as an index/offset pair.
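
To make the validation concern concrete, here is a minimal C++ sketch. It
assumes 64-bit pointers and the 16-byte header shape discussed in this thread
(size, 4-byte prefix, then either a buffer index and offset or a pointer). All
names (`IndexOffsetView`, `RawPointerView`, `IsValid`,
`ReinterpretAsIndexOffset`, `kInlineSize`) are illustrative and are not the
Arrow C++ API.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative names only; this is not the Arrow C++ API.
constexpr int32_t kInlineSize = 12;  // views this short keep their bytes in the header

struct IndexOffsetView {
  int32_t size;
  char prefix[4];
  int32_t buffer_index;
  int32_t offset;
};

struct RawPointerView {
  int32_t size;
  char prefix[4];
  const char* data;
};

static_assert(sizeof(IndexOffsetView) == 16, "headers are 16 bytes");
static_assert(sizeof(RawPointerView) == 16, "assumes 64-bit pointers");

// Bounds-check one index/offset view against the sizes of the data buffers.
bool IsValid(const IndexOffsetView& v, const std::vector<int64_t>& buffer_sizes) {
  if (v.size < 0) return false;
  if (v.size <= kInlineSize) return true;  // inline: no buffer is referenced
  if (v.buffer_index < 0 ||
      static_cast<size_t>(v.buffer_index) >= buffer_sizes.size()) {
    return false;
  }
  if (v.offset < 0) return false;
  return static_cast<int64_t>(v.offset) + v.size <= buffer_sizes[v.buffer_index];
}

// What "validating the storage alone" would operate on: the same 16 bytes,
// read as if they were an index/offset view.
IndexOffsetView ReinterpretAsIndexOffset(const RawPointerView& p) {
  IndexOffsetView v;
  std::memcpy(&v, &p, sizeof(v));
  return v;
}
```

After `ReinterpretAsIndexOffset`, the `buffer_index` and `offset` fields each
hold half of an address, so whatever `IsValid` returns for such a view says
nothing about whether the pointer is safe to dereference.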

In light of this, and in the context of the C ABI, I'd advocate for a dedicated
data type descriptor [1] for raw pointer views to make the distinction more
obvious, but otherwise I think it is reasonable to consider them an extension
type.

I am currently working on a draft PR adding this type to the C ABI spec, by way
of a proposal for everyone's consideration.

@Raphael
> Do you have any benchmarks comparing kernels with native pointer array support,
> compared to those that must first convert to the offset representation? I think
> this would help ground this discussion empirically.

Conversion cost can end up being a bottleneck for rapid operations, much more so
than the overhead of accessing either representation (hence the advantage of
operating in place on whatever representation we have). For a simple comparison,
I benchmarked checking random views in each representation for alphanumeric
characters. Raw pointer views have a roughly 6% throughput advantage
(707.5M/s vs 664.3M/s).

```
Run on (16 X 5300 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB (x1)
-----------------------------------------------------------------------------------------------------------------------
Benchmark                          Time          CPU  Iterations  UserCounters...
-----------------------------------------------------------------------------------------------------------------------
IsAlnum<kIndexOffsetViews>  24086566 ns  24084367 ns          29  bytes_per_second=664.276M/s items_per_second=43.5376M/s
IsAlnum<kRawPointerViews>   22613218 ns  22612044 ns          31  bytes_per_second=707.529M/s items_per_second=46.3725M/s
```
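
For readers who want a picture of what such a kernel does, here is a rough C++
sketch of an alphanumeric check over both representations. It is not the
benchmark code that produced the numbers above; the struct layouts assume
64-bit pointers, and every name in it (`AllAlnum`, `InlineData`, `MakeInline`,
the two `IsAlnum` overloads) is made up for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative 16-byte view headers (assumes 64-bit pointers).
struct IndexOffsetView { int32_t size; char prefix[4]; int32_t buffer_index; int32_t offset; };
struct RawPointerView  { int32_t size; char prefix[4]; const char* data; };

constexpr int32_t kInlineSize = 12;  // strings up to this length live in the header

inline bool AllAlnum(const char* s, int32_t n) {
  for (int32_t i = 0; i < n; ++i) {
    if (!std::isalnum(static_cast<unsigned char>(s[i]))) return false;
  }
  return true;
}

// Inline bytes start right after the 4-byte size field in both layouts.
template <typename View>
const char* InlineData(const View& v) {
  return reinterpret_cast<const char*>(&v) + sizeof(int32_t);
}

// Helper for building inline views in examples; requires n <= kInlineSize.
template <typename View>
View MakeInline(const char* s, int32_t n) {
  View v{};
  v.size = n;
  std::memcpy(reinterpret_cast<char*>(&v) + sizeof(int32_t), s, n);
  return v;
}

// Index/offset: one extra lookup into the buffer table before dereferencing.
bool IsAlnum(const IndexOffsetView& v, const std::vector<const char*>& buffers) {
  const char* s =
      v.size <= kInlineSize ? InlineData(v) : buffers[v.buffer_index] + v.offset;
  return AllAlnum(s, v.size);
}

// Raw pointer: the header already holds the address of the characters.
bool IsAlnum(const RawPointerView& v) {
  const char* s = v.size <= kInlineSize ? InlineData(v) : v.data;
  return AllAlnum(s, v.size);
}
```

The only difference between the two overloads is the extra `buffers[...]`
lookup, which is consistent with the gap between them being small.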

Below is a table of benchmarks for conversion between representations.
To summarize those:

- Conversion cost is strongly dependent on string lengths.
  - I would tentatively focus on `kUsuallyInlineable` as most representative
- Conversion from views to dense strings is slowest since we must copy each
  view's characters into a single character buffer.
  - comparable to parsing 64 bit floats
  - preallocation is not perfect, so the character buffer must be checked for
    resizing inside the hot loop
- Conversion from dense strings to views is 2-4x faster
  - comparable to parsing 64 bit integers
  - we only need to allocate once before conversion
  - we still need to access the character buffer in order to copy inline
    contents or cached prefixes into the headers
- Conversion from index/offset to raw pointer views is fairly quick
  - comparable to integer->integer conversion with safe overflow and high
    null percentage
- Conversion from raw pointer to index/offset views is 2/3 as fast

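
As a rough illustration of the last two bullets, the C++ sketch below converts a
single view in each direction. It is not the code behind the benchmarks; it
assumes 64-bit pointers and data buffers smaller than 2 GiB, and the names
(`Buffer`, `ToRawPointer`, `ToIndexOffset`) are hypothetical. Going to raw
pointers is one table lookup and an add, while going back requires finding the
buffer that contains the pointer, which is one plausible contributor to the
asymmetry (the benchmark output alone does not establish the cause).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative 16-byte view headers (assumes 64-bit pointers).
struct IndexOffsetView { int32_t size; char prefix[4]; int32_t buffer_index; int32_t offset; };
struct RawPointerView  { int32_t size; char prefix[4]; const char* data; };
struct Buffer { const char* data; int64_t size; };

constexpr int32_t kInlineSize = 12;  // strings up to this length live in the header

// Index/offset -> raw pointer: resolve the buffer index once and store the address.
RawPointerView ToRawPointer(const IndexOffsetView& v, const std::vector<Buffer>& buffers) {
  RawPointerView out;
  if (v.size <= kInlineSize) {
    std::memcpy(&out, &v, sizeof(out));  // inline views are bit-identical
    return out;
  }
  out.size = v.size;
  std::memcpy(out.prefix, v.prefix, sizeof(out.prefix));
  out.data = buffers[v.buffer_index].data + v.offset;
  return out;
}

// Raw pointer -> index/offset: search for the buffer that contains the range.
// Returns false if the pointer does not fall inside any known buffer.
bool ToIndexOffset(const RawPointerView& v, const std::vector<Buffer>& buffers,
                   IndexOffsetView* out) {
  if (v.size <= kInlineSize) {
    std::memcpy(out, &v, sizeof(*out));
    return true;
  }
  const auto p = reinterpret_cast<uintptr_t>(v.data);
  for (size_t i = 0; i < buffers.size(); ++i) {
    const auto begin = reinterpret_cast<uintptr_t>(buffers[i].data);
    if (p >= begin &&
        (p - begin) + static_cast<uintptr_t>(v.size) <=
            static_cast<uintptr_t>(buffers[i].size)) {
      out->size = v.size;
      std::memcpy(out->prefix, v.prefix, sizeof(out->prefix));
      out->buffer_index = static_cast<int32_t>(i);
      out->offset = static_cast<int32_t>(p - begin);  // fits: buffers < 2 GiB
      return true;
    }
  }
  return false;
}
```

A real implementation would likely try the most recently matched buffer first
rather than scanning linearly; the linear scan here just keeps the sketch short.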
```
--------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                    Time          CPU  Iterations  UserCounters...
--------------------------------------------------------------------------------------------------------------------------------------
ConvertViews<(from type), (to type), (length category)>
--------------------------------------------------------------------------------------------------------------------------------------
ConvertViews<kStrings, kIndexOffsetViews, kAlwaysInlineable>          42877057 ns  42875885 ns          16  items_per_second=32.6081M/s
    kUsuallyInlineable>                                               34079672 ns  34075604 ns          21  items_per_second=30.772M/s
    kShortButNeverInlineable>                                         16044043 ns  16043702 ns          43  items_per_second=34.8573M/s
    kLongAndSeldomInlineable>                                          1717984 ns   1717955 ns         376  items_per_second=38.1477M/s
    kLongAndNeverInlineable>                                           1707074 ns   1706973 ns         413  items_per_second=38.3931M/s
ConvertViews<kIndexOffsetViews, kStrings, kAlwaysInlineable>          85538939 ns  85532072 ns           8  items_per_second=16.3459M/s
    kUsuallyInlineable>                                               66432452 ns  66417147 ns          10  items_per_second=15.7877M/s
    kShortButNeverInlineable>                                         36025089 ns  36021631 ns          19  items_per_second=15.5251M/s
    kLongAndSeldomInlineable>                                          8791312 ns   8789937 ns          80  items_per_second=7.4558M/s
    kLongAndNeverInlineable>                                           6272905 ns   6272238 ns         112  items_per_second=10.4486M/s
ConvertViews<kRawPointerViews, kIndexOffsetViews, kAlwaysInlineable>  15400749 ns  15400729 ns          45  items_per_second=90.7815M/s
    kUsuallyInlineable>                                               21527529 ns  21527622 ns          33  items_per_second=48.7084M/s
    kShortButNeverInlineable>                                         25101062 ns  25099755 ns          28  items_per_second=22.2807M/s
    kLongAndSeldomInlineable>                                          2665299 ns   2665111 ns         262  items_per_second=24.5903M/s
    kLongAndNeverInlineable>                                           2694563 ns   2694485 ns         260  items_per_second=24.3223M/s
ConvertViews<kIndexOffsetViews, kRawPointerViews, kAlwaysInlineable>  15359965 ns  15358626 ns          46  items_per_second=91.0303M/s
    kUsuallyInlineable>                                               13967232 ns  13967093 ns          50  items_per_second=75.0748M/s
    kShortButNeverInlineable>                                          7861021 ns   7860546 ns          89  items_per_second=71.1452M/s
    kLongAndSeldomInlineable>                                           729323 ns    729272 ns         969  items_per_second=89.865M/s
    kLongAndNeverInlineable>                                            709887 ns    709827 ns         965  items_per_second=92.3267M/s
```
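
For the `kStrings` to `kIndexOffsetViews` rows, the following C++ sketch shows
why a single allocation is enough while the character buffer still has to be
read. It is an illustration under assumptions, not the benchmarked
implementation: the dense layout is taken to be an offsets array plus one
character buffer, that buffer is reused as data buffer 0, and the function name
`StringsToViews` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Illustrative 16-byte index/offset view header.
struct IndexOffsetView { int32_t size; char prefix[4]; int32_t buffer_index; int32_t offset; };

constexpr int32_t kInlineSize = 12;  // strings up to this length live in the header

// Dense layout: string i occupies data[offsets[i], offsets[i + 1]).
std::vector<IndexOffsetView> StringsToViews(const std::vector<int32_t>& offsets,
                                            const char* data) {
  if (offsets.empty()) return {};
  // The only allocation: one zero-initialized header per string.
  std::vector<IndexOffsetView> views(offsets.size() - 1);
  for (size_t i = 0; i + 1 < offsets.size(); ++i) {
    IndexOffsetView& v = views[i];
    v.size = offsets[i + 1] - offsets[i];
    const char* s = data + offsets[i];
    if (v.size <= kInlineSize) {
      // Short string: copy its bytes into the 12 bytes after the size field.
      std::memcpy(reinterpret_cast<char*>(&v) + sizeof(int32_t), s, v.size);
    } else {
      // Long string: cache a 4-byte prefix and point into the existing buffer.
      std::memcpy(v.prefix, s, sizeof(v.prefix));
      v.buffer_index = 0;  // assumption: the dense character buffer is data buffer 0
      v.offset = offsets[i];
    }
  }
  return views;
}
```

Both branches read from `data`, either to inline the string or to cache its
prefix, so the character buffer is touched even though no characters are moved
into a new buffer.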
Sincerely,
Ben Kietzman
[1] https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings
On Mon, Oct 2, 2023 at 9:22 AM Andrew Lamb <[email protected]> wrote:
> > I don't think "we have to adjust the Arrow format so that existing
> > internal representations become Arrow-compliant without any
> > (re-)implementation effort" is a reasonable design principle.
>
> I agree with this statement from Antoine -- given the Arrow community has
> standardized an addition to the format with StringView, I think it would
> help to get some input from those at DuckDB and Velox on their perspective
>
> Andrew
>
>
>
>
> On Mon, Oct 2, 2023 at 9:17 AM Raphael Taylor-Davies
> <[email protected]> wrote:
>
> > Oh I'm with you on it being a precedent we want to be very careful about
> > setting, but if there isn't a meaningful performance difference, we may
> > be able to sidestep that discussion entirely.
> >
> > On 02/10/2023 14:11, Antoine Pitrou wrote:
> > >
> > > Even if performance were significant better, I don't think it's a good
> > > enough reason to add these representations to Arrow. By construction,
> > > a standard cannot continuously chase the performance state of art, it
> > > has to weigh the benefits of performance improvements against the
> > > increased cost for the ecosystem (for example the cost of adapting to
> > > frequent standard changes and a growing standard size).
> > >
> > > We have extension types which could reasonably be used for
> > > non-standard data types, especially the kind that are motivated by
> > > leading-edge performance research and innovation and come with unusual
> > > constraints (such as requiring trusting and dereferencing raw pointers
> > > embedded in data buffers). There could even be an argument for making
> > > some of them canonical extension types if there's enough anteriority
> > > in favor.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > Le 02/10/2023 à 15:00, Raphael Taylor-Davies a écrit :
> > >> I think what would really help would be some concrete numbers, do we
> > >> have any numbers comparing the performance of the offset and pointer
> > >> based representations? If there isn't a significant performance
> > >> difference between them, would the systems that currently use a
> > >> pointer-based approach be willing to meet us in the middle and switch to
> > >> an offset based encoding? This to me feels like it would be the best
> > >> outcome for the ecosystem as a whole.
> > >>
> > >> Kind Regards,
> > >>
> > >> Raphael
> > >>
> > >> On 02/10/2023 13:50, Antoine Pitrou wrote:
> > >>>
> > >>> Le 01/10/2023 à 16:21, Micah Kornfield a écrit :
> > >>>>>
> > >>>>> I would also assert that another way to reduce this risk is to add
> > >>>>> some prose to the relevant sections of the columnar format
> > >>>>> specification doc to clearly explain that a raw pointers variant of
> > >>>>> the layout, while not part of the official spec, may be
> > >>>>> implemented in
> > >>>>> some Arrow libraries.
> > >>>>
> > >>>> I've lost a little context but on all the concerns of adding raw
> > >>>> pointers
> > >>>> as an official option to the spec. But I see making raw-pointer
> > >>>> variants
> > >>>> the best path forward.
> > >>>>
> > >>>> Things captured from this thread or seem obvious at least to me:
> > >>>> 1. Divergence of IPC spec from in-memory/C-ABI spec?
> > >>>> 2. More parts of the spec to cover.
> > >>>> 3. In-compatibility with some languages
> > >>>> 4. Validation (in my mind different use-cases require different
> > >>>> levels of
> > >>>> validation, so this is a little bit less of a concern in my mind).
> > >>>>
> > >>>> I think the broader issue is how we think about compatibility with
> > >>>> other
> > >>>> systems. For instance, what happens if Velox and DuckDb start adding
> > >>>> new divergent memory layouts? Are we expecting to add them to the spec?
> > >>>
> > >>> This is a slippery slope. The more Arrow has a policy of integrating
> > >>> existing practices simply because they exist, the more the Arrow
> > >>> format will become _à la carte_, with different implementations
> > >>> choosing to implement whatever they want to spend their engineering
> > >>> effort on (you can see this occur, in part, on the Parquet format with
> > >>> its many different encodings, compression algorithms and a 96-bit
> > >>> timestamp type).
> > >>>
> > >>> We _have_ to think carefully about the middle- and long-term future of
> > >>> the format when adopting new features.
> > >>>
> > >>> In this instance, we are doing a large part of the effort by adopting
> > >>> a string view format with variadic buffers, inlined prefixes and
> > >>> offset-based views into those buffers. But some implementations with
> > >>> historically different internal representations will have to share
> > >>> part of the effort to align with the newly standardized format.
> > >>>
> > >>> I don't think "we have to adjust the Arrow format so that existing
> > >>> internal representations become Arrow-compliant without any
> > >>> (re-)implementation effort" is a reasonable design principle.
> > >>>
> > >>> Regards
> > >>>
> > >>> Antoine.
> >
>