@Antoine > By construction, a standard cannot continuously chase the performance state of > art, it has to weigh the benefits of performance improvements against the > increased cost for the ecosystem.
> We have extension types which could reasonably be used for non-standard > data types, especially the kind that are motivated by leading-edge > performance research and innovation and come with unusual constraints > (such as requiring trusting and dereferencing raw pointers embedded in > data buffers). There could even be an argument for making some of them > canonical extension types if there's enough anteriority in favor. I agree that the standard becoming unfocused and ala carte is to be avoided. However I would argue that the addition of raw pointer views as a C ABI extension type represents little of that danger. The addition is entirely bounded as it is binary (with or without raw pointers), and is semantically identical to the index/offset representation with the caveat of differing dereferencing. This last makes it more nuanced than the cases covered by canonical extension types. If raw pointer views were a canonical extension using index/offset views as their storage, they could not be naively validated by considering the storage alone since raw pointers couldn't be validly reinterpreting as an index/offset pair. In light of this and in the context of the C ABI I'd advocate for a dedicated data type descriptor [1] for raw pointer views to make the distinction more obvious, but otherwise I think it is reasonable to consider them an extension type. I am currently working on a draft PR adding this type to the C ABI spec by way of a proposal for everyone's consideration. @Raphael > Do you have any benchmarks comparing kernels with native pointer array support, > compared to those that must first convert to the offset representation? I think > this would help ground this discussion empirically. Conversion cost can end up being a bottleneck for rapid operations, much more so than the overhead of accessing either representation (hence the advantage of operating in place on whatever representation we have). For a simple comparison, I benchmarked checking random views in each representation for alphanumeric characters. Raw pointer views have a 6% advantage. ``` Run on (16 X 5300 MHz CPU s) CPU Caches: L1 Data 32 KiB (x8) L1 Instruction 32 KiB (x8) L2 Unified 256 KiB (x8) L3 Unified 16384 KiB (x1) ----------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters... ----------------------------------------------------------------------------------------------- IsAlnum<kIndexOffsetViews> 24086566 ns 24084367 ns 29 bytes_per_second=664.276M/s items_per_second=43.5376M/s IsAlnum<kRawPointerViews> 22613218 ns 22612044 ns 31 bytes_per_second=707.529M/s items_per_second=46.3725M/s ``` Below is a table of benchmarks for conversion between representations. To summarize those: - Conversion cost is strongly dependent on string lengths. - I would tentatively focus on `kUsuallyInlineable` as most representative - Conversion from views to dense strings is slowest since we must copy each view's characters into a single character buffer. - comparable to parsing 64 bit floats - preallocation is not perfect, so the character buffer must be check for resizing inside the hot loop - Conversion from dense strings to views is 2-4x faster - comparable to parsing 64 bit integers - we only need to allocate once before conversion - we still need to access the character buffer in order to copy inline contents or cached prefixes into the headers - Conversion from index/offset to raw pointer views is fairly quick - comparable to integer->integer conversion with safe overflow and high null percentage - Conversion from raw pointer to index/offset views is 2/3 as fast ``` -------------------------------------------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters... -------------------------------------------------------------------------------------------------------------------------------------- ConvertViews<(from type), (to type), (length category)> -------------------------------------------------------------------------------------------------------------------------------------- ConvertViews<kStrings, kIndexOffsetViews, kAlwaysInlineable> 42877057 ns 42875885 ns 16 items_per_second=32.6081M/s kUsuallyInlineable> 34079672 ns 34075604 ns 21 items_per_second=30.772M/s kShortButNeverInlineable> 16044043 ns 16043702 ns 43 items_per_second=34.8573M/s kLongAndSeldomInlineable> 1717984 ns 1717955 ns 376 items_per_second=38.1477M/s kLongAndNeverInlineable> 1707074 ns 1706973 ns 413 items_per_second=38.3931M/s ConvertViews<kIndexOffsetViews, kStrings, kAlwaysInlineable> 85538939 ns 85532072 ns 8 items_per_second=16.3459M/s kUsuallyInlineable> 66432452 ns 66417147 ns 10 items_per_second=15.7877M/s kShortButNeverInlineable> 36025089 ns 36021631 ns 19 items_per_second=15.5251M/s kLongAndSeldomInlineable> 8791312 ns 8789937 ns 80 items_per_second=7.4558M/s kLongAndNeverInlineable> 6272905 ns 6272238 ns 112 items_per_second=10.4486M/s ConvertViews<kRawPointerViews, kIndexOffsetViews, kAlwaysInlineable> 15400749 ns 15400729 ns 45 items_per_second=90.7815M/s kUsuallyInlineable> 21527529 ns 21527622 ns 33 items_per_second=48.7084M/s kShortButNeverInlineable> 25101062 ns 25099755 ns 28 items_per_second=22.2807M/s kLongAndSeldomInlineable> 2665299 ns 2665111 ns 262 items_per_second=24.5903M/s kLongAndNeverInlineable> 2694563 ns 2694485 ns 260 items_per_second=24.3223M/s ConvertViews<kIndexOffsetViews, kRawPointerViews, kAlwaysInlineable> 15359965 ns 15358626 ns 46 items_per_second=91.0303M/s kUsuallyInlineable> 13967232 ns 13967093 ns 50 items_per_second=75.0748M/s kShortButNeverInlineable> 7861021 ns 7860546 ns 89 items_per_second=71.1452M/s kLongAndSeldomInlineable> 729323 ns 729272 ns 969 items_per_second=89.865M/s kLongAndNeverInlineable> 709887 ns 709827 ns 965 items_per_second=92.3267M/s ``` Sincerely, Ben Kietzman [1] https://arrow.apache.org/docs/format/CDataInterface.html#data-type-description-format-strings On Mon, Oct 2, 2023 at 9:22 AM Andrew Lamb <al...@influxdata.com> wrote: > > I don't think "we have to adjust the Arrow format so that existing > > internal representations become Arrow-compliant without any > > (re-)implementation effort" is a reasonable design principle. > > I agree with this statement from Antoine -- given the Arrow community has > standardized an addition to the format with StringView, I think it would > help to get some input from those at DuckDB and Velox on their perspective > > Andrew > > > > > On Mon, Oct 2, 2023 at 9:17 AM Raphael Taylor-Davies > <r.taylordav...@googlemail.com.invalid> wrote: > > > Oh I'm with you on it being a precedent we want to be very careful about > > setting, but if there isn't a meaningful performance difference, we may > > be able to sidestep that discussion entirely. > > > > On 02/10/2023 14:11, Antoine Pitrou wrote: > > > > > > Even if performance were significant better, I don't think it's a good > > > enough reason to add these representations to Arrow. By construction, > > > a standard cannot continuously chase the performance state of art, it > > > has to weigh the benefits of performance improvements against the > > > increased cost for the ecosystem (for example the cost of adapting to > > > frequent standard changes and a growing standard size). > > > > > > We have extension types which could reasonably be used for > > > non-standard data types, especially the kind that are motivated by > > > leading-edge performance research and innovation and come with unusual > > > constraints (such as requiring trusting and dereferencing raw pointers > > > embedded in data buffers). There could even be an argument for making > > > some of them canonical extension types if there's enough anteriority > > > in favor. > > > > > > Regards > > > > > > Antoine. > > > > > > > > > Le 02/10/2023 à 15:00, Raphael Taylor-Davies a écrit : > > >> I think what would really help would be some concrete numbers, do we > > >> have any numbers comparing the performance of the offset and pointer > > >> based representations? If there isn't a significant performance > > >> difference between them, would the systems that currently use a > > >> pointer-based approach be willing to meet us in the middle and switch > to > > >> an offset based encoding? This to me feels like it would be the best > > >> outcome for the ecosystem as a whole. > > >> > > >> Kind Regards, > > >> > > >> Raphael > > >> > > >> On 02/10/2023 13:50, Antoine Pitrou wrote: > > >>> > > >>> Le 01/10/2023 à 16:21, Micah Kornfield a écrit : > > >>>>> > > >>>>> I would also assert that another way to reduce this risk is to add > > >>>>> some prose to the relevant sections of the columnar format > > >>>>> specification doc to clearly explain that a raw pointers variant of > > >>>>> the layout, while not part of the official spec, may be > > >>>>> implemented in > > >>>>> some Arrow libraries. > > >>>> > > >>>> I've lost a little context but on all the concerns of adding raw > > >>>> pointers > > >>>> as an official option to the spec. But I see making raw-pointer > > >>>> variants > > >>>> the best path forward. > > >>>> > > >>>> Things captured from this thread or seem obvious at least to me: > > >>>> 1. Divergence of IPC spec from in-memory/C-ABI spec? > > >>>> 2. More parts of the spec to cover. > > >>>> 3. In-compatibility with some languages > > >>>> 4. Validation (in my mind different use-cases require different > > >>>> levels of > > >>>> validation, so this is a little bit less of a concern in my mind). > > >>>> > > >>>> I think the broader issue is how we think about compatibility with > > >>>> other > > >>>> systems. For instance, what happens if Velox and DuckDb start > adding > > >>>> new > > >>>> divergent memory layouts? Are we expecting to add them to the spec? > > >>> > > >>> This is a slippery slope. The more Arrow has a policy of integrating > > >>> existing practices simply because they exist, the more the Arrow > > >>> format will become _à la carte_, with different implementations > > >>> choosing to implement whatever they want to spend their engineering > > >>> effort on (you can see this occur, in part, on the Parquet format > with > > >>> its many different encodings, compression algorithms and a 96-bit > > >>> timestamp type). > > >>> > > >>> We _have_ to think carefully about the middle- and long-term future > of > > >>> the format when adopting new features. > > >>> > > >>> In this instance, we are doing a large part of the effort by adopting > > >>> a string view format with variadic buffers, inlined prefixes and > > >>> offset-based views into those buffers. But some implementations with > > >>> historically different internal representations will have to share > > >>> part of the effort to align with the newly standardized format. > > >>> > > >>> I don't think "we have to adjust the Arrow format so that existing > > >>> internal representations become Arrow-compliant without any > > >>> (re-)implementation effort" is a reasonable design principle. > > >>> > > >>> Regards > > >>> > > >>> Antoine. > > >