Re: [DISCUSS][C++] Raw pointer string views

Raphael Taylor-Davies Mon, 02 Oct 2023 06:00:31 -0700

I think what would really help would be some concrete numbers. Do we have any numbers comparing the performance of the offset and pointer based representations? If there isn't a significant performance difference between them, would the systems that currently use a pointer-based approach be willing to meet us in the middle and switch to an offset based encoding? This to me feels like it would be the best outcome for the ecosystem as a whole.


Kind Regards,


Raphael

On 02/10/2023 13:50, Antoine Pitrou wrote:

On 01/10/2023 at 16:21, Micah Kornfield wrote:
I would also assert that another way to reduce this risk is to add
some prose to the relevant sections of the columnar format
specification doc to clearly explain that a raw pointers variant of
the layout, while not part of the official spec, may be implemented in
some Arrow libraries.
I've lost a little context but on all the concerns of adding raw pointers as an official option to the spec. But I see making raw-pointer variants
the best path forward.

Things captured from this thread or seem obvious at least to me:
1.  Divergence of IPC spec from in-memory/C-ABI spec?
2.  More parts of the spec to cover.
3.  Incompatibility with some languages
4. Validation (in my mind different use-cases require different levels of
validation, so this is a little bit less of a concern in my mind).

I think the broader issue is how we think about compatibility with other
systems. For instance, what happens if Velox and DuckDB start adding new
divergent memory layouts?  Are we expecting to add them to the spec?
This is a slippery slope. The more Arrow has a policy of integrating existing practices simply because they exist, the more the Arrow format will become _à la carte_, with different implementations choosing to implement whatever they want to spend their engineering effort on (you can see this occur, in part, on the Parquet format with its many different encodings, compression algorithms and a 96-bit timestamp type).
We _have_ to think carefully about the middle- and long-term future of the format when adopting new features.
In this instance, we are doing a large part of the effort by adopting a string view format with variadic buffers, inlined prefixes and offset-based views into those buffers. But some implementations with historically different internal representations will have to share part of the effort to align with the newly standardized format.
I don't think "we have to adjust the Arrow format so that existing internal representations become Arrow-compliant without any (re-)implementation effort" is a reasonable design principle.
Regards

Antoine.
