I think what would really help is some concrete numbers. Do we have any numbers comparing the performance of the offset-based and pointer-based representations? If there isn't a significant performance difference between them, would the systems that currently use a pointer-based approach be willing to meet us in the middle and switch to an offset-based encoding? To me, this feels like it would be the best outcome for the ecosystem as a whole.

Kind Regards,

Raphael

On 02/10/2023 13:50, Antoine Pitrou wrote:

Le 01/10/2023 à 16:21, Micah Kornfield a écrit :

I would also assert that another way to reduce this risk is to add
some prose to the relevant sections of the columnar format
specification doc to clearly explain that a raw pointers variant of
the layout, while not part of the official spec, may be implemented in
some Arrow libraries.

I've lost a little context on all the concerns about adding raw pointers as an official option to the spec, but I see raw-pointer variants as
the best path forward.

Things captured from this thread, or that seem obvious at least to me:
1.  Divergence of IPC spec from in-memory/C-ABI spec?
2.  More parts of the spec to cover.
3.  Incompatibility with some languages.
4.  Validation (different use cases require different levels of
validation, so this is a little less of a concern in my mind).

I think the broader issue is how we think about compatibility with other
systems.  For instance, what happens if Velox and DuckDB start adding new
divergent memory layouts?  Are we expecting to add them to the spec?

This is a slippery slope. The more Arrow has a policy of integrating existing practices simply because they exist, the more the Arrow format will become _à la carte_, with different implementations choosing to implement whatever they want to spend their engineering effort on (you can see this occur, in part, in the Parquet format with its many different encodings, compression algorithms and a 96-bit timestamp type).

We _have_ to think carefully about the middle- and long-term future of the format when adopting new features.

In this instance, we are doing a large part of the effort by adopting a string view format with variadic buffers, inlined prefixes and offset-based views into those buffers. But some implementations with historically different internal representations will have to share part of the effort to align with the newly standardized format.
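
For readers who have not followed the layout discussion, here is a minimal sketch of what an offset-based view with an inlined prefix could look like. It assumes 16-byte little-endian views in which strings of up to 12 bytes are stored inline, and longer strings carry a 4-byte prefix, a data buffer index and an offset into that buffer. The names encode_view and decode_view are hypothetical helpers for illustration only, not part of any Arrow library.

```python
import struct

INLINE_MAX = 12  # assumed threshold: strings up to 12 bytes live inside the view


def encode_view(value, buffer_index, offset):
    """Pack one 16-byte offset-based view (little-endian, no padding)."""
    if len(value) <= INLINE_MAX:
        # int32 length followed by the data itself, zero-padded to 12 bytes
        return struct.pack("<i12s", len(value), value)
    # int32 length, 4-byte prefix, index of the data buffer, offset into it
    return struct.pack("<i4sii", len(value), value[:4], buffer_index, offset)


def decode_view(view, buffers):
    """Resolve a view back to its bytes, given the list of data buffers."""
    (length,) = struct.unpack_from("<i", view, 0)
    if length <= INLINE_MAX:
        return view[4:4 + length]
    _prefix, buffer_index, offset = struct.unpack_from("<4sii", view, 4)
    return buffers[buffer_index][offset:offset + length]


# One data buffer holding a single long string at offset 0.
buffers = [b"a string longer than twelve bytes"]
short_view = encode_view(b"hello", 0, 0)
long_view = encode_view(buffers[0], 0, 0)
```

A pointer-based variant would, under the same assumptions, replace the buffer index and offset with a single 64-bit raw address. That makes the view meaningful only inside the process that owns the memory, whereas the offset-based form stays valid when the buffers are copied or sent over IPC.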

I don't think "we have to adjust the Arrow format so that existing internal representations become Arrow-compliant without any (re-)implementation effort" is a reasonable design principle.

Regards

Antoine.
