On 01/10/2023 at 16:21, Micah Kornfield wrote:
I would also assert that another way to reduce this risk is to add
some prose to the relevant sections of the columnar format
specification doc to clearly explain that a raw pointers variant of
the layout, while not part of the official spec, may be implemented in
some Arrow libraries.
I've lost a little context on all the concerns about adding raw pointers
as an official option to the spec, but I see making raw-pointer variants
as the best path forward.
Concerns captured from this thread, or that seem obvious at least to me:
1. Divergence of IPC spec from in-memory/C-ABI spec?
2. More parts of the spec to cover.
3. Incompatibility with some languages.
4. Validation (in my mind, different use cases require different levels of
validation, so this is a little less of a concern).
I think the broader issue is how we think about compatibility with other
systems. For instance, what happens if Velox and DuckDB start adding new
divergent memory layouts? Are we expecting to add them to the spec?
This is a slippery slope. The more Arrow has a policy of integrating
existing practices simply because they exist, the more the Arrow format
will become _à la carte_, with different implementations choosing to
implement whatever they want to spend their engineering effort on (you
can see this occur, in part, in the Parquet format with its many
different encodings, compression algorithms and a 96-bit timestamp type).
We _have_ to think carefully about the middle- and long-term future of
the format when adopting new features.
In this instance, we are doing a large part of the effort by adopting a
string view format with variadic buffers, inlined prefixes and
offset-based views into those buffers. But some implementations with
historically different internal representations will have to share part
of the effort to align with the newly standardized format.
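
For readers less familiar with the layout being discussed, here is a
minimal Python sketch of a 16-byte offset-based view as I understand the
proposal: strings of up to 12 bytes are inlined, longer ones store a
4-byte prefix plus a buffer index and an offset into one of the variadic
data buffers. The names `encode_view` and `decode_view` are hypothetical
helpers for illustration only, not part of any Arrow library, and
little-endian packing is an assumption of this sketch.

```python
import struct
from typing import Sequence

INLINE_MAX = 12  # strings up to 12 bytes live inside the view itself


def encode_view(value: bytes, buffer_index: int, offset: int) -> bytes:
    """Pack one 16-byte view (hypothetical helper, illustration only).

    buffer_index and offset are ignored for inlined (short) strings.
    """
    if len(value) <= INLINE_MAX:
        # int32 length + up to 12 bytes of inlined data, zero-padded
        return struct.pack("<i12s", len(value), value)
    # int32 length + 4-byte prefix + int32 buffer index + int32 offset
    return struct.pack("<i4sii", len(value), value[:4], buffer_index, offset)


def decode_view(view: bytes, buffers: Sequence[bytes]) -> bytes:
    """Resolve a view back to its bytes, using the data buffers if needed."""
    (length,) = struct.unpack_from("<i", view, 0)
    if length <= INLINE_MAX:
        return view[4:4 + length]
    _prefix, buffer_index, offset = struct.unpack_from("<4sii", view, 4)
    return buffers[buffer_index][offset:offset + length]


buffers = [b"hello, variadic buffers!"]
short_view = encode_view(b"arrow", 0, 0)           # inlined
long_view = encode_view(buffers[0][7:], 0, 7)      # refers to buffers[0] at offset 7
```

A raw-pointer variant would replace the buffer index and offset with a
single 8-byte address, which is exactly the part that cannot be carried
over IPC or checked against buffer bounds.
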
I don't think "we have to adjust the Arrow format so that existing
internal representations become Arrow-compliant without any
(re-)implementation effort" is a reasonable design principle.
Regards
Antoine.