Hi,

Great questions and write up. Thanks!

Imo dragging in a JSON reader and writer just to read official extension
types' metadata seems overkill. The C data interface is expected to be quite
low level, so imo we should aim for a (non-human-readable) binary format.
For non-official types, imo you are spot on: use whatever best fits the use
case or application. If the application already stores other metadata in
JSON, JSON may make sense; in Python, pickle is another option; flatbuffers
or something like that is also ok imo.

Wrt binary, imo the challenge is:
* we state that backward incompatible changes to the C data interface
require a new spec [1]
* we state that the metadata is a binary string [2]
* a valid string is a subset of all valid byte arrays and thus removing
"*string*" from the spec is backward incompatible

If we write invalid utf8 to it and a reader assumes utf8 when reading it,
we trigger undefined behavior.
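
To make the failure mode concrete, here is a minimal Python sketch of a
reader that validates a metadata value before treating it as a string. The
helper name read_metadata_value and the sample bytes are made up for
illustration; this is not code from arrow2 or any Arrow implementation.

```python
def read_metadata_value(raw: bytes) -> str:
    """Decode a metadata value, rejecting bytes that are not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # A reader that skipped this check and assumed UTF-8 would be
        # building a "string" out of bytes that are not one.
        raise ValueError("metadata value is not valid UTF-8") from exc


# Hypothetical sample values, for illustration only.
valid = "some-extension-name".encode("utf-8")
invalid = b"\xff\xfe raw binary payload"  # 0xff never occurs in valid UTF-8
```

In Python the bad input surfaces as an exception; in a language like Rust,
skipping the equivalent check and building a str from such bytes is where
the undefined behavior comes from.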

I was a bit surprised by ARROW-15613 - my understanding is that the C++
implementation is not following the spec, and if we at arrow2 were not
checking for utf8, we would be exposing a vulnerability (at least according
to Rust's standards). We just checked it out of luck (it is O(1), so why
not).

What is the concern with string-encoding binary like base64?

Given that one of our reference implementations is not following the spec
and there is value in allowing arbitrary bytes on the metadata values, we
may as well just update the spec to align with the reference
implementation? If we do that, I would suggest that we do it both in the c
data interface and the IPC specification, since imo it is quite important
that an extension can flow all the way through IPC and c data interface.

An alternative approach is to consider ARROW-15613 a bug and not change
the spec - require consumers to encode the binary data in a string
representation like base64.
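
As a rough sketch of what that string encoding could look like on the
Python side (the helper names encode_metadata and decode_metadata are
hypothetical, and this is not the actual Polars code):

```python
import base64
import pickle


def encode_metadata(obj) -> str:
    """Pickle an object and wrap the bytes in base64 so the result is a valid string."""
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")


def decode_metadata(value: str):
    """Reverse encode_metadata. Only unpickle data from a trusted producer."""
    return pickle.loads(base64.b64decode(value))


# Hypothetical payload, for illustration only.
payload = {"kind": "python-object", "version": 1}
encoded = encode_metadata(payload)
```

The encoded value is plain ASCII, so it is valid utf8 and stays within the
current wording of the spec, at the cost of roughly a third more bytes.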

I just think it is important that we are consistent between the IPC and the
c data interface.

For reference, Polars uses base64 encoding of Python blobs (pickle,
pointers, etc.) because we enforce the spec on arrow2.

Best,
Jorge

[1]
https://arrow.apache.org/docs/format/CDataInterface.html#updating-this-specification
[2]
https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema.metadata
[ARROW-15613] https://issues.apache.org/jira/browse/ARROW-15613
