>
> One possible alternative could be to use the format as specified in the C
> Data Interface for key-value metadata:
>
> https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema.metadata
> (there it is used for the actual key-value metadata of a field, while here
> it is for formatting a single value. But since for this discussion the
> value is also a key-value mapping, the same scheme could be used).
> (since this is a binary format, this assumes that the discussion about
> allowing binary values in the key-value metadata in the IPC format gets
> resolved)
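
For reference, here is a minimal Python sketch of that layout as I read the C
Data Interface docs: an int32 pair count, then for each pair an int32 byte
length followed by the raw key bytes, and the same again for the value, with
integers in native endianness. The helper names encode_metadata and
decode_metadata are made up for illustration and are not part of any Arrow
library.

```python
import struct


def encode_metadata(pairs):
    """Serialize a {bytes: bytes} mapping in the ArrowSchema.metadata layout."""
    # "=i" is a 4-byte signed integer in native byte order, no padding.
    out = [struct.pack("=i", len(pairs))]
    for key, value in pairs.items():
        out.append(struct.pack("=i", len(key)))
        out.append(key)  # not null-terminated
        out.append(struct.pack("=i", len(value)))
        out.append(value)  # not null-terminated
    return b"".join(out)


def decode_metadata(buf):
    """Inverse of encode_metadata; returns a {bytes: bytes} dict."""
    (count,) = struct.unpack_from("=i", buf, 0)
    pos = 4
    pairs = {}
    for _ in range(count):
        (key_len,) = struct.unpack_from("=i", buf, pos)
        pos += 4
        key = bytes(buf[pos:pos + key_len])
        pos += key_len
        (value_len,) = struct.unpack_from("=i", buf, pos)
        pos += 4
        value = bytes(buf[pos:pos + value_len])
        pos += value_len
        pairs[key] = value
    return pairs
```

Nothing in this layout requires the values to be valid UTF-8, which is why it
only works as a value format if binary values are allowed.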

I think it likely depends on the complexity of the metadata.  If your
values are themselves complex, then using something like JSON or another
existing serialization format makes sense (e.g. this could also be
flatbuffers, protobuf).

> An alternative approach is to consider ARROW-15613 a bug and do not change
> the spec - require consumers to encode the binary data in a string
> representation like base64.
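
To make the base64 option concrete, a small sketch using only the Python
standard library. The function names are hypothetical; the point is just that
arbitrary bytes become an ASCII (hence valid UTF-8) string that a
spec-following reader can accept, at the cost of roughly 33% more bytes and an
extra decode step.

```python
import base64


def to_metadata_string(payload: bytes) -> str:
    """Encode arbitrary bytes as an ASCII string safe for a UTF-8 metadata value."""
    return base64.b64encode(payload).decode("ascii")


def from_metadata_string(text: str) -> bytes:
    """Recover the original bytes; raises binascii.Error on non-base64 input."""
    return base64.b64decode(text, validate=True)
```

A producer would store to_metadata_string(blob) as the metadata value and the
consumer would call from_metadata_string on what it reads back.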


My sense is that, while onerous, updating the specification is probably going
to be the safest way to avoid breaking existing users.  I would imagine the
process to get C++ compliant again would be:
1.  Add the ability to store arbitrary bytes to the specification.
2.  Start duplicating existing data between the two fields.
3.  At some point later, stop producing non-spec-compliant data in C++.


> I just think it is important that we are consistent between the IPC and the
> c data interface.

+1

On Tue, Feb 8, 2022 at 8:38 AM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Hi,
>
> Great questions and write up. Thanks!
>
> imo dragging a JSON reader and writer to read official extension types'
> metadata seems overkill. The c data interface is expected to be quite low
> level. Imo we should aim for a (non-human readable) binary format. For
> non-official, imo you are spot on - use what best fits to the use-case or
> application. If the application is storing other metadata in json, json may
> make sense, in Python pickle is another option, flatbuffers or something
> like that is also ok imo.
>
> Wrt binary, imo the challenge is:
> * we state that backward incompatible changes to the c data interface
> require a new spec [1]
> * we state that the metadata is a binary string [2]
> * a valid string is a subset of all valid byte arrays and thus removing "
> *string*" from the spec is backward incompatible
>
> If we write invalid utf8 to it and a reader assumes utf8 when reading it,
> we trigger undefined behavior.
>
> I was a bit surprised by ARROW-15613 - my understanding is that the c++
> implementation is not following the spec, and if we at arrow2 were not
> checking for utf8, we would be exposing a vulnerability (at least according
> to Rust's standards). We just checked it out of luck (it is O(1), so why
> not).
>
> What is the concern with string-encoding binary like base64?
>
> Given that one of our reference implementations is not following the spec
> and there is value in allowing arbitrary bytes on the metadata values, we
> may as well just update the spec to align with the reference
> implementation? If we do that, I would suggest that we do it both in the c
> data interface and the IPC specification, since imo it is quite important
> that an extension can flow all the way through IPC and c data interface.
>
> An alternative approach is to consider ARROW-15613 a bug and do not change
> the spec - require consumers to encode the binary data in a string
> representation like base64.
>
> I just think it is important that we are consistent between the IPC and the
> c data interface.
>
> For reference, Polars uses base64 encoding of Python blobs (pickle,
> pointers, etc.) because we enforce the spec on arrow2.
>
> Best,
> Jorge
>
> [1]
>
> https://arrow.apache.org/docs/format/CDataInterface.html#updating-this-specification
> [2]
>
> https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema.metadata
> [ARROW-15613]
> https://issues.apache.org/jira/browse/ARROW-15613
>
