Hi,

Great questions and write-up. Thanks!

IMO dragging in a JSON reader and writer just to read the metadata of official extension types seems overkill. The C data interface is expected to be quite low level, so IMO we should aim for a (non-human-readable) binary format.

For non-official types, IMO you are spot on: use whatever best fits the use case or application. If the application already stores other metadata in JSON, JSON may make sense; in Python, pickle is another option; flatbuffers or something like that is also OK IMO.

Wrt binary, IMO the challenge is:

* we state that backward incompatible changes to the C data interface require a new spec [1]
* we state that the metadata is a binary string [2]
* the set of valid strings is a subset of the set of valid byte arrays, and thus removing "*string*" from the spec is backward incompatible

If we write invalid UTF-8 to it and a reader assumes UTF-8 when reading it, we trigger undefined behavior. I was a bit surprised by ARROW-15613: my understanding is that the C++ implementation is not following the spec, and if we at arrow2 were not checking for UTF-8, we would be exposing a vulnerability (at least according to Rust's standards). We just checked it out of luck (it is O(1), so why not).

What is the concern with string-encoding binary, e.g. with base64?

Given that one of our reference implementations is not following the spec and there is value in allowing arbitrary bytes in the metadata values, we may as well just update the spec to align with the reference implementation? If we do that, I would suggest that we do it both in the C data interface and in the IPC specification, since IMO it is quite important that an extension can flow all the way through IPC and the C data interface.

An alternative approach is to consider ARROW-15613 a bug and not change the spec: require consumers to encode the binary data in a string representation like base64. I just think it is important that we are consistent between IPC and the C data interface.

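To make the base64 option concrete, here is a minimal Python sketch. It only illustrates the idea and is not the actual Polars or arrow2 code; the helper names (`encode_metadata_value`, `decode_metadata_value`, `is_valid_utf8`) and the example payload are made up for illustration.

```python
import base64
import pickle


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if `raw` decodes as UTF-8 (what a spec-following reader expects)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def encode_metadata_value(payload: bytes) -> str:
    """Hypothetical helper: wrap arbitrary bytes in a base64 (ASCII, hence UTF-8) string."""
    return base64.b64encode(payload).decode("ascii")


def decode_metadata_value(value: str) -> bytes:
    """Hypothetical helper: recover the original bytes from the base64 string."""
    return base64.b64decode(value.encode("ascii"), validate=True)


# Example payload (an assumption for illustration): a pickled Python object.
# Pickle protocol 2 and later starts with byte 0x80, which is not valid UTF-8 on its own.
blob = pickle.dumps({"unit": "m", "scale": 3}, protocol=4)

# The string form is what would be stored as the metadata value.
encoded = encode_metadata_value(blob)
restored = pickle.loads(decode_metadata_value(encoded))
```

Here `is_valid_utf8(blob)` is false while `is_valid_utf8(encoded.encode("utf-8"))` is true, so a reader that assumes UTF-8 stays safe. The cost is roughly a one-third size increase plus an encode/decode step on each side.
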
For reference, Polars uses base64 encoding of Python blobs (pickle, pointers, etc.) because we enforce the spec on arrow2.

Best,
Jorge

[1] https://arrow.apache.org/docs/format/CDataInterface.html#updating-this-specification
[2] https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema.metadata
[ARROW-15613] https://issues.apache.org/jira/browse/ARROW-15613