Hi all,

There is currently some discussion about how we can formalize/document "well known" extension types (see the "[DISCUSS] New Types (Schema.fbs vs Extension Types)" thread). Rok is working on an extension type to store arrays / tensors (https://issues.apache.org/jira/browse/ARROW-1614), and my colleague Dewey and I are looking at extension types for geospatial data.
Often, for an extension type, you will want to store some metadata in the "ARROW:extension:metadata" field when serializing the type (see the format docs <https://arrow.apache.org/docs/format/Columnar.html#extension-types>; an example given there is {'type': 'int8', 'shape': [4, 5]} for a tensor array). The question is how exactly to format the data in this field, assuming the value itself is also some form of key-value metadata.

Last Wednesday I raised this question in the Arrow sync call. Copying from the meeting notes:

- Joris asked how we should store key-value metadata for extension types as a string; practical options seem limited to JSON or YAML; JSON seems most reasonable

Also, when implementing Arrow extension types in pandas (for some pandas data types that don't have a direct mapping to an Arrow type), I (naively) used a JSON dump because this is simply an easy solution when working in Python (example <https://github.com/pandas-dev/pandas/blob/7651c08230914ded8fceb93c990e2c859d51510d/pandas/core/arrays/_arrow_utils.py#L109-L111>).

Now, if you have a JSON library available, using JSON for this is indeed straightforward. But if we want the metadata to also be relatively easy to parse "by hand", there might be better alternatives. In https://github.com/paleolimbot/geoarrow/, Dewey has been working on an R package dealing with some extension types where the core is implemented in C, and he mentioned that dealing with JSON-like metadata would not be trivial (or at least more complex than what's currently being used there, see below).

One possible alternative could be to use the format specified in the C Data Interface for key-value metadata: https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema.metadata (There it is used for the actual key-value metadata of a field, while here it would be used to format a single value. But since for this discussion the value is also a key-value mapping, the same scheme could be used.)
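To make the comparison concrete, here is a rough Python sketch of the two options. It is only an illustration, not code from pandas or geoarrow. The helper names (encode_json, encode_kv, decode_kv) are made up for this example. The binary layout follows my reading of the C Data Interface docs: a native-endian int32 pair count, then for each pair an int32 key length, the key bytes, an int32 value length, and the value bytes, with no null terminators. Since values in that layout are flat byte strings, a nested value such as the shape needs its own convention; the "4,5" string used below is purely an assumption.

```python
import json
import struct


def encode_json(metadata):
    """Option 1: dump the mapping as a UTF-8 JSON string."""
    return json.dumps(metadata).encode("utf-8")


def encode_kv(pairs):
    """Option 2: pack a str -> str mapping using the ArrowSchema.metadata layout.

    Layout (native endianness): int32 number of pairs, then for each pair
    int32 key length, key bytes, int32 value length, value bytes.
    """
    chunks = [struct.pack("=i", len(pairs))]
    for key, value in pairs.items():
        key_bytes = key.encode("utf-8")
        value_bytes = value.encode("utf-8")
        chunks.append(struct.pack("=i", len(key_bytes)))
        chunks.append(key_bytes)
        chunks.append(struct.pack("=i", len(value_bytes)))
        chunks.append(value_bytes)
    return b"".join(chunks)


def decode_kv(buf):
    """Parse the buffer produced by encode_kv back into a dict."""
    pos = 0
    (n_pairs,) = struct.unpack_from("=i", buf, pos)
    pos += 4
    result = {}
    for _ in range(n_pairs):
        (key_len,) = struct.unpack_from("=i", buf, pos)
        pos += 4
        key = buf[pos:pos + key_len].decode("utf-8")
        pos += key_len
        (value_len,) = struct.unpack_from("=i", buf, pos)
        pos += 4
        value = buf[pos:pos + value_len].decode("utf-8")
        pos += value_len
        result[key] = value
    return result


if __name__ == "__main__":
    # JSON keeps the nested list as-is.
    print(encode_json({"type": "int8", "shape": [4, 5]}))
    # The flat layout needs an agreed string form for the shape (assumed here).
    packed = encode_kv({"type": "int8", "shape": "4,5"})
    print(packed)
    print(decode_kv(packed))
```

The decoder is just a loop over length-prefixed chunks, which is roughly what "parse by hand" would look like in C as well. The trade-off is that any structure inside a value (lists, nested mappings) is not covered by the layout and would have to be specified per extension type.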
(Since this is a binary format, this assumes that the discussion about allowing binary values in the key-value metadata in the IPC format gets resolved.)

Thoughts?

Joris