Hi all,

There is currently some discussion regarding how we can formalize/document
"well known" extension types (see the "[DISCUSS] New Types (Schema.fbs vs
Extension Types)" thread). There is ongoing work on an extension type to
store arrays / tensors by Rok (
https://issues.apache.org/jira/browse/ARROW-1614), and my colleague Dewey
and I are looking at extension types for geospatial data.

Often, for an extension type, you will want to store some metadata in the
"ARROW:extension:metadata" field when serializing the type (see the format
docs <https://arrow.apache.org/docs/format/Columnar.html#extension-types>;
an example given there is {'type': 'int8', 'shape': [4, 5]} for a tensor
array). The question is how exactly to format the data in this field,
assuming the value itself is also some form of key-value metadata.

Last Wednesday, I raised this question in the Arrow sync call, and copying
from the meeting notes:

- Joris asked how we should store key-value metadata for extension
types as a string; practical options seem limited to JSON or YAML;
JSON seems most reasonable

Also, when implementing Arrow extension types in pandas (for some pandas
data types that don't have a direct mapping to an Arrow type), I (naively)
used a JSON dump because that is simply the easy solution when working in
Python (example
<https://github.com/pandas-dev/pandas/blob/7651c08230914ded8fceb93c990e2c859d51510d/pandas/core/arrays/_arrow_utils.py#L109-L111>
).
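
To make the JSON option concrete, here is a minimal sketch using only the
Python standard library. It is not the pandas code linked above; the
`params` dict is just the tensor example from the format docs, and the
variable names are made up for illustration.

```python
import json

# Illustrative parameters for a tensor-like extension type, copied from the
# example in the format docs (not taken from any real implementation).
params = {"type": "int8", "shape": [4, 5]}

# Serialize to the bytes that would be stored as the value of the
# "ARROW:extension:metadata" field.
serialized = json.dumps(params).encode("utf-8")

# A consumer with a JSON library recovers the parameters with the inverse call.
restored = json.loads(serialized.decode("utf-8"))
```

The round trip is trivial in Python, which is exactly why this is the easy
choice there; the cost shows up for consumers without a JSON parser.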

Now, if you have a JSON library available, using JSON for this is indeed
straightforward. But if we want the metadata to also be relatively easy to
parse "by hand", might there be better alternatives?
In https://github.com/paleolimbot/geoarrow/, Dewey has been working on an R
package dealing with some extension types where the core is implemented in
C, and mentioned that dealing with JSON-like metadata would not be trivial
(or at least more complex than what's currently being used there, see
below).

One possible alternative could be to use the format specified in the C
Data Interface for key-value metadata:
https://arrow.apache.org/docs/format/CDataInterface.html#c.ArrowSchema.metadata
There it is used for the actual key-value metadata of a field, while here
it would be used to format a single value. But since for this discussion
the value is also a key-value mapping, the same scheme could be used.
Since this is a binary format, this assumes that the discussion about
allowing binary values in the key-value metadata in the IPC format gets
resolved.
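
As a rough illustration of that layout (an int32 count of pairs, then for
each pair an int32 key length, the key bytes, an int32 value length, and the
value bytes, all in native endianness and without null terminators), here is
a small sketch in Python. The function names are made up, and writing the
shape as the string "4,5" is purely an assumption: the format only carries
flat byte strings, so nested values would need their own convention.

```python
import struct


def encode_kv(pairs):
    """Encode a str -> str mapping in the C Data Interface metadata layout."""
    # "=i" is a 4-byte signed integer in native byte order, without padding.
    chunks = [struct.pack("=i", len(pairs))]
    for key, value in pairs.items():
        key_bytes = key.encode("utf-8")
        value_bytes = value.encode("utf-8")
        chunks.append(struct.pack("=i", len(key_bytes)) + key_bytes)
        chunks.append(struct.pack("=i", len(value_bytes)) + value_bytes)
    return b"".join(chunks)


def decode_kv(buf):
    """Decode the layout produced by encode_kv back into a dict."""
    (n_pairs,) = struct.unpack_from("=i", buf, 0)
    pos = 4
    fields = []
    # Each pair contributes two length-prefixed byte strings: key, then value.
    for _ in range(2 * n_pairs):
        (length,) = struct.unpack_from("=i", buf, pos)
        pos += 4
        fields.append(buf[pos:pos + length].decode("utf-8"))
        pos += length
    return dict(zip(fields[0::2], fields[1::2]))


# Hypothetical flat encoding of the tensor example; "4,5" is an assumption.
encoded = encode_kv({"type": "int8", "shape": "4,5"})
decoded = decode_kv(encoded)
```

The decoder needs nothing beyond reading a length and advancing a pointer,
which is the kind of parsing that stays simple in plain C.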

Thoughts?

Joris
