Hello all,

I wanted to kick-start the process of coming up with a standardized / canonical metadata specification that we can use for describing Arrow data to be moved between systems. This breaks down into at least two distinct kinds of metadata:
1) "Schemas": physical types, logical types, child array types, struct field names, and so forth. A schema does not contain information about the size of the actual physical data (which depends on the length of arrays and the sizes of list/variable-length type dimensions).

2) "Data headers": a description of the shape of a physical chunk of data associated with a particular schema. Array length, null count, memory buffer offsets and sizes, etc. This is the information you need to compute the right pointers into a shared memory region or IPC/RPC buffer and reconstruct Arrow container classes.

Since #2 will depend on some of the details of #1, I suggest we start figuring out #1 first. As far as the type metadata is concerned, to avoid excess bike shedding we should break that problem into:

A) The general layout of the type metadata / schemas

B) The technology we use for representing the schemas (and data headers) in an implementation-independent way for use in an IPC/RPC setting (and even to "store" ephemeral data on disk)

On Item B, from what I've seen with Parquet and similar file formats with embedded metadata, and in the spirit of Arrow's "deserialize-nothing" ethos, I suggest we explore no-deserialization technologies like Google's Flatbuffers (https://github.com/google/flatbuffers) as a more CPU-efficient alternative to Thrift, Protobuf, or Avro. With large schemas, technologies like Thrift can add significant overhead in "needle-in-haystack" problems where you are picking only a few columns out of very wide tables (thousands of columns), since the metadata has to be deserialized before those few columns can be located. It may be best to avoid this if at all possible.

I would like some help stewarding the design process on this from the Arrow PMC, and in particular from those who have worked on the design and implementation of Parquet and other file formats and systems for which Arrow is an immediate intended companion. There is a lot we can learn from those past experiences.

Thank you,
Wes
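
P.S. To make the split between the two kinds of metadata concrete, here is a rough sketch in Python. It is only an illustration under stated assumptions, not a proposed layout: every name in it (Field, Schema, BufferSpec, ArrayHeader, slice_buffers) is hypothetical, and the buffer order (validity bitmap first, then values) and the 8-byte padding are assumptions made for the example.

```python
import struct
from dataclasses import dataclass, field
from typing import List


# --- 1) Schema: types and names only, nothing about data size ---
@dataclass
class Field:
    name: str
    logical_type: str                      # e.g. "int32", "list", "struct"
    children: List["Field"] = field(default_factory=list)


@dataclass
class Schema:
    fields: List[Field]


# --- 2) Data header: the shape of one physical chunk for a schema ---
@dataclass
class BufferSpec:
    offset: int                            # byte offset into the memory region
    size: int                              # byte length of the buffer


@dataclass
class ArrayHeader:
    length: int
    null_count: int
    buffers: List[BufferSpec]
    children: List["ArrayHeader"] = field(default_factory=list)


def slice_buffers(region: memoryview, header: ArrayHeader) -> List[memoryview]:
    """Compute views into the region from the header; no bytes are copied."""
    return [region[b.offset:b.offset + b.size] for b in header.buffers]


# One int32 column with 4 slots, the third of which is null.
schema = Schema(fields=[Field(name="x", logical_type="int32")])

region = bytearray(24)
region[0] = 0b1011                         # validity bitmap: slot 2 is null
struct.pack_into("<4i", region, 8, 1, 2, 0, 4)   # values start at byte 8 (assumed padding)

header = ArrayHeader(
    length=4,
    null_count=1,
    buffers=[BufferSpec(offset=0, size=1), BufferSpec(offset=8, size=16)],
)

validity, values = slice_buffers(memoryview(region), header)
decoded = struct.unpack("<4i", values)
```

The point of the sketch is that the schema can be shared across many chunks, while each chunk carries only a small header from which a reader computes pointers into the shared memory region or IPC/RPC buffer.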