Hello all,

I wanted to kick-start the process of coming up with a standardized /
canonical metadata specification that we can use for describing Arrow
data to be moved between systems. This breaks down into at least two
distinct kinds of metadata:

1) "Schemas": physical types, logical types, child array types, struct
field names, and so forth. Does not contain information about the size
of the actual physical data (which depends on the length of arrays and
the sizes of list/variable-length type dimensions).

2) "Data headers": a description of the shape of a physical chunk of
data associated with a particular schema. Array length, null count,
memory buffer offsets and sizes, etc. This is the information you need
to compute the right pointers into a shared memory region or IPC/RPC
buffer and reconstruct Arrow container classes.
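
To make the split between the two concrete, here is a rough Python sketch.
Everything in it is an assumption made for illustration, not a proposed
format: the names (Field, Schema, BufferRef, ArrayHeader, view_buffers), the
two-buffer layout (validity bitmap, then values), the least-significant-bit
bitmap order, and the little-endian int32 encoding. The schema carries no
sizes at all, while the header carries only lengths, counts, and buffer
locations.

```python
import struct
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Field:
    """Schema-level description: names and types, never sizes."""
    name: str
    type_name: str                       # e.g. "int32", "list", "struct"
    nullable: bool = True
    children: Tuple["Field", ...] = ()   # child types for nested fields


@dataclass(frozen=True)
class Schema:
    fields: Tuple[Field, ...]


@dataclass(frozen=True)
class BufferRef:
    """Location of one memory buffer inside a larger region."""
    offset: int
    size: int


@dataclass(frozen=True)
class ArrayHeader:
    """Shape of one physical chunk of data for a single field."""
    length: int
    null_count: int
    buffers: Tuple[BufferRef, ...]       # assumed order: validity, values


def view_buffers(region, header):
    """Return zero-copy views into `region`, one per buffer in the header."""
    mv = memoryview(region)
    return [mv[b.offset:b.offset + b.size] for b in header.buffers]


# Hypothetical example: one nullable int32 column with 4 slots, 1 of them null.
schema = Schema(fields=(Field("x", "int32"),))

region = bytearray(64)                   # stands in for a shared memory region
region[0] = 0b00001011                   # slots 0, 1, 3 valid; slot 2 null
struct.pack_into("<4i", region, 8, 10, 20, 0, 40)

header = ArrayHeader(
    length=4,
    null_count=1,
    buffers=(BufferRef(offset=0, size=1), BufferRef(offset=8, size=16)),
)

validity, values = view_buffers(region, header)
decoded = struct.unpack_from("<%di" % header.length, values)
is_valid = [bool((validity[i // 8] >> (i % 8)) & 1) for i in range(header.length)]
```

The same schema object could be reused for any number of headers, which is
the main reason to keep the two kinds of metadata separate.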

Since #2 will depend on some of the details of #1, I suggest we start
figuring out #1 first. As far as the type metadata is concerned, to
avoid excess bike shedding we should break that problem into:

A) The general layout of the type metadata / schemas
B) The technology we use for representing the schemas (and data
headers) in an implementation-independent way for use in an IPC/RPC
setting (and even to "store" ephemeral data on disk)

On item B, from what I've seen with Parquet and similar file formats with
embedded metadata, and in the spirit of Arrow's "deserialize-nothing"
ethos, I suggest we explore no-deserialization technologies like
Google's FlatBuffers (https://github.com/google/flatbuffers) as a more
CPU-efficient alternative to Thrift, Protobuf, or Avro. In large
schemas, technologies like Thrift can result in significant overhead
in "needle-in-haystack" problems where you are picking only a few
columns out of very wide tables (thousands of columns), and it may be
best to try to avoid this if at all possible.
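
As a toy illustration of the "needle-in-haystack" point, the Python sketch
below stores field records behind a table of offsets, so that reading one
field touches only that field's bytes. The byte layout and the function
names (encode_schema, read_field) are made up for this example and are not
FlatBuffers' actual wire format; the sketch only shows the general
offset-table idea that makes lookup without a full deserialization pass
possible.

```python
import struct


def encode_schema(fields):
    """Encode (name, type_code) pairs as: count, offset table, records.

    Hypothetical layout, little-endian:
      uint32 count | count * uint32 offsets | records
    Each record is: uint16 name_len, uint8 type_code, UTF-8 name bytes.
    """
    n = len(fields)
    pos = 4 + 4 * n                      # records start after the table
    offsets, records = [], []
    for name, type_code in fields:
        raw = name.encode("utf-8")
        rec = struct.pack("<HB", len(raw), type_code) + raw
        offsets.append(pos)
        records.append(rec)
        pos += len(rec)
    table = struct.pack("<I", n) + struct.pack("<%dI" % n, *offsets)
    return table + b"".join(records)


def read_field(buf, i):
    """Read field i directly via the offset table, skipping all others."""
    (n,) = struct.unpack_from("<I", buf, 0)
    if not 0 <= i < n:
        raise IndexError(i)
    (off,) = struct.unpack_from("<I", buf, 4 + 4 * i)
    name_len, type_code = struct.unpack_from("<HB", buf, off)
    start = off + 3                      # 3 bytes = uint16 + uint8 header
    name = bytes(buf[start:start + name_len]).decode("utf-8")
    return name, type_code


# A very wide table: 5000 columns, of which we want exactly one.
wide = [("col%d" % i, i % 5) for i in range(5000)]
buf = encode_schema(wide)
needle = read_field(buf, 4321)
```

The cost of read_field does not depend on how many columns the schema has,
whereas a format that must be decoded front to back pays for every column
before it can return the one you asked for.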

I would like some help stewarding the design process on this from the
Arrow PMC and in particular those who have worked on the design and
implementation of Parquet and other file formats and systems for which
Arrow is an immediate intended companion. There is a lot we can learn
from those past experiences.

Thank you,
Wes
