Wes, thanks for starting this conversation. Couple thoughts:
For metadata, we have two existing models (one in the ValueVectors approach and one in Parquet). It seems like we should start from one of those and then shape it as appropriate. We have a richer physical capability than the core Dremel algorithm that Parquet implements, so I think it would make sense to focus first on the logical model and then figure out the shared physical representation that exists below it.

While the Data Headers item (2) in your description may come logically second, I think it greatly informs 1.B, since I believe 2 is something that should have a canonical in-memory representation (similar to the vectors themselves). I know Steven has been looking at moving the Java layer over to serializing the data headers using something similar to this:

Data headers use a deterministic pre-order "tree" ordering of the memory buffers (https://en.wikipedia.org/wiki/Tree_traversal). The data structure is simply an array of data headers, each consisting of a list of buffer offsets and sizes. For example, consider this schema:

    List<Struct<String=List<UInt8>, Int32>>

The pre-order buffer order is:

    0: top-level list nulls
    1: list offsets
    2: struct field 0 nulls
    3: struct field 0 list offsets
    4: struct field 0 inner UInt8 values
    5: struct field 1 nulls
    6: struct field 1 Int32 values

(A small Java sketch of this traversal appears after the schema below.) The flatbuffer schema for the data header would then be:

    namespace DataHeaders;

    struct Buffer {
      data: long;
      length: int;
    }

    // Representing a single array (aka ValueVector), typically
    table BufferList {
      // With FBS it is not possible to know the length of an array
      n_buffers: int;
      buffers: [Buffer];
    }

    // Multiple arrays -- could be used for long arrays or a
    // whole table row batch
    table ArrayBatch {
      n_arrays: int;
      arrays: [BufferList];
    }
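To make the buffer ordering concrete, here is a minimal Java sketch of the pre-order collection. To be clear, BufferRef and VectorNode are hypothetical types for illustration only, not classes from the actual Java implementation:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical types for illustration only; the real Java layer
    // has its own vector classes.
    final class BufferRef {
        final long offset;  // address/offset into a shared memory region
        final int length;   // size in bytes
        BufferRef(long offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    final class VectorNode {
        final List<BufferRef> ownBuffers;  // e.g. nulls, then offsets, then values
        final List<VectorNode> children;   // list child / struct fields, in order
        VectorNode(List<BufferRef> ownBuffers, List<VectorNode> children) {
            this.ownBuffers = ownBuffers;
            this.children = children;
        }
    }

    final class PreOrderBuffers {
        // Pre-order traversal: emit a node's own buffers first, then
        // recurse into each child in declaration order. Applied to
        // List<Struct<String=List<UInt8>, Int32>>, this yields exactly
        // the 0..6 ordering above.
        static List<BufferRef> collect(VectorNode root) {
            List<BufferRef> out = new ArrayList<>();
            walk(root, out);
            return out;
        }

        private static void walk(VectorNode node, List<BufferRef> out) {
            out.addAll(node.ownBuffers);
            for (VectorNode child : node.children) {
                walk(child, out);
            }
        }
    }

The nice property of a fixed, deterministic ordering is that the header itself never has to name the buffers; both sides reconstruct the mapping from the schema alone.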
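And, as a sanity check on ergonomics, a sketch of writing and reading such a data header through Java bindings as flatc would generate them from the schema above. The class and method names assume flatc's standard Java codegen conventions, and the offsets and sizes are made-up example values:

    import java.nio.ByteBuffer;
    import com.google.flatbuffers.FlatBufferBuilder;
    // Assumed to be generated by `flatc --java` from the schema above.
    import DataHeaders.ArrayBatch;
    import DataHeaders.Buffer;
    import DataHeaders.BufferList;

    public class DataHeaderRoundTrip {
        // Writes a one-array batch whose array has two buffers
        // (offsets and sizes are made-up example values).
        static ByteBuffer write() {
            FlatBufferBuilder fbb = new FlatBufferBuilder();

            // FlatBuffers serializes vectors of structs inline and
            // back to front, so the last buffer is written first.
            BufferList.startBuffersVector(fbb, 2);
            Buffer.createBuffer(fbb, 4096L, 128);  // buffer 1
            Buffer.createBuffer(fbb, 0L, 64);      // buffer 0
            int buffers = fbb.endVector();

            BufferList.startBufferList(fbb);
            BufferList.addNBuffers(fbb, 2);
            BufferList.addBuffers(fbb, buffers);
            int bufferList = BufferList.endBufferList(fbb);

            int arrays = ArrayBatch.createArraysVector(fbb, new int[] { bufferList });

            ArrayBatch.startArrayBatch(fbb);
            ArrayBatch.addNArrays(fbb, 1);
            ArrayBatch.addArrays(fbb, arrays);
            fbb.finish(ArrayBatch.endArrayBatch(fbb));
            return fbb.dataBuffer();
        }

        // Reads fields directly out of the buffer -- no deserialization
        // step, which is the property we want for wide schemas.
        static void read(ByteBuffer buf) {
            ArrayBatch batch = ArrayBatch.getRootAsArrayBatch(buf);
            Buffer b = batch.arrays(0).buffers(1);
            System.out.println(b.data() + " / " + b.length());  // 4096 / 128
        }

        public static void main(String[] args) {
            read(write());
        }
    }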
On Mon, Feb 29, 2016 at 6:13 PM, Wes McKinney <w...@cloudera.com> wrote:
> hello all,
>
> I wanted to kick-start the process of coming up with a standardized /
> canonical metadata specification that we can use for describing Arrow
> data to be moved between systems. This breaks down into at least two
> distinct kinds of metadata:
>
> 1) "Schemas": physical types, logical types, child array types, struct
> field names, and so forth. Does not contain information about the size
> of the actual physical data (which depends on the length of arrays and
> the sizes of list/variable-length type dimensions).
>
> 2) "Data headers": a description of the shape of a physical chunk of
> data associated with a particular schema. Array length, null count,
> memory buffer offsets and sizes, etc. This is the information you need
> to compute the right pointers into a shared memory region or IPC/RPC
> buffer and reconstruct Arrow container classes.
>
> Since #2 will depend on some of the details of #1, I suggest we start
> figuring out #1 first. As far as the type metadata is concerned, to
> avoid excess bikeshedding we should break that problem into:
>
> A) The general layout of the type metadata / schemas
>
> B) The technology we use for representing the schemas (and data
> headers) in an implementation-independent way for use in an IPC/RPC
> setting (and even to "store" ephemeral data on disk)
>
> On Item B, from what I've seen with Parquet and such file formats with
> embedded metadata, and in the spirit of Arrow's "deserialize-nothing"
> ethos, I suggest we explore no-deserialization technologies like
> Google's Flatbuffers (https://github.com/google/flatbuffers) as a more
> CPU-efficient alternative to Thrift, Protobuf, or Avro. In large
> schemas, technologies like Thrift can result in significant overhead in
> "needle-in-haystack" problems where you are picking only a few columns
> out of very wide tables (> 1000s of columns), and it may be best to
> avoid this if at all possible.
>
> I would like some help stewarding the design process on this from the
> Arrow PMC, and in particular from those who have worked on the design
> and implementation of Parquet and other file formats and systems for
> which Arrow is an immediate intended companion. There is a lot we can
> learn from those past experiences.
>
> Thank you,
> Wes