Hi, sorry if this is a n00b question. For the example that Jacques used in the previous thread, how does it work if the struct is nullable? Shouldn't there be an is_null array between 1 and 2?
For instance, with the example:

list elem #0: { <f0: aaa, f1: 42>, null, <f0: null, f1: 28> }
list elem #1: { <f0: bbb, f1: null> }
list elem #2: null
list elem #3: { <f0: null, f1: null> }

what does the encoding look like? I'm thinking (0 = false, 1 = true in
is_null arrays):

0 (list is_null):            0 0 1 0
1 (list offset):             0 3 4 4 5
2 (struct is_null):          0 1 0 0 0
3 (struct field f0 is_null): 0 1 0 1
4 (struct field f0 offset):  0 3 3 6 6
5 (struct field f0 value):   a a a b b b
6 (struct field f1 is_null): 1 1 0 0
7 (struct field f1 value):   42 28

Let me know if this is wrong. Thanks!

Best,
Chao

On Tue, Mar 1, 2016 at 1:13 PM, Wes McKinney <w...@cloudera.com> wrote:
> Inline responses
>
> On Tue, Mar 1, 2016 at 8:24 AM, Jacques Nadeau <jacq...@apache.org> wrote:
> > Wes, thanks for starting this conversation.
> >
> > A couple of thoughts:
> >
> > For metadata, we have two existing models (one in the ValueVectors
> > approach and one in Parquet). It seems like we should start from one of
> > those and then shape it as appropriate. It seems like we have a richer
> > physical capability than the core Dremel algorithm that Parquet
> > implements, so I think it would make sense to focus first on the
> > logical model and then figure out the shared physical model that exists
> > below that.
> >
> > While the Data Headers item (2) in your description may come logically
> > second, I think that it greatly informs 1.B, as I believe 2 is
> > something that should be an in-memory canonical representation (similar
> > to the vectors themselves). I know Steven has been looking at moving
> > the Java layer over to serialize the data headers using something
> > similar to this:
> >
> > Data headers use a deterministic pre-order "tree" ordering of the
> > memory buffers (https://en.wikipedia.org/wiki/Tree_traversal). The data
> > structures are simply an array of data headers consisting of a list of
> > buffer offsets and sizes.
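[Editor's sketch] A quick way to sanity-check the encoding proposed at the top of this message is to derive the buffers mechanically. The sketch below is plain Python, not Arrow code; the NULL sentinel and the assumption that a null struct contributes no entry to its children's arrays are assumptions of the sketch, not something the thread specifies:

```python
# Derive flat buffers for List<Struct<f0: String, f1: Int32>> from the
# example values. NULL is a stand-in sentinel for a null value.
NULL = object()

data = [
    [("aaa", 42), NULL, (NULL, 28)],  # list elem #0
    [("bbb", NULL)],                  # list elem #1
    NULL,                             # list elem #2
    [(NULL, NULL)],                   # list elem #3
]

# Buffers 0/1: top-level list validity (1 = is_null) and offsets.
list_is_null = [1 if e is NULL else 0 for e in data]
list_offsets, structs = [0], []
for e in data:
    if e is not NULL:
        structs.extend(e)
    list_offsets.append(len(structs))

# Buffer 2: struct validity over the flattened child structs.
struct_is_null = [1 if s is NULL else 0 for s in structs]

# Assumption: a null struct is not materialized in the child arrays.
present = [s for s in structs if s is not NULL]

# Buffers 3-5: f0 validity, offsets, and concatenated character data.
f0_is_null = [1 if f0 is NULL else 0 for f0, _ in present]
f0_offsets, f0_values = [0], ""
for f0, _ in present:
    if f0 is not NULL:
        f0_values += f0
    f0_offsets.append(len(f0_values))

# Buffers 6-7: f1 validity and non-null values.
f1_is_null = [1 if f1 is NULL else 0 for _, f1 in present]
f1_values = [f1 for _, f1 in present if f1 is not NULL]

print(list_is_null, list_offsets, struct_is_null)
print(f0_is_null, f0_offsets, f0_values)
print(f1_is_null, f1_values)
```

Under these assumptions this reproduces buffers 0 through 5 and the f1 values exactly as listed above; for buffer 6, though, it yields 0 0 1 1 rather than 1 1 0 0, so with the stated 1 = true convention the f1 is_null entries in the proposal appear to be inverted.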
> This makes sense, and is consistent with Parquet's depth-first
> (pre-order) flattening of nested schemas into a flat list<SchemaElement>
> (https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L556).
>
> > For example, consider this schema:
> >
> > List<Struct<String=List<UInt8>, Int32>>
> >
> > the pre-order buffer order is
> >
> > 0: nulls top level list
> > 1: list offsets
> > 2: struct field 0 nulls
> > 3: struct field 0 list offsets
> > 4: struct field 0 inner UInt8 values
> > 5: struct field 1 nulls
> > 6: struct field 1 Int32 values
>
> Regarding the buffers, we would want to embed the null count for each
> logical array in some way, so that the null bitmap can be omitted.
> This also spares having to compute the actual null counts when
> receiving a row batch.
>
> For example, for a List<T> we could have a couple of cases:
>
> List
> - null_count > 0 -> 2 buffers, 1 for the null bitmap, 1 for the list offsets
> - null_count == 0 -> 1 buffer for the list offsets
>
> Recursively, the buffers for the type T child data would similarly
> have their own null counts, determining the actual number of memory
> buffers. Let me know what you think.
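[Editor's sketch] The pre-order buffer listing quoted above can be reproduced with a small recursive walk. The type classes below are toy illustrations, not an Arrow API. Two properties of the listing are baked into the sketch: the struct contributes no validity buffer of its own (which is exactly the gap Chao's question points at), and the inner UInt8 of the string is treated as non-nullable:

```python
# Toy type tree for sketching depth-first (pre-order) buffer enumeration.
class ListType:
    def __init__(self, child):
        self.child = child

class StructType:
    def __init__(self, fields):
        self.fields = fields  # list of (name, child_type) pairs

class PrimitiveType:
    def __init__(self, name, nullable=True):
        self.name, self.nullable = name, nullable

def buffers(typ):
    """Yield buffer kinds in pre-order, mirroring the listing above."""
    if isinstance(typ, ListType):
        yield "nulls"      # validity bitmap for the list
        yield "offsets"    # offsets into the child array
        yield from buffers(typ.child)
    elif isinstance(typ, StructType):
        # As in the listing, the struct carries no buffer of its own here;
        # each field brings its own validity bitmap.
        for _name, child in typ.fields:
            yield from buffers(child)
    else:
        if typ.nullable:
            yield "nulls"
        yield "values"

# List<Struct<String=List<UInt8>, Int32>>, modeling String as List<UInt8>
schema = ListType(StructType([
    ("field 0", ListType(PrimitiveType("UInt8", nullable=False))),
    ("field 1", PrimitiveType("Int32")),
]))

print(list(buffers(schema)))
# seven buffers, in the same order as items 0-6 in the listing above
```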
> > The flatbuffer schema for the data header would then be:
> >
> > namespace DataHeaders;
> >
> > struct Buffer {
> >   data: long;
> >   length: int;
> > }
> >
> > // Representing a single array (aka ValueVector), typically
> > table BufferList {
> >   // With FBS it is not possible to know the length of an array
> >   n_buffers: int;
> >   buffers: [Buffer];
> > }
> >
> > // Multiple arrays -- could be used for long arrays or a
> > // whole table row batch
> > table ArrayBatch {
> >   n_arrays: int;
> >   arrays: [BufferList];
> > }
>
> I've been tinkering with flatbuffers, and array lengths are indeed
> available in the generated IDL (see flatbuffers::Vector::size in the
> C++ API, for example), so the n_buffers / n_arrays fields aren't
> needed.
>
> Per the discussion above re: null counts, something like this may work:
>
> struct Buffer {
>   data: long;
>   length: int;
> }
>
> table ArrayNode {
>   length: int;
>   null_count: int;
>
>   /// The number of buffers would be inferred from the node type and
>   /// actual null_count
>   buffers: [Buffer];
> }
>
> table RowBatch {
>   /// Nodes correspond to the depth-first flattened logical schema
>   nodes: [ArrayNode];
> }
>
> So for an example schema:
>
> row batch schema {
>   f0: List<List<UInt8>>,
>   f1: Int32
> }
>
> The flattened version of this schema contains 4 ArrayNodes, 3 for the
> nested f0 column and 1 for the flat f1 column.
>
> - Wes
>
> > On Mon, Feb 29, 2016 at 6:13 PM, Wes McKinney <w...@cloudera.com> wrote:
> >
> >> hello all,
> >>
> >> I wanted to kick-start the process of coming up with a standardized /
> >> canonical metadata specification that we can use for describing Arrow
> >> data to be moved between systems. This breaks down into at least two
> >> distinct kinds of metadata:
> >>
> >> 1) "Schemas": physical types, logical types, child array types, struct
> >> field names, and so forth.
> >> This does not contain information about the size
> >> of the actual physical data (which depends on the length of arrays and
> >> the sizes of list/variable-length type dimensions).
> >>
> >> 2) "Data headers": a description of the shape of a physical chunk of
> >> data associated with a particular schema. Array length, null count,
> >> memory buffer offsets and sizes, etc. This is the information you need
> >> to compute the right pointers into a shared memory region or IPC/RPC
> >> buffer and reconstruct Arrow container classes.
> >>
> >> Since #2 will depend on some of the details of #1, I suggest we start
> >> figuring out #1 first. As far as the type metadata is concerned, to
> >> avoid excess bike-shedding we should break that problem into:
> >>
> >> A) The general layout of the type metadata / schemas
> >>
> >> B) The technology we use for representing the schemas (and data
> >> headers) in an implementation-independent way for use in an IPC/RPC
> >> setting (and even to "store" ephemeral data on disk)
> >>
> >> On item B, from what I've seen with Parquet and similar file formats
> >> with embedded metadata, and in the spirit of Arrow's
> >> "deserialize-nothing" ethos, I suggest we explore no-deserialization
> >> technologies like Google's Flatbuffers
> >> (https://github.com/google/flatbuffers) as a more CPU-efficient
> >> alternative to Thrift, Protobuf, or Avro. In large schemas,
> >> technologies like Thrift can result in significant overhead in
> >> "needle-in-haystack" problems where you are picking only a few columns
> >> out of very wide tables (> 1000s of columns), and it may be best to
> >> try to avoid this if at all possible.
> >>
> >> I would like some help stewarding the design process on this from the
> >> Arrow PMC, and in particular from those who have worked on the design
> >> and implementation of Parquet and other file formats and systems for
> >> which Arrow is an immediate intended companion. There is a lot we can
> >> learn from those past experiences.
> >>
> >> Thank you,
> >> Wes
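[Editor's sketch] Returning to Wes's RowBatch example earlier in the thread: the claim that the flattened schema { f0: List<List<UInt8>>, f1: Int32 } produces 4 ArrayNodes, and the null_count-based buffer inference for lists, can both be checked with a short sketch. The tuple encoding of types is the editor's own toy notation, not the real flatbuffers IDL:

```python
# Types encoded as tuples: ("list", child), ("struct", [children]),
# or ("prim", name).
def count_nodes(typ):
    """Number of ArrayNodes a type contributes to the depth-first flattening."""
    kind = typ[0]
    if kind == "list":
        return 1 + count_nodes(typ[1])
    if kind == "struct":
        return 1 + sum(count_nodes(child) for child in typ[1])
    return 1  # primitive: one node

def list_buffer_count(null_count):
    """Per the List<T> example: the validity bitmap is omitted when
    null_count == 0, leaving only the offsets buffer."""
    return 2 if null_count > 0 else 1

f0 = ("list", ("list", ("prim", "UInt8")))  # List<List<UInt8>> -> 3 nodes
f1 = ("prim", "Int32")                      # Int32 -> 1 node
print(count_nodes(f0) + count_nodes(f1))    # 4 ArrayNodes in the row batch
```

Whether a struct node counts itself plus its children (as `count_nodes` assumes) is not spelled out in the thread; only the list and primitive cases are taken directly from Wes's example.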