Thanks Wes. This makes sense.

Best,
Chao
On Sun, Mar 6, 2016 at 2:01 PM, Wes McKinney <w...@cloudera.com> wrote:
> Chao -- yes, I believe you're right.
>
> In an IPC message, if the node is required or has null_count == 0, then
> the null bitmap can be omitted from the payload, but otherwise it's
> there (this is similar to the Hive HS2 V6 protocol).
>
> One alternative to this is that rather than omitting the bitmap, its
> data header metadata can be included but with a buffer size of zero.
>
> - Wes
>
> On Sun, Mar 6, 2016 at 12:33 PM, Chao Sun <sunc...@apache.org> wrote:
> > Hi,
> >
> > Sorry if this is a n00b question. For the example that Jacques used in
> > the previous thread, how does it work if the struct is nullable --
> > shouldn't there be an is_null array between 1 and 2?
> >
> > For instance, with the example:
> >
> > list elem #0: { <f0: aaa, f1: 42>, null, <f0: null, f1: 28> }
> > list elem #1: { <f0: bbb, f1: null> }
> > list elem #2: null
> > list elem #3: { <f0: null, f1: null> }
> >
> > what does the encoding look like?
> >
> > I'm thinking (0 = false, 1 = true in is_null arrays):
> >
> > 0 (list is_null): 0 0 1 0
> > 1 (list offset): 0 3 4 4 5
> > 2 (struct is_null): 0 1 0 0 0
> > 3 (struct field f0 is_null): 0 1 0 1
> > 4 (struct field f0 offset): 0 3 3 6 6
> > 5 (struct field f0 value): a a a b b b
> > 6 (struct field f1 is_null): 1 1 0 0
> > 7 (struct field f1 value): 42 28
> >
> > Let me know if this is wrong. Thanks!
> >
> > Best,
> > Chao
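To make the omission rule at the top of this thread concrete, here is a
minimal C++ sketch of the implied buffer counting: a required node, or
one with null_count == 0, ships no null bitmap. The TypeTag enum and the
ExpectedBufferCount function are hypothetical illustrations, not part of
any Arrow implementation.

#include <cstdint>
#include <iostream>

// Hypothetical type tags, for illustration only.
enum class TypeTag { Primitive, String, List, Struct };

// Number of buffers an IPC payload would carry for a single array node,
// assuming a required node, or one with null_count == 0, omits its null
// bitmap. Child nodes (list elements, struct fields) carry their own
// buffers and are counted separately.
int ExpectedBufferCount(TypeTag tag, bool required, int64_t null_count) {
  int n = (!required && null_count > 0) ? 1 : 0;  // null bitmap
  switch (tag) {
    case TypeTag::Primitive: return n + 1;  // values
    case TypeTag::String:    return n + 2;  // offsets + character data
    case TypeTag::List:      return n + 1;  // offsets only
    case TypeTag::Struct:    return n;      // children hold all the data
  }
  return n;
}

int main() {
  // Chao's outer list has one null (elem #2): bitmap + offsets = 2.
  std::cout << ExpectedBufferCount(TypeTag::List, false, 1) << std::endl;
  // A nullable Int32 node with no nulls would ship its values only.
  std::cout << ExpectedBufferCount(TypeTag::Primitive, false, 0) << std::endl;
}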
> >
> > On Tue, Mar 1, 2016 at 1:13 PM, Wes McKinney <w...@cloudera.com> wrote:
> >
> >> Inline responses
> >>
> >> On Tue, Mar 1, 2016 at 8:24 AM, Jacques Nadeau <jacq...@apache.org> wrote:
> >> > Wes, thanks for starting this conversation.
> >> >
> >> > A couple of thoughts:
> >> >
> >> > For metadata, we have two existing models (one in the ValueVectors
> >> > approach and one in Parquet). It seems like we should start from one
> >> > of those and then shape it as appropriate. We have a richer physical
> >> > capability than the core Dremel algorithm that Parquet implements, so
> >> > I think it would make sense to focus first on the logical model and
> >> > then figure out the shared physical representation below it.
> >> >
> >> > While the Data Headers item (2) in your description may come
> >> > logically second, I think it greatly informs 1.B, as I believe 2 is
> >> > something that should be an in-memory canonical representation
> >> > (similar to the vectors themselves). I know Steven has been looking
> >> > at moving the Java layer over to serialize the data headers using
> >> > something similar to this:
> >> >
> >> > Data headers use a deterministic pre-order "tree" ordering of the
> >> > memory buffers (https://en.wikipedia.org/wiki/Tree_traversal). The
> >> > data structures are simply an array of data headers consisting of a
> >> > list of buffer offsets and sizes.
> >>
> >> This makes sense, and is consistent with Parquet's depth-first
> >> (pre-order) flattening of nested schemas into a flat
> >> list<SchemaElement>
> >> (https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L556).
> >>
> >> > For example, consider this schema:
> >> >
> >> > List<Struct<String=List<UInt8>, Int32>>
> >> >
> >> > The pre-order buffer order is:
> >> >
> >> > 0: top-level list nulls
> >> > 1: list offsets
> >> > 2: struct field 0 nulls
> >> > 3: struct field 0 list offsets
> >> > 4: struct field 0 inner UInt8 values
> >> > 5: struct field 1 nulls
> >> > 6: struct field 1 Int32 values
> >>
> >> Regarding the buffers, we would want to embed the null count for each
> >> logical array in some way, so that the null bitmap can be omitted.
> >> This also spares having to compute the actual null counts when
> >> receiving a row batch.
> >>
> >> For example, for a List<T> we could have a couple of cases:
> >>
> >> List
> >> - null_count > 0 -> 2 buffers: 1 for the null bitmap, 1 for the list
> >>   offsets
> >> - null_count == 0 -> 1 buffer for the list offsets
> >>
> >> Recursively, the child data of type T would similarly have its own
> >> null count, determining the actual number of memory buffers. Let me
> >> know what you think.
> >>
> >> > The flatbuffer schema for the data header would then be:
> >> >
> >> > namespace DataHeaders;
> >> >
> >> > struct Buffer {
> >> >   data: long;
> >> >   length: int;
> >> > }
> >> >
> >> > // Typically represents a single array (aka ValueVector)
> >> > table BufferList {
> >> >   // With FBS it is not possible to know the length of an array
> >> >   n_buffers: int;
> >> >   buffers: [Buffer];
> >> > }
> >> >
> >> > // Multiple arrays -- could be used for long arrays or a
> >> > // whole table row batch
> >> > table ArrayBatch {
> >> >   n_arrays: int;
> >> >   arrays: [BufferList];
> >> > }
> >>
> >> I've been tinkering with flatbuffers, and array lengths are indeed
> >> available in the generated code (see flatbuffers::Vector::size in the
> >> C++ API, for example), so the n_buffers / n_arrays fields aren't
> >> needed.
> >>
> >> Per the discussion above re: null counts, something like this may work:
> >>
> >> struct Buffer {
> >>   data: long;
> >>   length: int;
> >> }
> >>
> >> table ArrayNode {
> >>   length: int;
> >>   null_count: int;
> >>
> >>   /// The number of buffers is inferred from the node type and the
> >>   /// actual null_count
> >>   buffers: [Buffer];
> >> }
> >>
> >> table RowBatch {
> >>   /// Nodes correspond to the depth-first flattened logical schema
> >>   nodes: [ArrayNode];
> >> }
> >>
> >> So in an example schema:
> >>
> >> row batch schema {
> >>   f0: List<List<UInt8>>,
> >>   f1: Int32
> >> }
> >>
> >> the flattened version of this schema contains 4 ArrayNodes: 3 for the
> >> nested f0 column, and 1 for the flat f1 column.
> >>
> >> - Wes
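As a quick check of the flattening rule, the 4-ArrayNode example above
can be reproduced with a short pre-order traversal. A minimal C++
sketch, assuming a hypothetical Field type (this is not an Arrow or
FlatBuffers API):

#include <iostream>
#include <string>
#include <vector>

// Hypothetical logical schema node: List<T> has one child, Struct<...>
// has one child per field, primitive types have none.
struct Field {
  std::string name;
  std::vector<Field> children;
};

// Emit nodes depth-first in pre-order: parent first, then children left
// to right, matching the proposed ArrayNode ordering.
void Flatten(const Field& field, std::vector<std::string>* out) {
  out->push_back(field.name);
  for (const Field& child : field.children) {
    Flatten(child, out);
  }
}

int main() {
  // row batch schema { f0: List<List<UInt8>>, f1: Int32 }
  std::vector<Field> schema = {
      {"f0: List", {{"f0 item: List", {{"f0 item item: UInt8", {}}}}}},
      {"f1: Int32", {}}};

  std::vector<std::string> nodes;
  for (const Field& field : schema) {
    Flatten(field, &nodes);
  }
  // Prints 4 nodes: three for the nested f0 column, one for f1.
  for (const std::string& name : nodes) {
    std::cout << name << std::endl;
  }
}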
> >> >
> >> > On Mon, Feb 29, 2016 at 6:13 PM, Wes McKinney <w...@cloudera.com> wrote:
> >> >
> >> >> hello all,
> >> >>
> >> >> I wanted to kick-start the process of coming up with a standardized
> >> >> / canonical metadata specification that we can use for describing
> >> >> Arrow data to be moved between systems. This breaks down into at
> >> >> least two distinct kinds of metadata:
> >> >>
> >> >> 1) "Schemas": physical types, logical types, child array types,
> >> >> struct field names, and so forth. Does not contain information about
> >> >> the size of the actual physical data (which depends on the length of
> >> >> arrays and the sizes of list/variable-length type dimensions).
> >> >>
> >> >> 2) "Data headers": a description of the shape of a physical chunk of
> >> >> data associated with a particular schema -- array length, null
> >> >> count, memory buffer offsets and sizes, etc. This is the information
> >> >> you need to compute the right pointers into a shared memory region
> >> >> or IPC/RPC buffer and reconstruct Arrow container classes.
> >> >>
> >> >> Since #2 will depend on some of the details of #1, I suggest we
> >> >> start by figuring out #1 first. As far as the type metadata is
> >> >> concerned, to avoid excess bikeshedding we should break that problem
> >> >> into:
> >> >>
> >> >> A) The general layout of the type metadata / schemas
> >> >>
> >> >> B) The technology we use for representing the schemas (and data
> >> >> headers) in an implementation-independent way for use in an IPC/RPC
> >> >> setting (and even to "store" ephemeral data on disk)
> >> >>
> >> >> On item B, from what I've seen with Parquet and similar file formats
> >> >> with embedded metadata, and in the spirit of Arrow's
> >> >> "deserialize-nothing" ethos, I suggest we explore no-deserialization
> >> >> technologies like Google's Flatbuffers
> >> >> (https://github.com/google/flatbuffers) as a more CPU-efficient
> >> >> alternative to Thrift, Protobuf, or Avro. In large schemas,
> >> >> technologies like Thrift can result in significant overhead in
> >> >> "needle-in-haystack" problems where you are picking only a few
> >> >> columns out of very wide tables (> 1000s of columns), and it may be
> >> >> best to avoid this if at all possible.
> >> >>
> >> >> I would like some help stewarding the design process on this from
> >> >> the Arrow PMC, and in particular from those who have worked on the
> >> >> design and implementation of Parquet and other file formats and
> >> >> systems for which Arrow is an immediate intended companion. There is
> >> >> a lot we can learn from those past experiences.
> >> >>
> >> >> Thank you,
> >> >> Wes
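Since the data headers in item 2 are ultimately about reconstructing
pointers, here is a minimal C++ sketch of what "compute the right
pointers into a shared memory region" could look like, reusing the
(offset, length) Buffer shape from the flatbuffer sketches earlier in
the thread. All names here are hypothetical, not a finalized Arrow API.

#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

// Mirrors the Buffer entries discussed above: a byte offset into a
// shared region plus a byte length.
struct Buffer {
  int64_t offset;
  int32_t length;
};

// Resolving a Buffer against the base of a mapped region is pointer
// arithmetic only -- no copying and no deserialization.
const uint8_t* Resolve(const uint8_t* region_base, const Buffer& buffer) {
  return region_base + buffer.offset;
}

int main() {
  std::vector<uint8_t> region(64, 0);  // stand-in for a mmap'd segment

  // Pretend an Int32 value of 42 lives at byte offset 16.
  int32_t value = 42;
  std::memcpy(region.data() + 16, &value, sizeof(value));

  Buffer values_buffer{16, 4};
  const uint8_t* ptr = Resolve(region.data(), values_buffer);
  int32_t out;
  std::memcpy(&out, ptr, sizeof(out));
  std::cout << out << std::endl;  // 42
}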