Hi, sorry if this is a n00b question. For the example that Jacques used in the previous thread, how does it work if the struct is nullable? Shouldn't there be an is_null array between 1 and 2?
For instance, with the example:

list elem #0: { <f0: aaa, f1: 42>, null, <f0: null, f1: 28> }
list elem #1: { <f0: bbb, f1: null> }
list elem #2: null
list elem #3: { <f0: null, f1: null> }

what does the encoding look like? I'm thinking (0 = false, 1 = true in
is_null arrays):

0 (list is_null):            0 0 1 0
1 (list offset):             0 3 4 4 5
2 (struct is_null):          0 1 0 0 0
3 (struct field f0 is_null): 0 1 0 1
4 (struct field f0 offset):  0 3 3 6 6
5 (struct field f0 value):   a a a b b b
6 (struct field f1 is_null): 1 1 0 0
7 (struct field f1 value):   42 28

Let me know if this is wrong. Thanks!

Best,
Chao

On Tue, Mar 1, 2016 at 1:13 PM, Wes McKinney <w...@cloudera.com> wrote:
> Inline responses
>
> On Tue, Mar 1, 2016 at 8:24 AM, Jacques Nadeau <jacq...@apache.org> wrote:
> > Wes, thanks for starting this conversation.
> >
> > A couple of thoughts:
> >
> > For metadata, we have two existing models (one in the ValueVectors
> > approach and one in Parquet). It seems like we should start from one of
> > those and then shape it as appropriate. It seems like we have a richer
> > physical capability than the core Dremel algorithm that Parquet
> > implements, so I think it would make sense to focus first on the
> > logical model and then figure out the shared physical model that exists
> > below that.
> >
> > While the Data Headers item (2) in your description may come logically
> > second, I think that it greatly informs 1.B, as I believe 2 is
> > something that should be an in-memory canonical representation (similar
> > to the vectors themselves). I know Steven has been looking at moving
> > the Java layer over to serialize the data headers using something
> > similar to this:
> >
> > Data headers use a deterministic pre-order "tree" ordering of the
> > memory buffers (https://en.wikipedia.org/wiki/Tree_traversal). The data
> > structures are simply an array of data headers consisting of a list of
> > buffer offsets and sizes.
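[Editor's sketch] A quick way to sanity-check the encoding proposed at the top of this message is to derive the buffers mechanically. The sketch below is plain Python, not Arrow code; the NULL sentinel and the assumption that a null struct contributes no entry to its children's arrays are assumptions of the sketch, not something the thread specifies:

```python
# Derive flat buffers for List<Struct<f0: String, f1: Int32>> from the
# example values. NULL is a stand-in sentinel for a null value.
NULL = object()

data = [
    [("aaa", 42), NULL, (NULL, 28)],  # list elem #0
    [("bbb", NULL)],                  # list elem #1
    NULL,                             # list elem #2
    [(NULL, NULL)],                   # list elem #3
]

# Buffers 0/1: top-level list validity (1 = is_null) and offsets.
list_is_null = [1 if e is NULL else 0 for e in data]
list_offsets, structs = [0], []
for e in data:
    if e is not NULL:
        structs.extend(e)
    list_offsets.append(len(structs))

# Buffer 2: struct validity over the flattened child structs.
struct_is_null = [1 if s is NULL else 0 for s in structs]

# Assumption: a null struct is not materialized in the child arrays.
present = [s for s in structs if s is not NULL]

# Buffers 3-5: f0 validity, offsets, and concatenated character data.
f0_is_null = [1 if f0 is NULL else 0 for f0, _ in present]
f0_offsets, f0_values = [0], ""
for f0, _ in present:
    if f0 is not NULL:
        f0_values += f0
    f0_offsets.append(len(f0_values))

# Buffers 6-7: f1 validity and non-null values.
f1_is_null = [1 if f1 is NULL else 0 for _, f1 in present]
f1_values = [f1 for _, f1 in present if f1 is not NULL]

print(list_is_null, list_offsets, struct_is_null)
print(f0_is_null, f0_offsets, f0_values)
print(f1_is_null, f1_values)
```

Under these assumptions this reproduces buffers 0 through 5 and the f1 values exactly as listed above; for buffer 6, though, it yields 0 0 1 1 rather than 1 1 0 0, so with the stated 1 = true convention the f1 is_null entries in the proposal appear to be inverted.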
> This makes sense, and is consistent with Parquet's depth-first
> (pre-order) flattening of nested schemas into a flat list<SchemaElement>
> (https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L556).
>
> > For example, consider this schema:
> >
> > List<Struct<String=List<UInt8>, Int32>>
> >
> > the pre-order buffer order is
> >
> > 0: nulls top level list
> > 1: list offsets
> > 2: struct field 0 nulls
> > 3: struct field 0 list offsets
> > 4: struct field 0 inner UInt8 values
> > 5: struct field 1 nulls
> > 6: struct field 1 Int32 values
>
> Regarding the buffers, we would want to embed the null count for each
> logical array in some way, so that the null bitmap can be omitted.
> This also spares having to compute the actual null counts when
> receiving a row batch.
>
> For example, for a List<T> we could have a couple of cases:
>
> List
> - null_count > 0 -> 2 buffers, 1 for the null bitmap, 1 for the list offsets
> - null_count == 0 -> 1 buffer for the list offsets
>
> Recursively, the buffers for the type T child data would similarly
> have their own null counts, determining the actual number of memory
> buffers. Let me know what you think.
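[Editor's sketch] The pre-order buffer listing quoted above can be reproduced with a small recursive walk. The type classes below are toy illustrations, not an Arrow API. Two properties of the listing are baked into the sketch: the struct contributes no validity buffer of its own (which is exactly the gap Chao's question points at), and the inner UInt8 of the string is treated as non-nullable:

```python
# Toy type tree for sketching depth-first (pre-order) buffer enumeration.
class ListType:
    def __init__(self, child):
        self.child = child

class StructType:
    def __init__(self, fields):
        self.fields = fields  # list of (name, child_type) pairs

class PrimitiveType:
    def __init__(self, name, nullable=True):
        self.name, self.nullable = name, nullable

def buffers(typ):
    """Yield buffer kinds in pre-order, mirroring the listing above."""
    if isinstance(typ, ListType):
        yield "nulls"      # validity bitmap for the list
        yield "offsets"    # offsets into the child array
        yield from buffers(typ.child)
    elif isinstance(typ, StructType):
        # As in the listing, the struct carries no buffer of its own here;
        # each field brings its own validity bitmap.
        for _name, child in typ.fields:
            yield from buffers(child)
    else:
        if typ.nullable:
            yield "nulls"
        yield "values"

# List<Struct<String=List<UInt8>, Int32>>, modeling String as List<UInt8>
schema = ListType(StructType([
    ("field 0", ListType(PrimitiveType("UInt8", nullable=False))),
    ("field 1", PrimitiveType("Int32")),
]))

print(list(buffers(schema)))
# seven buffers, in the same order as items 0-6 in the listing above
```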
> > The flatbuffer schema for the data header would then be:
> >
> > namespace DataHeaders;
> >
> > struct Buffer {
> >   data: long;
> >   length: int;
> > }
> >
> > // Representing a single array (aka ValueVector), typically
> > table BufferList {
> >   // With FBS it is not possible to know the length of an array
> >   n_buffers: int;
> >   buffers: [Buffer];
> > }
> >
> > // Multiple arrays -- could be used for long arrays or a
> > // whole table row batch
> > table ArrayBatch {
> >   n_arrays: int;
> >   arrays: [BufferList];
> > }
>
> I've been tinkering with flatbuffers, and array lengths are indeed
> available in the generated IDL (see flatbuffers::Vector::size in the
> C++ API, for example), so the n_buffers / n_arrays fields aren't
> needed.
>
> Per the discussion above re: null counts, something like this may work:
>
> struct Buffer {
>   data: long;
>   length: int;
> }
>
> table ArrayNode {
>   length: int;
>   null_count: int;
>
>   /// The number of buffers would be inferred from the node type and
>   /// actual null_count
>   buffers: [Buffer];
> }
>
> table RowBatch {
>   /// Nodes correspond to the depth-first flattened logical schema
>   nodes: [ArrayNode];
> }
>
> So for an example schema:
>
> row batch schema {
>   f0: List<List<UInt8>>,
>   f1: Int32
> }
>
> The flattened version of this schema contains 4 ArrayNodes, 3 for the
> nested f0 column and 1 for the flat f1 column.
>
> - Wes
>
> > On Mon, Feb 29, 2016 at 6:13 PM, Wes McKinney <w...@cloudera.com> wrote:
> >
> >> hello all,
> >>
> >> I wanted to kick-start the process of coming up with a standardized /
> >> canonical metadata specification that we can use for describing Arrow
> >> data to be moved between systems. This breaks down into at least two
> >> distinct kinds of metadata:
> >>
> >> 1) "Schemas": physical types, logical types, child array types, struct
> >> field names, and so forth.
> >> This does not contain information about the size
> >> of the actual physical data (which depends on the length of arrays and
> >> the sizes of list/variable-length type dimensions).
> >>
> >> 2) "Data headers": a description of the shape of a physical chunk of
> >> data associated with a particular schema. Array length, null count,
> >> memory buffer offsets and sizes, etc. This is the information you need
> >> to compute the right pointers into a shared memory region or IPC/RPC
> >> buffer and reconstruct Arrow container classes.
> >>
> >> Since #2 will depend on some of the details of #1, I suggest we start
> >> figuring out #1 first. As far as the type metadata is concerned, to
> >> avoid excess bike-shedding we should break that problem into:
> >>
> >> A) The general layout of the type metadata / schemas
> >>
> >> B) The technology we use for representing the schemas (and data
> >> headers) in an implementation-independent way for use in an IPC/RPC
> >> setting (and even to "store" ephemeral data on disk)
> >>
> >> On item B, from what I've seen with Parquet and similar file formats
> >> with embedded metadata, and in the spirit of Arrow's
> >> "deserialize-nothing" ethos, I suggest we explore no-deserialization
> >> technologies like Google's Flatbuffers
> >> (https://github.com/google/flatbuffers) as a more CPU-efficient
> >> alternative to Thrift, Protobuf, or Avro. In large schemas,
> >> technologies like Thrift can result in significant overhead in
> >> "needle-in-haystack" problems where you are picking only a few columns
> >> out of very wide tables (> 1000s of columns), and it may be best to
> >> try to avoid this if at all possible.
> >>
> >> I would like some help stewarding the design process on this from the
> >> Arrow PMC, and in particular from those who have worked on the design
> >> and implementation of Parquet and other file formats and systems for
> >> which Arrow is an immediate intended companion. There is a lot we can
> >> learn from those past experiences.
> >>
> >> Thank you,
> >> Wes
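[Editor's sketch] Returning to Wes's RowBatch example earlier in the thread: the claim that the flattened schema { f0: List<List<UInt8>>, f1: Int32 } produces 4 ArrayNodes, and the null_count-based buffer inference for lists, can both be checked with a short sketch. The tuple encoding of types is the editor's own toy notation, not the real flatbuffers IDL:

```python
# Types encoded as tuples: ("list", child), ("struct", [children]),
# or ("prim", name).
def count_nodes(typ):
    """Number of ArrayNodes a type contributes to the depth-first flattening."""
    kind = typ[0]
    if kind == "list":
        return 1 + count_nodes(typ[1])
    if kind == "struct":
        return 1 + sum(count_nodes(child) for child in typ[1])
    return 1  # primitive: one node

def list_buffer_count(null_count):
    """Per the List<T> example: the validity bitmap is omitted when
    null_count == 0, leaving only the offsets buffer."""
    return 2 if null_count > 0 else 1

f0 = ("list", ("list", ("prim", "UInt8")))  # List<List<UInt8>> -> 3 nodes
f1 = ("prim", "Int32")                      # Int32 -> 1 node
print(count_nodes(f0) + count_nodes(f1))    # 4 ArrayNodes in the row batch
```

Whether a struct node counts itself plus its children (as `count_nodes` assumes) is not spelled out in the thread; only the list and primitive cases are taken directly from Wes's example.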