I agree with everything Wes said. I'd also note that we need to address some issues with the definition of the varchar(string) and varbinary(binary) types. They need to be considered primitives rather than Arrow compositions. I think this will be clearer if I get up a schema spec based on our initial proposal in the IPC stuff. We should make sure the schema spec and the layout spec use the same base concepts/definitions.
On Wed, Apr 13, 2016 at 5:08 AM, Wes McKinney <w...@cloudera.com> wrote: > hi Micah, > > thank you for working through these details. > > On Sun, Apr 10, 2016 at 7:18 AM, Micah Kornfield <emkornfi...@gmail.com> > wrote: > > As part of the pull request, I converted example arrays to show their > > full layout in gory detail. Part of this led to some clarifications > > to questions that people have apparently been asking about how null > > flags interact between parent/child arrays and some cleanup in > > language. I don't think these changes should be controversial but I'm > > summarizing them here in case they are. > > > > * Offset arrays for lists and dense unions are now called offset > > buffers, implying we don't track null values separately for them and > > their sizes can be computed based on the array they are contained in. > > Similarly the types array for unions is now called a types buffer for > > the same reason. > > > > Yes -- I think this is a good language clarification to make it clear > that these pieces of data are not proper Arrow data structures. > > > * For structs and unions, a null struct or null union (e.g. the slot > > on the parent array is marked null) does not imply that the > > corresponding slot on the children arrays is empty. Similarly the > > type for a slot on a sparse union, doesn't imply that the slots on the > > children arrays that aren't of the type selected are null. This > > facilitates construction of sparse unions and structs without > > requiring copying/mutating the underlying arrays. > > > > I agree with this also -- the rule of thumb is that the information in > the parent node supercedes what is in the child nodes. So if there is > a null at any level of the nested type tree, then anything below it is > not strictly defined. > > > One open question that this leaves is for unions, is if a slot is > > marked as null, should there be any contract imposed on the > > corresponding offset and type value (e.g. should we make it -1. I > > would vote no, but can see reasons for going both ways). > > My preference would be as above for the data in that slot to be > "undefined", similar to regular arrays of integers. This isn't > specified in the doc, I don't think, but the use case of layering a > bitmap on preallocated memory to create an array with nulls is an > important enough use case that we probably should not place any > constraints on "masked" slots for any data types. > > best, > Wes > > > > > Thanks, > > Micah > > > > P.S. This pull request also includes a summary of endianness but not > > the other issues discussed on the most recent document layout thread > > (I'm giving those a little bit more time for others to comment and > > will open up a separate JIRA/Pull request to implement them). >