hi Jacques, Let's definitely hammer out this bit on the string / binary types. In the Flatbuffers spec, they are already first-class types:
https://github.com/apache/arrow/blob/master/format/Message.fbs#L68 I agree that having a special primitive type for this common case (to avoid extra structure in IPC messages) is a reasonable tradeoff. You could obviously perform zero-copy conversions between UTF8/Binary and List<UInt8>. So here would be the C++-ish pseudocode for the new physical layout: struct Base { int32_t length; int32_t null_count; } struct UTF8 : public Base { int32_t* offsets; uint8_t* data; } struct UInt8 : public Base { uint8_t* values; } struct List<T> : public Base { int32_t* offsets; T* values; } Whereas before, UTF8 was merely an alias for List<UInt8>, now the nested structure is "collapsed" so that UTF8 (equivalently Binary) is primitive. No objections here; it should also make the string builder/array code path simpler. Micah, what are your thoughts having now wrangled with this code quite a bit? - Wes On Wed, Apr 13, 2016 at 10:54 AM, Jacques Nadeau <jacq...@apache.org> wrote: > I agree with everything Wes said. > > I'd also note that we need to address some issues with the definition of > the varchar(string) and varbinary(binary) types. They need to be considered > primitives rather than Arrow compositions. I think this will be clearer if > I get up a schema spec based on our initial proposal in the IPC stuff. We > should make sure the schema spec and the layout spec use the same base > concepts/definitions. > > On Wed, Apr 13, 2016 at 5:08 AM, Wes McKinney <w...@cloudera.com> wrote: > >> hi Micah, >> >> thank you for working through these details. >> >> On Sun, Apr 10, 2016 at 7:18 AM, Micah Kornfield <emkornfi...@gmail.com> >> wrote: >> > As part of the pull request, I converted example arrays to show their >> > full layout in gory detail. Part of this led to some clarifications >> > to questions that people have apparently been asking about how null >> > flags interact between parent/child arrays and some cleanup in >> > language. I don't think these changes should be controversial but I'm >> > summarizing them here in case they are. >> > >> > * Offset arrays for lists and dense unions are now called offset >> > buffers, implying we don't track null values separately for them and >> > their sizes can be computed based on the array they are contained in. >> > Similarly the types array for unions is now called a types buffer for >> > the same reason. >> > >> >> Yes -- I think this is a good language clarification to make it clear >> that these pieces of data are not proper Arrow data structures. >> >> > * For structs and unions, a null struct or null union (e.g. the slot >> > on the parent array is marked null) does not imply that the >> > corresponding slot on the children arrays is empty. Similarly the >> > type for a slot on a sparse union, doesn't imply that the slots on the >> > children arrays that aren't of the type selected are null. This >> > facilitates construction of sparse unions and structs without >> > requiring copying/mutating the underlying arrays. >> > >> >> I agree with this also -- the rule of thumb is that the information in >> the parent node supercedes what is in the child nodes. So if there is >> a null at any level of the nested type tree, then anything below it is >> not strictly defined. >> >> > One open question that this leaves is for unions, is if a slot is >> > marked as null, should there be any contract imposed on the >> > corresponding offset and type value (e.g. should we make it -1. I >> > would vote no, but can see reasons for going both ways). >> >> My preference would be as above for the data in that slot to be >> "undefined", similar to regular arrays of integers. This isn't >> specified in the doc, I don't think, but the use case of layering a >> bitmap on preallocated memory to create an array with nulls is an >> important enough use case that we probably should not place any >> constraints on "masked" slots for any data types. >> >> best, >> Wes >> >> > >> > Thanks, >> > Micah >> > >> > P.S. This pull request also includes a summary of endianness but not >> > the other issues discussed on the most recent document layout thread >> > (I'm giving those a little bit more time for others to comment and >> > will open up a separate JIRA/Pull request to implement them). >>