Re: Some minor points from ARROW-94 (https://github.com/apache/arrow/pull/58)

Wes McKinney Fri, 22 Apr 2016 12:16:22 -0700

hi Jacques,

Let's definitely hammer out this bit on the string / binary types. In
the Flatbuffers spec, they are already first-class types:


https://github.com/apache/arrow/blob/master/format/Message.fbs#L68

I agree that having a special primitive type for this common case (to
avoid extra structure in IPC messages) is a reasonable tradeoff. You
could obviously perform zero-copy conversions between UTF8/Binary and
List<UInt8>.

So here would be the C++-ish pseudocode for the new physical layout:

struct Base {
  int32_t length;
  int32_t null_count;
}

struct UTF8 : public Base {
  int32_t* offsets;
  uint8_t* data;
}

struct UInt8 : public Base {
  uint8_t* values;
}

struct List<T> : public Base {
  int32_t* offsets;
  T* values;
}

Whereas before, UTF8 was merely an alias for List<UInt8>, now the
nested structure is "collapsed" so that UTF8 (equivalently Binary) is
primitive.

No objections here; it should also make the string builder/array code
path simpler. Micah, what are your thoughts having now wrangled with
this code quite a bit?

- Wes

On Wed, Apr 13, 2016 at 10:54 AM, Jacques Nadeau <[email protected]> wrote:
> I agree with everything Wes said.
>
> I'd also note that we need to address some issues with the definition of
> the varchar(string) and varbinary(binary) types. They need to be considered
> primitives rather than Arrow compositions. I think this will be clearer if
> I get up a schema spec based on our initial proposal in the IPC stuff. We
> should make sure the schema spec and the layout spec use the same base
> concepts/definitions.
>
> On Wed, Apr 13, 2016 at 5:08 AM, Wes McKinney <[email protected]> wrote:
>
>> hi Micah,
>>
>> thank you for working through these details.
>>
>> On Sun, Apr 10, 2016 at 7:18 AM, Micah Kornfield <[email protected]>
>> wrote:
>> > As part of the pull request, I converted example arrays to show their
>> > full layout in gory detail.  Part of this led to some clarifications
>> > to questions that people have apparently been asking about how null
>> > flags interact between parent/child arrays and some cleanup in
>> > language.  I don't think these changes should be controversial but I'm
>> > summarizing them here in case they are.
>> >
>> > * Offset arrays for lists and dense unions are now called offset
>> > buffers, implying we don't track null values separately for them and
>> > their sizes can be computed based on the array they are contained in.
>> >  Similarly the types array for unions is now called a types buffer for
>> > the same reason.
>> >
>>
>> Yes -- I think this is a good language clarification to make it clear
>> that these pieces of data are not proper Arrow data structures.
>>
>> > * For structs and unions, a null struct or null union (e.g. the slot
>> > on the parent array is marked null) does not imply that the
>> > corresponding slot on the children arrays is empty.  Similarly the
>> > type for a slot on a sparse union, doesn't imply that the slots on the
>> > children arrays that aren't of the type selected are null.  This
>> > facilitates construction of sparse unions and structs without
>> > requiring copying/mutating the underlying arrays.
>> >
>>
>> I agree with this also -- the rule of thumb is that the information in
>> the parent node supercedes what is in the child nodes. So if there is
>> a null at any level of the nested type tree, then anything below it is
>> not strictly defined.
>>
>> > One open question that this leaves is for unions, is if a slot is
>> > marked as null, should there be any contract imposed on the
>> > corresponding offset and type value (e.g. should we make it -1.  I
>> > would vote no, but can see reasons for going both ways).
>>
>> My preference would be as above for the data in that slot to be
>> "undefined", similar to regular arrays of integers. This isn't
>> specified in the doc, I don't think, but the use case of layering a
>> bitmap on preallocated memory to create an array with nulls is an
>> important enough use case that we probably should not place any
>> constraints on "masked" slots for any data types.
>>
>> best,
>> Wes
>>
>> >
>> > Thanks,
>> > Micah
>> >
>> > P.S. This pull request also includes a summary of endianness but not
>> > the other issues discussed on the most recent document layout thread
>> > (I'm giving those a little bit more time for others to comment and
>> > will open up a separate JIRA/Pull request to implement them).
>>

Re: Some minor points from ARROW-94 (https://github.com/apache/arrow/pull/58)

Reply via email to