I like the current scheme of making String (UTF8) a primitive type in
regards to RPC but not modeling it as a special Array type.  I think
the key is formally describing how logical types map to physical types
either is the Flatbuffer schema or in a separate document.

I think there are two use-cases here:
1.  Reconstructing Array's off the wire.
2.  Writing algorithms/builders to deal with specific logical types
built on Arrays.

For case 1, I think it is simpler to not special case string types as
primitives.  Understanding that a logical String type maps to a
List<Utf8> should be sufficient and allows us to re-use the
serialization code for ListArrays for these types.

For case 2, it would be nice to utilize the type system of the host
programming language to express the semantics of a function call (e.g.
ParseString(StringArray strings) vs ParseString(ListArray strings),
but I think this can be implemented without requiring a new primitive
type in the spec.

The more interesting thing to me is if we should have a new primitive
type for fixed length lists (e.g. the logical type CHAR).   The
offsets array isn't necessary in this case for random access.

Also, the way the VARCHAR types (based on a comment in the C++
(https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L63)
are currently described as a null terminated UTF8 is problematic.  I
believe null bytes are valid UTF8 characters.


On Fri, Apr 22, 2016 at 12:14 PM, Wes McKinney <w...@cloudera.com> wrote:
> hi Jacques,
>
> Let's definitely hammer out this bit on the string / binary types. In
> the Flatbuffers spec, they are already first-class types:
>
> https://github.com/apache/arrow/blob/master/format/Message.fbs#L68
>
> I agree that having a special primitive type for this common case (to
> avoid extra structure in IPC messages) is a reasonable tradeoff. You
> could obviously perform zero-copy conversions between UTF8/Binary and
> List<UInt8>.
>
> So here would be the C++-ish pseudocode for the new physical layout:
>
> struct Base {
>   int32_t length;
>   int32_t null_count;
> }
>
> struct UTF8 : public Base {
>   int32_t* offsets;
>   uint8_t* data;
> }
>
> struct UInt8 : public Base {
>   uint8_t* values;
> }
>
> struct List<T> : public Base {
>   int32_t* offsets;
>   T* values;
> }
>
> Whereas before, UTF8 was merely an alias for List<UInt8>, now the
> nested structure is "collapsed" so that UTF8 (equivalently Binary) is
> primitive.
>
> No objections here; it should also make the string builder/array code
> path simpler. Micah, what are your thoughts having now wrangled with
> this code quite a bit?
>
> - Wes
>
> On Wed, Apr 13, 2016 at 10:54 AM, Jacques Nadeau <jacq...@apache.org> wrote:
>> I agree with everything Wes said.
>>
>> I'd also note that we need to address some issues with the definition of
>> the varchar(string) and varbinary(binary) types. They need to be considered
>> primitives rather than Arrow compositions. I think this will be clearer if
>> I get up a schema spec based on our initial proposal in the IPC stuff. We
>> should make sure the schema spec and the layout spec use the same base
>> concepts/definitions.
>>
>> On Wed, Apr 13, 2016 at 5:08 AM, Wes McKinney <w...@cloudera.com> wrote:
>>
>>> hi Micah,
>>>
>>> thank you for working through these details.
>>>
>>> On Sun, Apr 10, 2016 at 7:18 AM, Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>> > As part of the pull request, I converted example arrays to show their
>>> > full layout in gory detail.  Part of this led to some clarifications
>>> > to questions that people have apparently been asking about how null
>>> > flags interact between parent/child arrays and some cleanup in
>>> > language.  I don't think these changes should be controversial but I'm
>>> > summarizing them here in case they are.
>>> >
>>> > * Offset arrays for lists and dense unions are now called offset
>>> > buffers, implying we don't track null values separately for them and
>>> > their sizes can be computed based on the array they are contained in.
>>> >  Similarly the types array for unions is now called a types buffer for
>>> > the same reason.
>>> >
>>>
>>> Yes -- I think this is a good language clarification to make it clear
>>> that these pieces of data are not proper Arrow data structures.
>>>
>>> > * For structs and unions, a null struct or null union (e.g. the slot
>>> > on the parent array is marked null) does not imply that the
>>> > corresponding slot on the children arrays is empty.  Similarly the
>>> > type for a slot on a sparse union, doesn't imply that the slots on the
>>> > children arrays that aren't of the type selected are null.  This
>>> > facilitates construction of sparse unions and structs without
>>> > requiring copying/mutating the underlying arrays.
>>> >
>>>
>>> I agree with this also -- the rule of thumb is that the information in
>>> the parent node supercedes what is in the child nodes. So if there is
>>> a null at any level of the nested type tree, then anything below it is
>>> not strictly defined.
>>>
>>> > One open question that this leaves is for unions, is if a slot is
>>> > marked as null, should there be any contract imposed on the
>>> > corresponding offset and type value (e.g. should we make it -1.  I
>>> > would vote no, but can see reasons for going both ways).
>>>
>>> My preference would be as above for the data in that slot to be
>>> "undefined", similar to regular arrays of integers. This isn't
>>> specified in the doc, I don't think, but the use case of layering a
>>> bitmap on preallocated memory to create an array with nulls is an
>>> important enough use case that we probably should not place any
>>> constraints on "masked" slots for any data types.
>>>
>>> best,
>>> Wes
>>>
>>> >
>>> > Thanks,
>>> > Micah
>>> >
>>> > P.S. This pull request also includes a summary of endianness but not
>>> > the other issues discussed on the most recent document layout thread
>>> > (I'm giving those a little bit more time for others to comment and
>>> > will open up a separate JIRA/Pull request to implement them).
>>>

Reply via email to