Re: Discussion: Should we make string/binary types first class Arrow Array types?

Wes McKinney Wed, 10 Aug 2016 11:26:05 -0700

I see the primary point of discussion on this to be whether String/Binary have 
the same layout on the wire as List<uint8-not null> (i.e. one Field/Node in the 
type tree versus two). I think what we are working towards is a single field 
rather than a List field and an Int field (bit width 8).
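
To make the "one Field/Node versus two" distinction concrete, here is a minimal sketch. The `Field` struct and the helper functions are hypothetical illustrations, not the Flatbuffers types generated from format/Message.fbs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a Field in the schema metadata.
// This is NOT the Flatbuffers-generated type from format/Message.fbs.
struct Field {
    std::string type_name;
    bool nullable;
    std::vector<Field> children;
};

// Counts the nodes in the type tree rooted at `field`.
std::size_t CountNodes(const Field& field) {
    std::size_t count = 1;
    for (const Field& child : field.children) {
        count += CountNodes(child);
    }
    return count;
}

// String described as List<UInt8-not null>: a List node plus a UInt8 child.
Field StringAsNestedList() {
    return Field{"List", true, {Field{"UInt8", false, {}}}};
}

// String described as a first-class type: a single node with no children.
Field StringAsSingleField() {
    return Field{"String", true, {}};
}
```

Under these assumptions, `CountNodes(StringAsNestedList())` yields 2 and `CountNodes(StringAsSingleField())` yields 1, which is the difference in the flattened metadata under discussion.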


> On Aug 10, 2016, at 10:46 AM, Julien Le Dem <[email protected]> wrote:
> 
> Hi,
> Agreed. 
> To paraphrase/complement what has been said:
> The types in format/Message.fbs [1] are "Logical types" or "user facing 
> types", close to SQL types (they include String, Timestamp, Decimal, ...) and 
> are related to Parquet's logical types [2][3].
> For each of those types there's a corresponding physical layout that is 
> formally specified (example discussed here: String => List<UInt8-not null>).
> I'm going to open a couple of JIRAs to finalise the types and clarify the 
> layout.
> 
> [1] 
> https://github.com/apache/arrow/blob/34e7f48cb71428c4d78cf00d8fdf0045532d6607/format/Message.fbs#L63
> [2] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
> [3] 
> https://github.com/apache/parquet-format/blob/66a5a7b982e291e06afb1da7ffe9da211318caba/src/main/thrift/parquet.thrift#L48
> 
> Julien
> 
>> On Tue, Aug 9, 2016 at 4:20 PM, Wes McKinney <[email protected]> wrote:
>> hi Micah
>> 
>> I'm sorry for dropping the ball on this discussion. copying Julien as
>> he's been looking at the metadata recently.
>> 
>> My thinking is that we should indicate in the format document that the
>> String and Binary logical types, as a matter of cross-implementation
>> convention, will have List<UInt8-not null> memory layout.
>> 
>> In the C++ library at least, we can collapse the class structure to
>> make BinaryArray and StringArray not a subclass of ListArray,
>> factoring out common code that can be reused into helper inline
>> functions.
>> 
>> Class hierarchy aside, the main impact is adding entries to the Type
>> union in the Flatbuffers metadata
>> https://github.com/apache/arrow/blob/master/format/Message.fbs#L63
>> 
>> Accordingly, in the metadata and in RPC/IPC scenarios, binary/string
>> would be a single array unit in the buffer stream and flattened Field
>> metadata rather than nested types (2 array units as they are
>> presently).
>> 
>> Separately, I am very interested in discussing a form of logical
>> Binary/StringArray in the C++ implementation that is internally
>> dictionary encoded. I'm proposing this as a possible new UTF-8
>> representation for pandas in the future:
>> https://wesm.github.io/pandas2-design/strings.html#possible-solution-new-non-numpy-string-memory-layout
>> 
>> Hopefully this isn't too incoherent, but it would be good to arrive at
>> some conclusion in this discussion if we need to implement the
>> changes.
>> 
>> Thanks
>> Wes
>> 
>> On Tue, Jul 26, 2016 at 10:09 PM, Micah Kornfield <[email protected]> 
>> wrote:
>> > Wes, Jacques, others...
>> >
>> > Any thoughts on this?   Let me know if you would like to clarify something,
>> > I think I was a little long winded.  It would be good to come to a
>> > consensus one way or another.
>> >
>> > Thanks,
>> > Micah
>> >
>> > On Sun, Jul 17, 2016 at 1:43 PM, Micah Kornfield <[email protected]>
>> > wrote:
>> >
>> >> Hi Wes and Jacques,
>> >>
>> >> Thanks for the thorough analysis.  I agree that Strings should be easy to
>> >> work with.  I'm just trying to understand how making a distinct string
>> >> type defined in the memory layout spec [1] brings a lot of additional
>> >> utility.
>> >>
>> >> I think of there being two distinct concerns with Arrow:
>> >>
>> >> 1.  Layout - What metadata and data elements are required to represent a
>> >> specific type in a flat address space.
>> >>
>> >> 2.  Manipulation - How we build interfaces for working with the memory
>> >> layout.
>> >>
>> >> With respect to Memory Layout, introducing a new string type seems to add
>> >> redundancy.  As Wes noted, List<uint8 [not-null]> is sufficient to
>> >> represent the layout for strings.  So the main benefit of introducing a
>> >> new memory layout for a string type is an optimization.  By introducing
>> >> the new type we avoid invalid string construction (having uint8_t
>> >> elements marked as null in the nested array) and save a few bytes and an
>> >> extra function call when "(de)serializing" a string column.
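
As an illustration of the "invalid string construction" case mentioned above, here is a minimal sketch of a string column stored as List<uint8> (an offsets buffer plus a byte buffer). The struct and function names are hypothetical and are not part of the Arrow C++ library; the top-level null bitmap of the list itself is omitted for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified view of a string column stored as List<uint8>.
struct ListOfBytes {
    std::vector<int32_t> offsets;      // one entry per string, plus one
    std::vector<uint8_t> bytes;        // concatenated child values
    std::vector<bool> child_is_valid;  // per-byte validity; empty = all valid
};

// Returns true when the offsets are consistent and no child byte is null.
// A generic List<uint8> permits null children; a string must not have any.
bool IsValidStringLayout(const ListOfBytes& column) {
    if (column.offsets.empty() || column.offsets.front() < 0) return false;
    for (std::size_t i = 1; i < column.offsets.size(); ++i) {
        if (column.offsets[i] < column.offsets[i - 1]) return false;
    }
    if (static_cast<std::size_t>(column.offsets.back()) > column.bytes.size()) {
        return false;
    }
    if (!column.child_is_valid.empty()) {
        if (column.child_is_valid.size() != column.bytes.size()) return false;
        for (bool valid : column.child_is_valid) {
            if (!valid) return false;
        }
    }
    return true;
}

// Copies out the i-th string using the half-open range in the offsets buffer.
std::string GetString(const ListOfBytes& column, std::size_t i) {
    assert(i + 1 < column.offsets.size());
    const int32_t begin = column.offsets[i];
    const int32_t end = column.offsets[i + 1];
    return std::string(column.bytes.begin() + begin, column.bytes.begin() + end);
}
```

A dedicated string type would make the `child_is_valid` check unnecessary, because no separate child array with its own null bitmap would exist.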
>> >>
>> >> With respect to manipulation, I agree that having the right API/modeling
>> >> to treat strings as first class objects makes a lot of sense.  But I
>> >> don't think that the specification needs to explicitly make allowances
>> >> for it.  Once you have constructed a Java/C++ wrapper around the memory
>> >> layout you can choose to expose the right convenience APIs through OO
>> >> abstraction.  The construction of the correct object wrapper is governed
>> >> by Metadata defined in [2] and an understanding of how the logical type
>> >> maps to the appropriate memory layout.  At the moment metadata doesn't
>> >> specify any sort of class hierarchy, which I believe is the correct thing
>> >> to do from a specification perspective.
>> >>
>> >> The C++ implementation currently has StringArray inheriting from
>> >> ListArray, which was an implementation convenience and something we
>> >> should revisit (I agree with Wes's point on not relying on C++'s type
>> >> system for casting).
>> >> The primary argument for changing the existing implementation seems to be
>> >> that strings should be considered "non-nested" types.  Whether strings are
>> >> nested or not seems to fall squarely into the manipulation concern (except
>> >> for the optimizations mentioned above) and is therefore, IMO, an
>> >> implementation detail.  When thinking about how this plays out in code
>> >> I imagine a visitor pattern.  I've provided some pseudo-code below for two
>> >> possible visitor classes that make StringArrays first class objects but
>> >> wouldn't require updates to the specification.
>> >>
>> >> I've tried to think where testing a particular object for "nested"-ness
>> >> makes sense by itself and couldn't come up with something off the top of
>> >> my head.  It seems once you determine an Array is non-nested you still
>> >> want to test for the exact primitive type you are dealing with.
>> >>
>> >> Given these points I'm still ambivalent about adding a new string/binary
>> >> type to the spec. It would be an improvement but it seems like a somewhat
>> >> minor improvement.  If people can provide stronger use-cases for adding
>> >> the new type I'd be less ambivalent, but at the moment this seems like
>> >> more of an implementation concern.
>> >>
>> >> Thanks,
>> >> Micah
>> >>
>> >> // Visitor patterns for arrays that do not require any updates to the
>> >> // memory layout.
>> >> class ClassVisitor {
>> >>     void visit(Int32Array );
>> >>     void visit(UInt32Array );
>> >>     void visit(DoubleArray );
>> >>     void visit(ListArray );
>> >>     // If we changed the hierarchy, this would be sufficient to treat
>> >>     // strings as a first class type.
>> >>     void visit(StringArray );
>> >>     // Other types elided.
>> >> };
>> >>
>> >> or
>> >>
>> >> // Type disambiguation happens by calling the correctly overloaded
>> >> // method.
>> >> class BufferVisitor {
>> >>     void visit_numeric(TypeMetadata, null_bitmap, value_buffer);
>> >>     void visit_list(TypeMetadata, null_bitmap, offset_buffer,
>> >>                     Array nested_type);
>> >>     // Sufficient for treating string types as non-nested.
>> >>     void visit_string(TypeMetadata, null_bitmap, offset_buffer,
>> >>                       byte_buffer);
>> >>     // Other types elided.
>> >> };
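
For reference, a compilable sketch of the first (class-based) visitor variant might look like the following. The array structs are empty placeholders and `TypeNameVisitor` is a hypothetical example visitor; none of these are the actual Arrow C++ classes. The key assumption is that `StringArray` is not a subclass of `ListArray`, so the string overload is selected directly.

```cpp
#include <cassert>
#include <string>

// Hypothetical placeholder array classes; not the Arrow C++ library types.
struct Int32Array {};
struct ListArray {};
struct StringArray {};  // deliberately NOT derived from ListArray

// Visitor interface with one overload per concrete array class.
class ArrayVisitor {
public:
    virtual ~ArrayVisitor() = default;
    virtual void Visit(const Int32Array& array) = 0;
    virtual void Visit(const ListArray& array) = 0;
    virtual void Visit(const StringArray& array) = 0;
};

// Example visitor that records which overload was chosen.
class TypeNameVisitor : public ArrayVisitor {
public:
    void Visit(const Int32Array&) override { name = "int32"; }
    void Visit(const ListArray&) override { name = "list"; }
    void Visit(const StringArray&) override { name = "string"; }

    std::string name;
};
```

Calling `Visit(StringArray{})` on a `TypeNameVisitor` sets `name` to "string" without any change to the memory layout specification, which is the point the pseudo-code above is making.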
>> >>
>> >> [1] https://github.com/apache/arrow/blob/master/format/Layout.md
>> >> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs
>> >>
>> >>
> 
> 
> 
> -- 
> Julien
