I see the primary point of discussion on this to be whether String/Binary have the same layout on the wire as List<uint8-not null> (i.e. one Field/Node in the type tree versus two). I think what we are working towards is a single field rather than a List field and an Int field (bit width 8).
> On Aug 10, 2016, at 10:46 AM, Julien Le Dem <jul...@dremio.com> wrote: > > Hi, > Agreed. > To paraphrase/complement what has been said: > The types in format/Message.fbs [1] are "Logical types" or "user facing > types", close to SQL types (they include String, Timestamp, Decimal, ...) and > are related to Parquet's logical types [2][3]. > For each of those types there's a corresponding physical layout that is > formally specified (example discussed here: String => List<UInt8-not null>). > I'm going to open a couple of JIRA's to finalise the types and clarify the > layout. > > [1] > https://github.com/apache/arrow/blob/34e7f48cb71428c4d78cf00d8fdf0045532d6607/format/Message.fbs#L63 > [2] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md > [3] > https://github.com/apache/parquet-format/blob/66a5a7b982e291e06afb1da7ffe9da211318caba/src/main/thrift/parquet.thrift#L48 > > Julien > >> On Tue, Aug 9, 2016 at 4:20 PM, Wes McKinney <wesmck...@gmail.com> wrote: >> hi Micah >> >> I'm sorry for dropping the ball on this discussion. copying Julien as >> he's been looking at the metadata recently. >> >> My thinking is that we should indicate in the format document that the >> String and Binary logical types, as a matter of cross-implementation >> convention, will have List<UInt8-not null> memory layout. >> >> In the C++ library at least, we can collapse the class structure to >> make BinaryArray and StringArray not a subclass of ListArray, >> factoring out common code that can be reused into helper inline >> functions. >> >> Class hierarchy aside the main impact is adding entries to the Type >> union in the Flatbuffers metadata >> https://github.com/apache/arrow/blob/master/format/Message.fbs#L63 >> >> Accordingly, in the metadata and in RPC/IPC scenarios, binary/string >> would be a single array unit in the buffer stream and flattened Field >> metadata rather than nested types (2 array units as they are >> presently). >> >> Separately, I am very interested in discussing a form of logical >> Binary/StringArray in the C++ implementation that is internally >> dictionary encoded. I'm proposing this as a possible new UTF-8 >> representation for pandas in the future: >> https://wesm.github.io/pandas2-design/strings.html#possible-solution-new-non-numpy-string-memory-layout >> >> Hopefully this isn't too incoherent, but it would be good to arrive at >> some conclusion in this discussion if we need to implement the >> changes. >> >> Thanks >> Wes >> >> On Tue, Jul 26, 2016 at 10:09 PM, Micah Kornfield <emkornfi...@gmail.com> >> wrote: >> > Wes, Jacques, others... >> > >> > Any thoughts on this? Let me know if you would like to clarify something, >> > I think I was a little long winded. It would be good to come to a >> > consensus one way or another. >> > >> > Thanks, >> > Micah >> > >> > On Sun, Jul 17, 2016 at 1:43 PM, Micah Kornfield <emkornfi...@gmail.com> >> > wrote: >> > >> >> Hi Wes and Jacques, >> >> >> >> Thanks for the thorough analysis. I agree that Strings should be easy to >> >> work with. I'm just trying to understand how making a distinct string >> >> type >> >> defined in the memory layout spec [1] brings a lot of additional utility. >> >> >> >> I think of there being two distinct concerns with Arrow: >> >> >> >> 1. Layout - What metadata and data elements are required to represent a >> >> specific type in a flat address space. >> >> >> >> 2. Manipulation - How we build interfaces for working with the memory >> >> layout. >> >> >> >> With respect to Memory Layout, introducing a new string type seems to add >> >> redundancy. As Wes noted, List<uint8 [not-null]> is sufficient to >> >> represent the layout for strings. So the main benefits for introducing a >> >> new memory layout for a string type is an optimization. By introducing >> >> the >> >> new type we avoid invalid string construction (having uint_t elements >> >> marked as null in the nested array) and to save a few bytes/extra function >> >> call when "(de)serializing" a string column. >> >> >> >> With respect to manipulation, I agree, that having the right API/modeling >> >> to treat strings as first class objects makes a lot of sense. But I >> >> don't >> >> think that the specification needs to explicitly make allowances for it. >> >> Once you have constructed a Java/C++ wrapper around the memory layout you >> >> can choose to expose the right convenience APIs through OO abstraction. >> >> The construction of the correct object wrapper is governed by Metadata >> >> defined in [2] and an understanding of how the logical type maps to the >> >> appropriate memory layout. At the moment metadata doesn't specify any >> >> sort >> >> of class hierarchy which I believe is the correct thing to do from a >> >> specification perspective. >> >> >> >> The C++ implementation currently has StringArrays inheriting from a >> >> ListArrays which was an implementation convenience and something we should >> >> revisit (I agree with Wes's point on not relying on C++'s type system for >> >> casting). >> >> The primary argument for changing the existing implementation seems to be >> >> that strings should be considered "non-nested" types. Whether strings are >> >> nested or not seems to fall squarely into the manipulation concern (except >> >> for the optimizations mentioned above) and therefore, IMO, an >> >> implementation detail. When thinking about how this plays out in code >> >> I imagine a visitor pattern. I've provided some pseudo-code below for two >> >> possible visitor classes make StringArrays first class objects but >> >> wouldn't >> >> require updates to the specification. >> >> >> >> I've tried to think where testing a particular object for "nested"-ness >> >> makes sense by itself and couldn't come up with something off the top of >> >> my >> >> head. It seems once you determine an Array is non-nested you still want >> >> to >> >> test for exact primitive type you are dealing with. >> >> >> >> Given these points I'm still ambivalent about adding a new string/binary >> >> type to the spec. It would be an improvement but it seems like a somewhat >> >> minor improvement. If people can provide stronger use-cases for adding >> >> the >> >> new type I'd be less ambivalent, but at the moment this seems like more of >> >> an implementation concern. >> >> >> >> Thanks, >> >> Micah >> >> >> >> // Visitor patterns for arrays, that do not require any updates to the >> >> memory layout. >> >> class ClassVisitor { >> >> void visit(Int32Array ); >> >> void visit(UInt32Array ); >> >> void visit(DoubleArray ); >> >> void visit(ListArray ); >> >> void visit(StringArray ); // if we changed the hierarchy, this would >> >> be sufficient to treat strings as a first class type >> >> // Other types elided >> >> } >> >> >> >> or >> >> >> >> class BufferVisitor { // type disambiguation happens by calling the >> >> correctly >> >> // overloaded method >> >> void visit_numeric(TypeMedata, null_bitmap, value_buffer); >> >> void visit_list(TypeMedata, null_bitmap, offset_buffer, Array >> >> nested_type); >> >> void visit_string(TypeMetadata, null_bitmap, offset_buffer, >> >> byte_buffer); // sufficient for treating string types as non-nested. >> >> // Other types elided. >> >> } >> >> >> >> [1] https://github.com/apache/arrow/blob/master/format/Layout.md >> >> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs >> >> >> >> > > > > -- > Julien