Re: Some minor points from ARROW-94 (https://github.com/apache/arrow/pull/58)

Micah Kornfield Wed, 25 May 2016 13:55:12 -0700

Just to follow-up on this.  I got distracted on a few other items on
the C++ implementation side, but my next task is to get a String types
working for the C++ IPC unit test.   Once I send a PR for that, it
might help clarify the concerns on both sides and we can hammer out
the details from there.


Sound reasonable?

-Micah

On Fri, May 13, 2016 at 10:33 AM, Wes McKinney <[email protected]> wrote:
> Nudging this issue. We need to sketch out a plan to get IPC
> integration tests working between the Java and C++ implementations --
> what's the most expedient way we can work toward making that happen?
>
> On Sun, May 1, 2016 at 1:02 AM, Micah Kornfield <[email protected]> wrote:
>> s/spark/slack/g
>>
>> On Sun, May 1, 2016 at 12:58 AM, Micah Kornfield <[email protected]> 
>> wrote:
>>> I'm not exactly sure of my availability if I am available on spark, I
>>> can likely make the hangout.
>>>
>>> On Fri, Apr 29, 2016 at 4:40 PM, Wes McKinney <[email protected]> wrote:
>>>> I was traveling today but I can do a hangout about this next week.
>>>>
>>>> On Thu, Apr 28, 2016 at 7:53 PM, Jacques Nadeau <[email protected]> wrote:
>>>>> Let's do a quick hangout on this. I'd like to better understand as I'm not
>>>>> sure we're all talking about the same thing.
>>>>>
>>>>> On Thu, Apr 28, 2016 at 5:30 PM, Micah Kornfield <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I'm -1 on making a new primitive type in the memory layout spec [1].
>>>>>>
>>>>>> +1 on clarifying [2], to indicate it is expected that the "Values
>>>>>> array" for Utf8 and Binary types should never contain null elements.
>>>>>>
>>>>>> [1] https://github.com/apache/arrow/blob/master/format/Layout.md
>>>>>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs
>>>>>>
>>>>>> On Thu, Apr 28, 2016 at 3:08 PM, Wes McKinney <[email protected]> wrote:
>>>>>> > Bumping this conversation.
>>>>>> >
>>>>>> > I'm +0 on making VARBINARY and String (identical VARBINARY but with a
>>>>>> > UTF8 guarantee) primitive types in the spec. Let me know what others
>>>>>> > think.
>>>>>> >
>>>>>> > Thanks
>>>>>> >
>>>>>> > On Fri, Apr 22, 2016 at 6:30 PM, Wes McKinney <[email protected]> 
>>>>>> > wrote:
>>>>>> >> On Fri, Apr 22, 2016 at 6:06 PM, Jacques Nadeau <[email protected]>
>>>>>> wrote:
>>>>>> >>> On Fri, Apr 22, 2016 at 2:42 PM, Wes McKinney <[email protected]>
>>>>>> wrote:
>>>>>> >>>
>>>>>> >>>> On Fri, Apr 22, 2016 at 4:56 PM, Micah Kornfield <
>>>>>> [email protected]>
>>>>>> >>>> wrote:
>>>>>> >>>> > I like the current scheme of making String (UTF8) a primitive type
>>>>>> in
>>>>>> >>>> > regards to RPC but not modeling it as a special Array type.  I 
>>>>>> >>>> > think
>>>>>> >>>> > the key is formally describing how logical types map to physical
>>>>>> types
>>>>>> >>>> > either is the Flatbuffer schema or in a separate document.
>>>>>> >>>> >
>>>>>> >>>> > I think there are two use-cases here:
>>>>>> >>>> > 1.  Reconstructing Array's off the wire.
>>>>>> >>>> > 2.  Writing algorithms/builders to deal with specific logical 
>>>>>> >>>> > types
>>>>>> >>>> > built on Arrays.
>>>>>> >>>> >
>>>>>> >>>> > For case 1, I think it is simpler to not special case string types
>>>>>> as
>>>>>> >>>> > primitives.  Understanding that a logical String type maps to a
>>>>>> >>>> > List<Utf8> should be sufficient and allows us to re-use the
>>>>>> >>>> > serialization code for ListArrays for these types.
>>>>>> >>>> >
>>>>>> >>>>
>>>>>> >>>> It is simpler for the IPC serde code-path. I'll let Jacques comment
>>>>>> >>>> but one downside of having strings as a nested type is that there 
>>>>>> >>>> are
>>>>>> >>>> certain code paths (for example: Parquet-related) which deal with 
>>>>>> >>>> the
>>>>>> >>>> flat table case. To make a Parquet analogy, there is the special
>>>>>> >>>> BYTE_ARRAY primitive type, even though you could technically 
>>>>>> >>>> represent
>>>>>> >>>> variable-length binary data using a repeated field and using
>>>>>> >>>> repetition/definition levels (but the encoding/decoding overhead for
>>>>>> >>>> this in Parquet is much more significant than Arrow). There may be
>>>>>> >>>> other reasons.
>>>>>> >>>>
>>>>>> >>>
>>>>>> >>> I'm a bit confused about what everyone means. I didn't actually 
>>>>>> >>> realize
>>>>>> >>> that this [1] had been merged yet but I'm generally on board with how
>>>>>> it is
>>>>>> >>> constructed.
>>>>>> >>>
>>>>>> >>> With regards to the c++ implementation of the items at [1], 
>>>>>> >>> abstracting
>>>>>> >>> shared physical representations out seems fine to me but I don't 
>>>>>> >>> think
>>>>>> we
>>>>>> >>> should necessitate effective 3NF for [1].
>>>>>> >>>
>>>>>> >>> One of the key points that I'm focused on in the Java space is that 
>>>>>> >>> I'd
>>>>>> >>> like to move to an always nullable pattern. This is vastly 
>>>>>> >>> simplifying
>>>>>> from
>>>>>> >>> a code generation, casting and complexity perspective and is a 
>>>>>> >>> nominal
>>>>>> cost
>>>>>> >>> when using column execution. If binary and varchar are primitive 
>>>>>> >>> types
>>>>>> as
>>>>>> >>> there there is no weird special casing of avoiding the nullability
>>>>>> bitmap
>>>>>> >>> in the case of variable width items (for the offsets). But that is an
>>>>>> >>> implementation detail of the Java library.
>>>>>> >>>
>>>>>> >>> So in general, I like the scheme at [1] for the concepts that we all
>>>>>> are
>>>>>> >>> talking about (as opposed to eliminating lines 67 & 68)
>>>>>> >>>
>>>>>> >>> [1] https://github.com/apache/arrow/blob/master/format/Message.fbs
>>>>>> >>>
>>>>>> >>
>>>>>> >> Well, the issue is that mapping of metadata onto memory layout for IPC
>>>>>> >> purposes, at least. You can use the List code path for arbitrary List
>>>>>> >> types as well as strings and binary. It sounds like either way on the
>>>>>> >> Java side you're going to collapse UTF8 / BINARY into a primitive so
>>>>>> >> that you don't have to manage a separate never-used bitmap for the
>>>>>> >> string/binary data. It seems useful enough to me to have a primitive
>>>>>> >> variable-length binary/UTF8 type but I do not feel strongly about it.
>>>>>> >>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>> > For case 2, it would be nice to utilize the type system of the 
>>>>>> >>>> > host
>>>>>> >>>> > programming language to express the semantics of a function call
>>>>>> (e.g.
>>>>>> >>>> > ParseString(StringArray strings) vs ParseString(ListArray 
>>>>>> >>>> > strings),
>>>>>> >>>> > but I think this can be implemented without requiring a new
>>>>>> primitive
>>>>>> >>>> > type in the spec.
>>>>>> >>>> >
>>>>>> >>>> > The more interesting thing to me is if we should have a new
>>>>>> primitive
>>>>>> >>>> > type for fixed length lists (e.g. the logical type CHAR).   The
>>>>>> >>>> > offsets array isn't necessary in this case for random access.
>>>>>> >>>> >
>>>>>> >>>> > Also, the way the VARCHAR types (based on a comment in the C++
>>>>>> >>>> > (
>>>>>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L63)
>>>>>> >>>> > are currently described as a null terminated UTF8 is problematic. 
>>>>>> >>>> >  I
>>>>>> >>>> > believe null bytes are valid UTF8 characters.
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>> >
>>>>>> >>>>
>>>>>> >>>> Good point, sorry about that. We probably would need to 
>>>>>> >>>> length-prefix
>>>>>> >>>> the values, then.
>>>>>> >>>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> Is this an input/output interface? Arrow structures should all be 4
>>>>>> byte
>>>>>> >>> offset based and be neither length prefixed nor null terminated.
>>>>>> >>
>>>>>> >> This was a question around the VARCHAR(k) type (which in many
>>>>>> >> databases is distinct from a TEXT type in which any value can be
>>>>>> >> arbitrary length). So if you have a VARCHAR(50), you guarantee that no
>>>>>> >> value exceeds 50 characters. In Arrow I suppose this is just metadata
>>>>>> >> because you have the offsets encoding length (pardon the jet lag).
>>>>>> >> Micah -- I think we can nix the `VarcharType` in the C++ code,
>>>>>> >> leftovers from my earliest draft implementation.
>>>>>> >>
>>>>>> >> - Wes
>>>>>>

Re: Some minor points from ARROW-94 (https://github.com/apache/arrow/pull/58)

Reply via email to