Bumping this conversation. I'm +0 on making VARBINARY and String (identical to VARBINARY but with a UTF8 guarantee) primitive types in the spec. Let me know what others think.
Thanks

On Fri, Apr 22, 2016 at 6:30 PM, Wes McKinney <w...@cloudera.com> wrote:
> On Fri, Apr 22, 2016 at 6:06 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>> On Fri, Apr 22, 2016 at 2:42 PM, Wes McKinney <w...@cloudera.com> wrote:
>>
>>> On Fri, Apr 22, 2016 at 4:56 PM, Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>> > I like the current scheme of making String (UTF8) a primitive type in
>>> > regards to RPC but not modeling it as a special Array type. I think
>>> > the key is formally describing how logical types map to physical types,
>>> > either in the Flatbuffer schema or in a separate document.
>>> >
>>> > I think there are two use cases here:
>>> > 1. Reconstructing Arrays off the wire.
>>> > 2. Writing algorithms/builders to deal with specific logical types
>>> > built on Arrays.
>>> >
>>> > For case 1, I think it is simpler to not special-case string types as
>>> > primitives. Understanding that a logical String type maps to a
>>> > List<Utf8> should be sufficient and allows us to re-use the
>>> > serialization code for ListArrays for these types.
>>> >
>>>
>>> It is simpler for the IPC serde code path. I'll let Jacques comment,
>>> but one downside of having strings as a nested type is that there are
>>> certain code paths (for example: Parquet-related) which deal with the
>>> flat table case. To make a Parquet analogy, there is the special
>>> BYTE_ARRAY primitive type, even though you could technically represent
>>> variable-length binary data using a repeated field and
>>> repetition/definition levels (but the encoding/decoding overhead for
>>> this in Parquet is much more significant than in Arrow). There may be
>>> other reasons.
>>>
>>
>> I'm a bit confused about what everyone means. I didn't actually realize
>> that this [1] had been merged yet, but I'm generally on board with how it
>> is constructed.
>>
>> With regards to the C++ implementation of the items at [1], abstracting
>> shared physical representations out seems fine to me, but I don't think we
>> should necessitate effective 3NF for [1].
>>
>> One of the key points that I'm focused on in the Java space is that I'd
>> like to move to an always-nullable pattern. This is vastly simplifying from
>> a code generation, casting and complexity perspective, and is a nominal
>> cost when using columnar execution. If binary and varchar are primitive
>> types, then there is no weird special casing of avoiding the nullability
>> bitmap in the case of variable-width items (for the offsets). But that is
>> an implementation detail of the Java library.
>>
>> So in general, I like the scheme at [1] for the concepts that we all are
>> talking about (as opposed to eliminating lines 67 & 68).
>>
>> [1] https://github.com/apache/arrow/blob/master/format/Message.fbs
>>
>
> Well, the issue is the mapping of metadata onto memory layout for IPC
> purposes, at least. You can use the List code path for arbitrary List
> types as well as strings and binary. It sounds like either way on the
> Java side you're going to collapse UTF8 / BINARY into a primitive so
> that you don't have to manage a separate never-used bitmap for the
> string/binary data. It seems useful enough to me to have a primitive
> variable-length binary/UTF8 type, but I do not feel strongly about it.
>
>>
>>
>>> > For case 2, it would be nice to utilize the type system of the host
>>> > programming language to express the semantics of a function call (e.g.
>>> > ParseString(StringArray strings) vs ParseString(ListArray strings)),
>>> > but I think this can be implemented without requiring a new primitive
>>> > type in the spec.
>>> >
>>> > The more interesting thing to me is whether we should have a new
>>> > primitive type for fixed-length lists (e.g. the logical type CHAR). The
>>> > offsets array isn't necessary in this case for random access.
>>> >
>>> > Also, the way the VARCHAR types (based on a comment in the C++ code:
>>> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L63)
>>> > are currently described as null-terminated UTF8 is problematic. I
>>> > believe null bytes are valid UTF8 characters.
>>>
>>> >
>>>
>>> Good point, sorry about that. We probably would need to length-prefix
>>> the values, then.
>>>
>>
>>
>> Is this an input/output interface? Arrow structures should all be 4-byte
>> offset based and be neither length-prefixed nor null-terminated.
>
> This was a question around the VARCHAR(k) type (which in many
> databases is distinct from a TEXT type, in which any value can be
> arbitrary length). So if you have a VARCHAR(50), you guarantee that no
> value exceeds 50 characters. In Arrow I suppose this is just metadata,
> because you have the offsets encoding length (pardon the jet lag).
> Micah -- I think we can nix the `VarcharType` in the C++ code,
> leftovers from my earliest draft implementation.
>
> - Wes
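For reference, here is a minimal C++ sketch of the offsets-plus-values layout the thread is describing for a variable-length binary/UTF8 column. The `StringColumn` name and its members are illustrative only, not Arrow's actual classes or API.

    #include <cstdint>
    #include <string>
    #include <vector>

    // Illustrative only -- not Arrow's real types. A variable-length
    // binary/UTF8 column as discussed above: a concatenated values byte
    // buffer plus (n + 1) four-byte offsets, so value i occupies bytes
    // [offsets[i], offsets[i + 1]). There are no length prefixes and no
    // null terminators, so embedded 0x00 bytes in a value are fine.
    struct StringColumn {
      std::vector<uint8_t> validity;  // one entry per slot here for brevity;
                                      // a real bitmap would pack bits
      std::vector<int32_t> offsets;   // length = num_values + 1
      std::vector<uint8_t> values;    // concatenated UTF8 bytes

      std::string Get(int32_t i) const {
        return std::string(values.begin() + offsets[i],
                           values.begin() + offsets[i + 1]);
      }
    };

    // Physically this is the same memory as List<UInt8>; treating it as a
    // primitive mostly changes the metadata, not the layout.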
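And a companion sketch of the fixed-width case raised for CHAR-like types, again with hypothetical names: when every element has the same width, the offsets buffer is redundant because value i always lives at bytes [i * width, (i + 1) * width).

    #include <cstdint>
    #include <string>
    #include <vector>

    // Illustrative fixed-width column, e.g. a logical CHAR(k): no offsets
    // buffer is needed for random access.
    struct FixedCharColumn {
      int32_t width;                // the k in CHAR(k)
      std::vector<uint8_t> values;  // num_values * width bytes

      std::string Get(int32_t i) const {
        return std::string(values.begin() + int64_t(i) * width,
                           values.begin() + int64_t(i + 1) * width);
      }
    };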