Re: Some minor points from ARROW-94 (https://github.com/apache/arrow/pull/58)

Wes McKinney Fri, 22 Apr 2016 15:32:14 -0700

On Fri, Apr 22, 2016 at 6:06 PM, Jacques Nadeau <[email protected]> wrote:
> On Fri, Apr 22, 2016 at 2:42 PM, Wes McKinney <[email protected]> wrote:
>
>> On Fri, Apr 22, 2016 at 4:56 PM, Micah Kornfield <[email protected]>
>> wrote:
>> > I like the current scheme of making String (UTF8) a primitive type in
>> > regards to RPC but not modeling it as a special Array type.  I think
>> > the key is formally describing how logical types map to physical types
>> > either in the Flatbuffer schema or in a separate document.
>> >
>> > I think there are two use-cases here:
>> > 1.  Reconstructing Arrays off the wire.
>> > 2.  Writing algorithms/builders to deal with specific logical types
>> > built on Arrays.
>> >
>> > For case 1, I think it is simpler to not special case string types as
>> > primitives.  Understanding that a logical String type maps to a
>> > List<Utf8> should be sufficient and allows us to re-use the
>> > serialization code for ListArrays for these types.
>> >
>>
>> It is simpler for the IPC serde code-path. I'll let Jacques comment
>> but one downside of having strings as a nested type is that there are
>> certain code paths (for example: Parquet-related) which deal with the
>> flat table case. To make a Parquet analogy, there is the special
>> BYTE_ARRAY primitive type, even though you could technically represent
>> variable-length binary data using a repeated field and using
>> repetition/definition levels (but the encoding/decoding overhead for
>> this in Parquet is much more significant than in Arrow). There may be
>> other reasons.
>>
>
> I'm a bit confused about what everyone means. I didn't actually realize
> that this [1] had been merged yet but I'm generally on board with how it is
> constructed.
>
> With regards to the c++ implementation of the items at [1], abstracting
> shared physical representations out seems fine to me but I don't think we
> should necessitate effective 3NF for [1].
>
> One of the key points that I'm focused on in the Java space is that I'd
> like to move to an always nullable pattern. This is vastly simplifying from
> a code generation, casting and complexity perspective and is a nominal cost
> when using column execution. If binary and varchar are primitive types as
> there, there is no weird special casing of avoiding the nullability bitmap
> in the case of variable width items (for the offsets). But that is an
> implementation detail of the Java library.
>
> So in general, I like the scheme at [1] for the concepts that we all are
> talking about (as opposed to eliminating lines 67 & 68)
>
> [1] https://github.com/apache/arrow/blob/master/format/Message.fbs
>


Well, the issue is the mapping of metadata onto memory layout, for IPC
purposes at least. You can use the List code path for arbitrary List
types as well as strings and binary. It sounds like either way on the
Java side you're going to collapse UTF8 / BINARY into a primitive so
that you don't have to manage a separate never-used bitmap for the
string/binary data. It seems useful enough to me to have a primitive
variable-length binary/UTF8 type but I do not feel strongly about it.
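
To make the two options concrete, here is a rough sketch in Python. It is
not the C++ or Java implementation: plain lists and bytes stand in for
Arrow buffers, and every name in it (build_buffers, PrimitiveUtf8,
UInt8Child, ListOfUInt8, as_list) is made up for illustration. It only
shows that the primitive form carries one validity bitmap, while the List
form also carries a per-byte child bitmap that is always all-valid for
string data.

```python
# Illustrative only: Python lists and bytes stand in for Arrow buffers.
# All names here are hypothetical, not part of any Arrow library.
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


def build_buffers(
    strings: Sequence[Optional[str]],
) -> Tuple[List[bool], List[int], bytes]:
    """Pack strings into (validity, offsets, data); None is a null slot."""
    validity: List[bool] = []
    offsets: List[int] = [0]
    data = bytearray()
    for s in strings:
        validity.append(s is not None)
        if s is not None:
            data += s.encode("utf-8")
        # A null slot repeats the previous offset, so its length is zero.
        offsets.append(len(data))
    return validity, offsets, bytes(data)


@dataclass
class PrimitiveUtf8:
    """Option A: variable-length UTF8 modeled as a primitive type."""
    validity: List[bool]  # one flag per string slot
    offsets: List[int]    # number of slots + 1 entries
    data: bytes           # concatenated UTF8 bytes, no child bitmap

    def get(self, i: int) -> Optional[str]:
        if not self.validity[i]:
            return None
        return self.data[self.offsets[i]:self.offsets[i + 1]].decode("utf-8")


@dataclass
class UInt8Child:
    """Child array of the List form; its bitmap is never consulted."""
    validity: List[bool]  # one flag per byte, always True for strings
    data: bytes


@dataclass
class ListOfUInt8:
    """Option B: the same strings modeled through the List code path."""
    validity: List[bool]
    offsets: List[int]
    values: UInt8Child

    def get(self, i: int) -> Optional[str]:
        if not self.validity[i]:
            return None
        raw = self.values.data[self.offsets[i]:self.offsets[i + 1]]
        return raw.decode("utf-8")


def as_list(validity: List[bool], offsets: List[int], data: bytes) -> ListOfUInt8:
    """Wrap the same buffers in the nested form, adding the unused bitmap."""
    return ListOfUInt8(validity, offsets, UInt8Child([True] * len(data), data))
```

Both forms read back the same values from the same offsets and data; the
only difference in this sketch is the extra child bitmap in the List form.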

>
>
>> > For case 2, it would be nice to utilize the type system of the host
>> > programming language to express the semantics of a function call (e.g.
>> > ParseString(StringArray strings) vs ParseString(ListArray strings),
>> > but I think this can be implemented without requiring a new primitive
>> > type in the spec.
>> >
>> > The more interesting thing to me is if we should have a new primitive
>> > type for fixed length lists (e.g. the logical type CHAR).   The
>> > offsets array isn't necessary in this case for random access.
>> >
>> > Also, the way the VARCHAR types are currently described (based on a
>> > comment in the C++ code,
>> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L63)
>> > as null-terminated UTF8 is problematic.  I believe null bytes are
>> > valid UTF8 characters.
>>
>>
>> >
>>
>> Good point, sorry about that. We probably would need to length-prefix
>> the values, then.
>>
>
>
> Is this an input/output interface? Arrow structures should all be 4 byte
> offset based and be neither length prefixed nor null terminated.

This was a question around the VARCHAR(k) type (which in many
databases is distinct from a TEXT type in which any value can be
arbitrary length). So if you have a VARCHAR(50), you guarantee that no
value exceeds 50 characters. In Arrow I suppose this is just metadata
because you have the offsets encoding length (pardon the jet lag).
Micah -- I think we can nix the `VarcharType` in the C++ code,
leftovers from my earliest draft implementation.
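
A rough Python sketch of what I mean, with plain lists and bytes standing
in for the offsets and data buffers. The function names (encode, value_at,
char_lengths, satisfies_varchar) are invented for this example, and
counting "characters" as Unicode code points is an assumption on my part.
Real Arrow offsets would be 4-byte integers rather than Python ints. The
point is that the offsets alone delimit each value, so an embedded null
byte is harmless and the VARCHAR(k) bound is only a check on top.

```python
# Illustrative only: hypothetical helper names, not an Arrow API.
from typing import List, Sequence, Tuple


def encode(strings: Sequence[str]) -> Tuple[List[int], bytes]:
    """Concatenate UTF8 bytes and record an end offset after each value."""
    offsets: List[int] = [0]
    data = bytearray()
    for s in strings:
        data += s.encode("utf-8")
        offsets.append(len(data))
    return offsets, bytes(data)


def value_at(offsets: List[int], data: bytes, i: int) -> str:
    """Slice value i using offsets only: no terminator, no length prefix."""
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")


def char_lengths(offsets: List[int], data: bytes) -> List[int]:
    """Length of each value in code points (assumed meaning of 'characters')."""
    return [len(value_at(offsets, data, i)) for i in range(len(offsets) - 1)]


def satisfies_varchar(offsets: List[int], data: bytes, k: int) -> bool:
    """Treat VARCHAR(k) as metadata: a bound checked against the values."""
    return all(n <= k for n in char_lengths(offsets, data))
```

For example, encoding "a\x00b" followed by "héllo" keeps the null byte
inside the first value, where a null-terminated reader would have stopped
after "a".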

- Wes
