Re: Some minor points from ARROW-94 (https://github.com/apache/arrow/pull/58)

Wes McKinney Fri, 22 Apr 2016 14:43:42 -0700

On Fri, Apr 22, 2016 at 4:56 PM, Micah Kornfield <[email protected]> wrote:
> I like the current scheme of making String (UTF8) a primitive type in
> regards to RPC but not modeling it as a special Array type.  I think
> the key is formally describing how logical types map to physical types
> either in the Flatbuffer schema or in a separate document.
>
> I think there are two use-cases here:
> 1.  Reconstructing Arrays off the wire.
> 2.  Writing algorithms/builders to deal with specific logical types
> built on Arrays.
>
> For case 1, I think it is simpler to not special case string types as
> primitives.  Understanding that a logical String type maps to a
> List<Utf8> should be sufficient and allows us to re-use the
> serialization code for ListArrays for these types.
>


It is simpler for the IPC serde code-path. I'll let Jacques comment,
but one downside of having strings as a nested type is that there are
certain code paths (for example, Parquet-related ones) which deal with
the flat table case. To make a Parquet analogy: Parquet has the special
BYTE_ARRAY primitive type, even though you could technically represent
variable-length binary data using a repeated field with
repetition/definition levels (though the encoding/decoding overhead for
this in Parquet is much more significant than in Arrow). There may be
other reasons.
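
For concreteness, here is a minimal sketch (Python, purely
illustrative) of the list-style layout under discussion: one
contiguous buffer of UTF-8 bytes plus an offsets array, so a string
column is physically just a list of bytes. The names encode_strings
and get_string are hypothetical and are not Arrow APIs; null handling
is left out.

```python
# Illustrative sketch only; these helpers are hypothetical, not Arrow's API.

def encode_strings(strings):
    """Pack strings into one UTF-8 values buffer plus an offsets list."""
    offsets = [0]
    values = bytearray()
    for s in strings:
        values += s.encode("utf-8")
        # Each offset records where the next element starts, in bytes.
        offsets.append(len(values))
    return offsets, bytes(values)


def get_string(offsets, values, i):
    """Random access: element i spans values[offsets[i]:offsets[i + 1]]."""
    return values[offsets[i]:offsets[i + 1]].decode("utf-8")


offsets, values = encode_strings(["foo", "", "héllo"])
# "héllo" takes 6 bytes in UTF-8, so offsets count bytes, not characters.
```

The same two buffers would serve any variable-length list of a
fixed-width child type, which is why the serialization code could be
shared between strings and ListArrays.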

> For case 2, it would be nice to utilize the type system of the host
> programming language to express the semantics of a function call (e.g.
> ParseString(StringArray strings) vs ParseString(ListArray strings),
> but I think this can be implemented without requiring a new primitive
> type in the spec.
>
> The more interesting thing to me is if we should have a new primitive
> type for fixed-length lists (e.g. the logical type CHAR).   The
> offsets array isn't necessary in this case for random access.
>
> Also, the way the VARCHAR types are currently described (based on a
> comment in the C++ code:
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L63)
> as null-terminated UTF8 is problematic.  I believe null bytes are
> valid UTF8 characters.
>
>

Good point, sorry about that. We probably would need to length-prefix
the values, then.
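
To illustrate the problem, a small sketch (Python, illustrative only):
U+0000 encodes to a single 0x00 byte in UTF-8, so a null-terminated
encoding cannot round-trip a value that contains it, while a
length-prefixed one can. The helper names and the 4-byte little-endian
length prefix are assumptions made for this example, not something the
spec defines.

```python
import struct

# Hypothetical helpers for illustration; not part of Arrow.

def encode_null_terminated(values):
    """Join UTF-8 encoded values, ending each with a 0x00 byte."""
    return b"".join(v.encode("utf-8") + b"\x00" for v in values)


def decode_null_terminated(buf):
    """Split on 0x00; the final empty piece after the last terminator is dropped."""
    return [p.decode("utf-8") for p in buf.split(b"\x00")[:-1]]


def encode_length_prefixed(values):
    """Prefix each value with its byte length (assumed 4-byte little-endian)."""
    out = bytearray()
    for v in values:
        data = v.encode("utf-8")
        out += struct.pack("<I", len(data)) + data
    return bytes(out)


def decode_length_prefixed(buf):
    """Read a length, then exactly that many bytes, until the buffer ends."""
    result = []
    pos = 0
    while pos < len(buf):
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        result.append(buf[pos:pos + n].decode("utf-8"))
        pos += n
    return result


original = ["a\x00b", "c"]
broken = decode_null_terminated(encode_null_terminated(original))
intact = decode_length_prefixed(encode_length_prefixed(original))
```

Here the null-terminated round trip splits the first value in two at
the embedded null byte, whereas the length-prefixed round trip returns
the original values unchanged.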
