Let's do a quick hangout on this. I'd like to better understand as I'm not
sure we're all talking about the same thing.

On Thu, Apr 28, 2016 at 5:30 PM, Micah Kornfield <emkornfi...@gmail.com>
wrote:

> I'm -1 on making a new primitive type in the memory layout spec [1].
>
> +1 on clarifying [2], to indicate it is expected that the "Values
> array" for Utf8 and Binary types should never contain null elements.
>
> [1] https://github.com/apache/arrow/blob/master/format/Layout.md
> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs
>
> On Thu, Apr 28, 2016 at 3:08 PM, Wes McKinney <w...@cloudera.com> wrote:
> > Bumping this conversation.
> >
> > I'm +0 on making VARBINARY and String (identical VARBINARY but with a
> > UTF8 guarantee) primitive types in the spec. Let me know what others
> > think.
> >
> > Thanks
> >
> > On Fri, Apr 22, 2016 at 6:30 PM, Wes McKinney <w...@cloudera.com> wrote:
> >> On Fri, Apr 22, 2016 at 6:06 PM, Jacques Nadeau <jacq...@apache.org>
> wrote:
> >>> On Fri, Apr 22, 2016 at 2:42 PM, Wes McKinney <w...@cloudera.com>
> wrote:
> >>>
> >>>> On Fri, Apr 22, 2016 at 4:56 PM, Micah Kornfield <
> emkornfi...@gmail.com>
> >>>> wrote:
> >>>> > I like the current scheme of making String (UTF8) a primitive type
> in
> >>>> > regards to RPC but not modeling it as a special Array type.  I think
> >>>> > the key is formally describing how logical types map to physical
> types
> >>>> > either is the Flatbuffer schema or in a separate document.
> >>>> >
> >>>> > I think there are two use-cases here:
> >>>> > 1.  Reconstructing Array's off the wire.
> >>>> > 2.  Writing algorithms/builders to deal with specific logical types
> >>>> > built on Arrays.
> >>>> >
> >>>> > For case 1, I think it is simpler to not special case string types
> as
> >>>> > primitives.  Understanding that a logical String type maps to a
> >>>> > List<Utf8> should be sufficient and allows us to re-use the
> >>>> > serialization code for ListArrays for these types.
> >>>> >
> >>>>
> >>>> It is simpler for the IPC serde code-path. I'll let Jacques comment
> >>>> but one downside of having strings as a nested type is that there are
> >>>> certain code paths (for example: Parquet-related) which deal with the
> >>>> flat table case. To make a Parquet analogy, there is the special
> >>>> BYTE_ARRAY primitive type, even though you could technically represent
> >>>> variable-length binary data using a repeated field and using
> >>>> repetition/definition levels (but the encoding/decoding overhead for
> >>>> this in Parquet is much more significant than Arrow). There may be
> >>>> other reasons.
> >>>>
> >>>
> >>> I'm a bit confused about what everyone means. I didn't actually realize
> >>> that this [1] had been merged yet but I'm generally on board with how
> it is
> >>> constructed.
> >>>
> >>> With regards to the c++ implementation of the items at [1], abstracting
> >>> shared physical representations out seems fine to me but I don't think
> we
> >>> should necessitate effective 3NF for [1].
> >>>
> >>> One of the key points that I'm focused on in the Java space is that I'd
> >>> like to move to an always nullable pattern. This is vastly simplifying
> from
> >>> a code generation, casting and complexity perspective and is a nominal
> cost
> >>> when using column execution. If binary and varchar are primitive types
> as
> >>> there there is no weird special casing of avoiding the nullability
> bitmap
> >>> in the case of variable width items (for the offsets). But that is an
> >>> implementation detail of the Java library.
> >>>
> >>> So in general, I like the scheme at [1] for the concepts that we all
> are
> >>> talking about (as opposed to eliminating lines 67 & 68)
> >>>
> >>> [1] https://github.com/apache/arrow/blob/master/format/Message.fbs
> >>>
> >>
> >> Well, the issue is that mapping of metadata onto memory layout for IPC
> >> purposes, at least. You can use the List code path for arbitrary List
> >> types as well as strings and binary. It sounds like either way on the
> >> Java side you're going to collapse UTF8 / BINARY into a primitive so
> >> that you don't have to manage a separate never-used bitmap for the
> >> string/binary data. It seems useful enough to me to have a primitive
> >> variable-length binary/UTF8 type but I do not feel strongly about it.
> >>
> >>>
> >>>
> >>>> > For case 2, it would be nice to utilize the type system of the host
> >>>> > programming language to express the semantics of a function call
> (e.g.
> >>>> > ParseString(StringArray strings) vs ParseString(ListArray strings),
> >>>> > but I think this can be implemented without requiring a new
> primitive
> >>>> > type in the spec.
> >>>> >
> >>>> > The more interesting thing to me is if we should have a new
> primitive
> >>>> > type for fixed length lists (e.g. the logical type CHAR).   The
> >>>> > offsets array isn't necessary in this case for random access.
> >>>> >
> >>>> > Also, the way the VARCHAR types (based on a comment in the C++
> >>>> > (
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L63)
> >>>> > are currently described as a null terminated UTF8 is problematic.  I
> >>>> > believe null bytes are valid UTF8 characters.
> >>>>
> >>>>
> >>>> >
> >>>>
> >>>> Good point, sorry about that. We probably would need to length-prefix
> >>>> the values, then.
> >>>>
> >>>
> >>>
> >>> Is this an input/output interface? Arrow structures should all be 4
> byte
> >>> offset based and be neither length prefixed nor null terminated.
> >>
> >> This was a question around the VARCHAR(k) type (which in many
> >> databases is distinct from a TEXT type in which any value can be
> >> arbitrary length). So if you have a VARCHAR(50), you guarantee that no
> >> value exceeds 50 characters. In Arrow I suppose this is just metadata
> >> because you have the offsets encoding length (pardon the jet lag).
> >> Micah -- I think we can nix the `VarcharType` in the C++ code,
> >> leftovers from my earliest draft implementation.
> >>
> >> - Wes
>

Reply via email to