Re: Some minor points from ARROW-94 (https://github.com/apache/arrow/pull/58)

Jacques Nadeau Wed, 25 May 2016 14:45:12 -0700

Let's start by creating a simple usecase. For example, I would start with
nullable 4 byte integer, maybe and use the example of java > (col1) >
python (or c++) > (newcol) > java that is one what I'd call a single batch
algorithm (e.g. one batch of values in, one out).


A simple way to sidestep the memory management/reference counting issues
initially is for java to preallocate the output location for newcol for the
python (or c++) code.

On Wed, May 25, 2016 at 1:25 PM, Micah Kornfield <[email protected]>
wrote:

> Just to follow-up on this.  I got distracted on a few other items on
> the C++ implementation side, but my next task is to get a String types
> working for the C++ IPC unit test.   Once I send a PR for that, it
> might help clarify the concerns on both sides and we can hammer out
> the details from there.
>
> Sound reasonable?
>
> -Micah
>
> On Fri, May 13, 2016 at 10:33 AM, Wes McKinney <[email protected]>
> wrote:
> > Nudging this issue. We need to sketch out a plan to get IPC
> > integration tests working between the Java and C++ implementations --
> > what's the most expedient way we can work toward making that happen?
> >
> > On Sun, May 1, 2016 at 1:02 AM, Micah Kornfield <[email protected]>
> wrote:
> >> s/spark/slack/g
> >>
> >> On Sun, May 1, 2016 at 12:58 AM, Micah Kornfield <[email protected]>
> wrote:
> >>> I'm not exactly sure of my availability if I am available on spark, I
> >>> can likely make the hangout.
> >>>
> >>> On Fri, Apr 29, 2016 at 4:40 PM, Wes McKinney <[email protected]>
> wrote:
> >>>> I was traveling today but I can do a hangout about this next week.
> >>>>
> >>>> On Thu, Apr 28, 2016 at 7:53 PM, Jacques Nadeau <[email protected]>
> wrote:
> >>>>> Let's do a quick hangout on this. I'd like to better understand as
> I'm not
> >>>>> sure we're all talking about the same thing.
> >>>>>
> >>>>> On Thu, Apr 28, 2016 at 5:30 PM, Micah Kornfield <
> [email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> I'm -1 on making a new primitive type in the memory layout spec [1].
> >>>>>>
> >>>>>> +1 on clarifying [2], to indicate it is expected that the "Values
> >>>>>> array" for Utf8 and Binary types should never contain null elements.
> >>>>>>
> >>>>>> [1] https://github.com/apache/arrow/blob/master/format/Layout.md
> >>>>>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs
> >>>>>>
> >>>>>> On Thu, Apr 28, 2016 at 3:08 PM, Wes McKinney <[email protected]>
> wrote:
> >>>>>> > Bumping this conversation.
> >>>>>> >
> >>>>>> > I'm +0 on making VARBINARY and String (identical VARBINARY but
> with a
> >>>>>> > UTF8 guarantee) primitive types in the spec. Let me know what
> others
> >>>>>> > think.
> >>>>>> >
> >>>>>> > Thanks
> >>>>>> >
> >>>>>> > On Fri, Apr 22, 2016 at 6:30 PM, Wes McKinney <[email protected]>
> wrote:
> >>>>>> >> On Fri, Apr 22, 2016 at 6:06 PM, Jacques Nadeau <
> [email protected]>
> >>>>>> wrote:
> >>>>>> >>> On Fri, Apr 22, 2016 at 2:42 PM, Wes McKinney <[email protected]
> >
> >>>>>> wrote:
> >>>>>> >>>
> >>>>>> >>>> On Fri, Apr 22, 2016 at 4:56 PM, Micah Kornfield <
> >>>>>> [email protected]>
> >>>>>> >>>> wrote:
> >>>>>> >>>> > I like the current scheme of making String (UTF8) a
> primitive type
> >>>>>> in
> >>>>>> >>>> > regards to RPC but not modeling it as a special Array type.
> I think
> >>>>>> >>>> > the key is formally describing how logical types map to
> physical
> >>>>>> types
> >>>>>> >>>> > either is the Flatbuffer schema or in a separate document.
> >>>>>> >>>> >
> >>>>>> >>>> > I think there are two use-cases here:
> >>>>>> >>>> > 1.  Reconstructing Array's off the wire.
> >>>>>> >>>> > 2.  Writing algorithms/builders to deal with specific
> logical types
> >>>>>> >>>> > built on Arrays.
> >>>>>> >>>> >
> >>>>>> >>>> > For case 1, I think it is simpler to not special case string
> types
> >>>>>> as
> >>>>>> >>>> > primitives.  Understanding that a logical String type maps
> to a
> >>>>>> >>>> > List<Utf8> should be sufficient and allows us to re-use the
> >>>>>> >>>> > serialization code for ListArrays for these types.
> >>>>>> >>>> >
> >>>>>> >>>>
> >>>>>> >>>> It is simpler for the IPC serde code-path. I'll let Jacques
> comment
> >>>>>> >>>> but one downside of having strings as a nested type is that
> there are
> >>>>>> >>>> certain code paths (for example: Parquet-related) which deal
> with the
> >>>>>> >>>> flat table case. To make a Parquet analogy, there is the
> special
> >>>>>> >>>> BYTE_ARRAY primitive type, even though you could technically
> represent
> >>>>>> >>>> variable-length binary data using a repeated field and using
> >>>>>> >>>> repetition/definition levels (but the encoding/decoding
> overhead for
> >>>>>> >>>> this in Parquet is much more significant than Arrow). There
> may be
> >>>>>> >>>> other reasons.
> >>>>>> >>>>
> >>>>>> >>>
> >>>>>> >>> I'm a bit confused about what everyone means. I didn't actually
> realize
> >>>>>> >>> that this [1] had been merged yet but I'm generally on board
> with how
> >>>>>> it is
> >>>>>> >>> constructed.
> >>>>>> >>>
> >>>>>> >>> With regards to the c++ implementation of the items at [1],
> abstracting
> >>>>>> >>> shared physical representations out seems fine to me but I
> don't think
> >>>>>> we
> >>>>>> >>> should necessitate effective 3NF for [1].
> >>>>>> >>>
> >>>>>> >>> One of the key points that I'm focused on in the Java space is
> that I'd
> >>>>>> >>> like to move to an always nullable pattern. This is vastly
> simplifying
> >>>>>> from
> >>>>>> >>> a code generation, casting and complexity perspective and is a
> nominal
> >>>>>> cost
> >>>>>> >>> when using column execution. If binary and varchar are
> primitive types
> >>>>>> as
> >>>>>> >>> there there is no weird special casing of avoiding the
> nullability
> >>>>>> bitmap
> >>>>>> >>> in the case of variable width items (for the offsets). But that
> is an
> >>>>>> >>> implementation detail of the Java library.
> >>>>>> >>>
> >>>>>> >>> So in general, I like the scheme at [1] for the concepts that
> we all
> >>>>>> are
> >>>>>> >>> talking about (as opposed to eliminating lines 67 & 68)
> >>>>>> >>>
> >>>>>> >>> [1]
> https://github.com/apache/arrow/blob/master/format/Message.fbs
> >>>>>> >>>
> >>>>>> >>
> >>>>>> >> Well, the issue is that mapping of metadata onto memory layout
> for IPC
> >>>>>> >> purposes, at least. You can use the List code path for arbitrary
> List
> >>>>>> >> types as well as strings and binary. It sounds like either way
> on the
> >>>>>> >> Java side you're going to collapse UTF8 / BINARY into a
> primitive so
> >>>>>> >> that you don't have to manage a separate never-used bitmap for
> the
> >>>>>> >> string/binary data. It seems useful enough to me to have a
> primitive
> >>>>>> >> variable-length binary/UTF8 type but I do not feel strongly
> about it.
> >>>>>> >>
> >>>>>> >>>
> >>>>>> >>>
> >>>>>> >>>> > For case 2, it would be nice to utilize the type system of
> the host
> >>>>>> >>>> > programming language to express the semantics of a function
> call
> >>>>>> (e.g.
> >>>>>> >>>> > ParseString(StringArray strings) vs ParseString(ListArray
> strings),
> >>>>>> >>>> > but I think this can be implemented without requiring a new
> >>>>>> primitive
> >>>>>> >>>> > type in the spec.
> >>>>>> >>>> >
> >>>>>> >>>> > The more interesting thing to me is if we should have a new
> >>>>>> primitive
> >>>>>> >>>> > type for fixed length lists (e.g. the logical type CHAR).
>  The
> >>>>>> >>>> > offsets array isn't necessary in this case for random access.
> >>>>>> >>>> >
> >>>>>> >>>> > Also, the way the VARCHAR types (based on a comment in the
> C++
> >>>>>> >>>> > (
> >>>>>>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L63)
> >>>>>> >>>> > are currently described as a null terminated UTF8 is
> problematic.  I
> >>>>>> >>>> > believe null bytes are valid UTF8 characters.
> >>>>>> >>>>
> >>>>>> >>>>
> >>>>>> >>>> >
> >>>>>> >>>>
> >>>>>> >>>> Good point, sorry about that. We probably would need to
> length-prefix
> >>>>>> >>>> the values, then.
> >>>>>> >>>>
> >>>>>> >>>
> >>>>>> >>>
> >>>>>> >>> Is this an input/output interface? Arrow structures should all
> be 4
> >>>>>> byte
> >>>>>> >>> offset based and be neither length prefixed nor null terminated.
> >>>>>> >>
> >>>>>> >> This was a question around the VARCHAR(k) type (which in many
> >>>>>> >> databases is distinct from a TEXT type in which any value can be
> >>>>>> >> arbitrary length). So if you have a VARCHAR(50), you guarantee
> that no
> >>>>>> >> value exceeds 50 characters. In Arrow I suppose this is just
> metadata
> >>>>>> >> because you have the offsets encoding length (pardon the jet
> lag).
> >>>>>> >> Micah -- I think we can nix the `VarcharType` in the C++ code,
> >>>>>> >> leftovers from my earliest draft implementation.
> >>>>>> >>
> >>>>>> >> - Wes
> >>>>>>
>

Re: Some minor points from ARROW-94 (https://github.com/apache/arrow/pull/58)

Reply via email to