Re: Some minor points from ARROW-94 (https://github.com/apache/arrow/pull/58)

Jacques Nadeau Wed, 25 May 2016 15:00:17 -0700

By usecase, I really meant "hello world"

On Wed, May 25, 2016 at 2:09 PM, Jacques Nadeau <jacq...@apache.org> wrote:


> Let's start by creating a simple usecase. For example, I would start with
> nullable 4 byte integer, maybe and use the example of java > (col1) >
> python (or c++) > (newcol) > java that is one what I'd call a single batch
> algorithm (e.g. one batch of values in, one out).
>
> A simple way to sidestep the memory management/reference counting issues
> initially is for java to preallocate the output location for newcol for the
> python (or c++) code.
>
> On Wed, May 25, 2016 at 1:25 PM, Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Just to follow-up on this.  I got distracted on a few other items on
>> the C++ implementation side, but my next task is to get a String types
>> working for the C++ IPC unit test.   Once I send a PR for that, it
>> might help clarify the concerns on both sides and we can hammer out
>> the details from there.
>>
>> Sound reasonable?
>>
>> -Micah
>>
>> On Fri, May 13, 2016 at 10:33 AM, Wes McKinney <wesmck...@gmail.com>
>> wrote:
>> > Nudging this issue. We need to sketch out a plan to get IPC
>> > integration tests working between the Java and C++ implementations --
>> > what's the most expedient way we can work toward making that happen?
>> >
>> > On Sun, May 1, 2016 at 1:02 AM, Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>> >> s/spark/slack/g
>> >>
>> >> On Sun, May 1, 2016 at 12:58 AM, Micah Kornfield <
>> emkornfi...@gmail.com> wrote:
>> >>> I'm not exactly sure of my availability if I am available on spark, I
>> >>> can likely make the hangout.
>> >>>
>> >>> On Fri, Apr 29, 2016 at 4:40 PM, Wes McKinney <w...@cloudera.com>
>> wrote:
>> >>>> I was traveling today but I can do a hangout about this next week.
>> >>>>
>> >>>> On Thu, Apr 28, 2016 at 7:53 PM, Jacques Nadeau <jacq...@apache.org>
>> wrote:
>> >>>>> Let's do a quick hangout on this. I'd like to better understand as
>> I'm not
>> >>>>> sure we're all talking about the same thing.
>> >>>>>
>> >>>>> On Thu, Apr 28, 2016 at 5:30 PM, Micah Kornfield <
>> emkornfi...@gmail.com>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> I'm -1 on making a new primitive type in the memory layout spec
>> [1].
>> >>>>>>
>> >>>>>> +1 on clarifying [2], to indicate it is expected that the "Values
>> >>>>>> array" for Utf8 and Binary types should never contain null
>> elements.
>> >>>>>>
>> >>>>>> [1] https://github.com/apache/arrow/blob/master/format/Layout.md
>> >>>>>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs
>> >>>>>>
>> >>>>>> On Thu, Apr 28, 2016 at 3:08 PM, Wes McKinney <w...@cloudera.com>
>> wrote:
>> >>>>>> > Bumping this conversation.
>> >>>>>> >
>> >>>>>> > I'm +0 on making VARBINARY and String (identical VARBINARY but
>> with a
>> >>>>>> > UTF8 guarantee) primitive types in the spec. Let me know what
>> others
>> >>>>>> > think.
>> >>>>>> >
>> >>>>>> > Thanks
>> >>>>>> >
>> >>>>>> > On Fri, Apr 22, 2016 at 6:30 PM, Wes McKinney <w...@cloudera.com>
>> wrote:
>> >>>>>> >> On Fri, Apr 22, 2016 at 6:06 PM, Jacques Nadeau <
>> jacq...@apache.org>
>> >>>>>> wrote:
>> >>>>>> >>> On Fri, Apr 22, 2016 at 2:42 PM, Wes McKinney <
>> w...@cloudera.com>
>> >>>>>> wrote:
>> >>>>>> >>>
>> >>>>>> >>>> On Fri, Apr 22, 2016 at 4:56 PM, Micah Kornfield <
>> >>>>>> emkornfi...@gmail.com>
>> >>>>>> >>>> wrote:
>> >>>>>> >>>> > I like the current scheme of making String (UTF8) a
>> primitive type
>> >>>>>> in
>> >>>>>> >>>> > regards to RPC but not modeling it as a special Array
>> type.  I think
>> >>>>>> >>>> > the key is formally describing how logical types map to
>> physical
>> >>>>>> types
>> >>>>>> >>>> > either is the Flatbuffer schema or in a separate document.
>> >>>>>> >>>> >
>> >>>>>> >>>> > I think there are two use-cases here:
>> >>>>>> >>>> > 1.  Reconstructing Array's off the wire.
>> >>>>>> >>>> > 2.  Writing algorithms/builders to deal with specific
>> logical types
>> >>>>>> >>>> > built on Arrays.
>> >>>>>> >>>> >
>> >>>>>> >>>> > For case 1, I think it is simpler to not special case
>> string types
>> >>>>>> as
>> >>>>>> >>>> > primitives.  Understanding that a logical String type maps
>> to a
>> >>>>>> >>>> > List<Utf8> should be sufficient and allows us to re-use the
>> >>>>>> >>>> > serialization code for ListArrays for these types.
>> >>>>>> >>>> >
>> >>>>>> >>>>
>> >>>>>> >>>> It is simpler for the IPC serde code-path. I'll let Jacques
>> comment
>> >>>>>> >>>> but one downside of having strings as a nested type is that
>> there are
>> >>>>>> >>>> certain code paths (for example: Parquet-related) which deal
>> with the
>> >>>>>> >>>> flat table case. To make a Parquet analogy, there is the
>> special
>> >>>>>> >>>> BYTE_ARRAY primitive type, even though you could technically
>> represent
>> >>>>>> >>>> variable-length binary data using a repeated field and using
>> >>>>>> >>>> repetition/definition levels (but the encoding/decoding
>> overhead for
>> >>>>>> >>>> this in Parquet is much more significant than Arrow). There
>> may be
>> >>>>>> >>>> other reasons.
>> >>>>>> >>>>
>> >>>>>> >>>
>> >>>>>> >>> I'm a bit confused about what everyone means. I didn't
>> actually realize
>> >>>>>> >>> that this [1] had been merged yet but I'm generally on board
>> with how
>> >>>>>> it is
>> >>>>>> >>> constructed.
>> >>>>>> >>>
>> >>>>>> >>> With regards to the c++ implementation of the items at [1],
>> abstracting
>> >>>>>> >>> shared physical representations out seems fine to me but I
>> don't think
>> >>>>>> we
>> >>>>>> >>> should necessitate effective 3NF for [1].
>> >>>>>> >>>
>> >>>>>> >>> One of the key points that I'm focused on in the Java space is
>> that I'd
>> >>>>>> >>> like to move to an always nullable pattern. This is vastly
>> simplifying
>> >>>>>> from
>> >>>>>> >>> a code generation, casting and complexity perspective and is a
>> nominal
>> >>>>>> cost
>> >>>>>> >>> when using column execution. If binary and varchar are
>> primitive types
>> >>>>>> as
>> >>>>>> >>> there there is no weird special casing of avoiding the
>> nullability
>> >>>>>> bitmap
>> >>>>>> >>> in the case of variable width items (for the offsets). But
>> that is an
>> >>>>>> >>> implementation detail of the Java library.
>> >>>>>> >>>
>> >>>>>> >>> So in general, I like the scheme at [1] for the concepts that
>> we all
>> >>>>>> are
>> >>>>>> >>> talking about (as opposed to eliminating lines 67 & 68)
>> >>>>>> >>>
>> >>>>>> >>> [1]
>> https://github.com/apache/arrow/blob/master/format/Message.fbs
>> >>>>>> >>>
>> >>>>>> >>
>> >>>>>> >> Well, the issue is that mapping of metadata onto memory layout
>> for IPC
>> >>>>>> >> purposes, at least. You can use the List code path for
>> arbitrary List
>> >>>>>> >> types as well as strings and binary. It sounds like either way
>> on the
>> >>>>>> >> Java side you're going to collapse UTF8 / BINARY into a
>> primitive so
>> >>>>>> >> that you don't have to manage a separate never-used bitmap for
>> the
>> >>>>>> >> string/binary data. It seems useful enough to me to have a
>> primitive
>> >>>>>> >> variable-length binary/UTF8 type but I do not feel strongly
>> about it.
>> >>>>>> >>
>> >>>>>> >>>
>> >>>>>> >>>
>> >>>>>> >>>> > For case 2, it would be nice to utilize the type system of
>> the host
>> >>>>>> >>>> > programming language to express the semantics of a function
>> call
>> >>>>>> (e.g.
>> >>>>>> >>>> > ParseString(StringArray strings) vs ParseString(ListArray
>> strings),
>> >>>>>> >>>> > but I think this can be implemented without requiring a new
>> >>>>>> primitive
>> >>>>>> >>>> > type in the spec.
>> >>>>>> >>>> >
>> >>>>>> >>>> > The more interesting thing to me is if we should have a new
>> >>>>>> primitive
>> >>>>>> >>>> > type for fixed length lists (e.g. the logical type CHAR).
>>  The
>> >>>>>> >>>> > offsets array isn't necessary in this case for random
>> access.
>> >>>>>> >>>> >
>> >>>>>> >>>> > Also, the way the VARCHAR types (based on a comment in the
>> C++
>> >>>>>> >>>> > (
>> >>>>>>
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L63)
>> >>>>>> >>>> > are currently described as a null terminated UTF8 is
>> problematic.  I
>> >>>>>> >>>> > believe null bytes are valid UTF8 characters.
>> >>>>>> >>>>
>> >>>>>> >>>>
>> >>>>>> >>>> >
>> >>>>>> >>>>
>> >>>>>> >>>> Good point, sorry about that. We probably would need to
>> length-prefix
>> >>>>>> >>>> the values, then.
>> >>>>>> >>>>
>> >>>>>> >>>
>> >>>>>> >>>
>> >>>>>> >>> Is this an input/output interface? Arrow structures should all
>> be 4
>> >>>>>> byte
>> >>>>>> >>> offset based and be neither length prefixed nor null
>> terminated.
>> >>>>>> >>
>> >>>>>> >> This was a question around the VARCHAR(k) type (which in many
>> >>>>>> >> databases is distinct from a TEXT type in which any value can be
>> >>>>>> >> arbitrary length). So if you have a VARCHAR(50), you guarantee
>> that no
>> >>>>>> >> value exceeds 50 characters. In Arrow I suppose this is just
>> metadata
>> >>>>>> >> because you have the offsets encoding length (pardon the jet
>> lag).
>> >>>>>> >> Micah -- I think we can nix the `VarcharType` in the C++ code,
>> >>>>>> >> leftovers from my earliest draft implementation.
>> >>>>>> >>
>> >>>>>> >> - Wes
>> >>>>>>
>>
>
>

Re: Some minor points from ARROW-94 (https://github.com/apache/arrow/pull/58)

Reply via email to