Re: Some minor points from ARROW-94 (https://github.com/apache/arrow/pull/58)

Micah Kornfield Wed, 25 May 2016 15:55:20 -0700

"hello world" makes sense as a good place to start for general IPC integration.


I thought there was still some disconnect on how strings were going to
be represented.  That was the basis for my suggestion above.  But the
integer use-case bypasses these concerns for now.
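
A minimal sketch of the integer "hello world" quoted below, to make the
single-batch shape concrete. It is illustrative only and is not the Java or C++
implementation: the function names (make_int32_column, add_one, to_list) are
hypothetical, and it assumes a validity bitmap with least-significant-bit
ordering where a set bit means "not null", plus little-endian 4-byte values.
The caller preallocates the output buffers, which is the sidestep for memory
management suggested in the thread.

```python
import struct


def make_int32_column(values):
    """Pack ints/None into (validity bitmap, little-endian int32 data)."""
    n = len(values)
    bitmap = bytearray((n + 7) // 8)   # one bit per slot, assumed 1 = valid
    data = bytearray(4 * n)            # 4 bytes per slot, null slots stay zero
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)
            struct.pack_into("<i", data, 4 * i, v)
    return bitmap, data


def is_valid(bitmap, i):
    return bool(bitmap[i // 8] & (1 << (i % 8)))


def add_one(n, in_bitmap, in_data, out_bitmap, out_data):
    """Single-batch kernel: one batch in, one out, no allocation here.

    The output buffers are preallocated (and zeroed) by the caller.
    Overflow past the int32 range is not handled in this sketch.
    """
    for i in range(n):
        if is_valid(in_bitmap, i):
            (v,) = struct.unpack_from("<i", in_data, 4 * i)
            struct.pack_into("<i", out_data, 4 * i, v + 1)
            out_bitmap[i // 8] |= 1 << (i % 8)


def to_list(n, bitmap, data):
    """Decode a column back into Python ints/None for inspection."""
    return [
        struct.unpack_from("<i", data, 4 * i)[0] if is_valid(bitmap, i) else None
        for i in range(n)
    ]


col1 = [1, None, 3, 40]
in_bitmap, in_data = make_int32_column(col1)

# The "java" side preallocates newcol; the "python (or c++)" side fills it.
out_bitmap = bytearray(len(in_bitmap))
out_data = bytearray(len(in_data))
add_one(len(col1), in_bitmap, in_data, out_bitmap, out_data)

newcol = to_list(len(col1), out_bitmap, out_data)
```

Because the kernel only writes into buffers it was handed, neither side needs
to agree on ownership or reference counting for this first test.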

On Wed, May 25, 2016 at 2:09 PM, Jacques Nadeau <[email protected]> wrote:
> By usecase, I really meant "hello world"
>
> On Wed, May 25, 2016 at 2:09 PM, Jacques Nadeau <[email protected]> wrote:
>>
>> Let's start by creating a simple usecase. For example, I would start with a
>> nullable 4 byte integer, and maybe use the example of java > (col1) > python
>> (or c++) > (newcol) > java, which is what I'd call a single batch
>> algorithm (e.g. one batch of values in, one out).
>>
>> A simple way to sidestep the memory management/reference counting issues
>> initially is for java to preallocate the output location for newcol for the
>> python (or c++) code.
>>
>> On Wed, May 25, 2016 at 1:25 PM, Micah Kornfield <[email protected]>
>> wrote:
>>>
>>> Just to follow-up on this.  I got distracted on a few other items on
>>> the C++ implementation side, but my next task is to get String types
>>> working for the C++ IPC unit test.   Once I send a PR for that, it
>>> might help clarify the concerns on both sides and we can hammer out
>>> the details from there.
>>>
>>> Sound reasonable?
>>>
>>> -Micah
>>>
>>> On Fri, May 13, 2016 at 10:33 AM, Wes McKinney <[email protected]>
>>> wrote:
>>> > Nudging this issue. We need to sketch out a plan to get IPC
>>> > integration tests working between the Java and C++ implementations --
>>> > what's the most expedient way we can work toward making that happen?
>>> >
>>> > On Sun, May 1, 2016 at 1:02 AM, Micah Kornfield <[email protected]>
>>> > wrote:
>>> >> s/spark/slack/g
>>> >>
>>> >> On Sun, May 1, 2016 at 12:58 AM, Micah Kornfield
>>> >> <[email protected]> wrote:
>>> >>> I'm not exactly sure of my availability. If I am available on spark, I
>>> >>> can likely make the hangout.
>>> >>>
>>> >>> On Fri, Apr 29, 2016 at 4:40 PM, Wes McKinney <[email protected]>
>>> >>> wrote:
>>> >>>> I was traveling today but I can do a hangout about this next week.
>>> >>>>
>>> >>>> On Thu, Apr 28, 2016 at 7:53 PM, Jacques Nadeau <[email protected]>
>>> >>>> wrote:
>>> >>>>> Let's do a quick hangout on this. I'd like to better understand as
>>> >>>>> I'm not sure we're all talking about the same thing.
>>> >>>>>
>>> >>>>> On Thu, Apr 28, 2016 at 5:30 PM, Micah Kornfield
>>> >>>>> <[email protected]>
>>> >>>>> wrote:
>>> >>>>>
>>> >>>>>> I'm -1 on making a new primitive type in the memory layout spec
>>> >>>>>> [1].
>>> >>>>>>
>>> >>>>>> +1 on clarifying [2], to indicate it is expected that the "Values
>>> >>>>>> array" for Utf8 and Binary types should never contain null
>>> >>>>>> elements.
>>> >>>>>>
>>> >>>>>> [1] https://github.com/apache/arrow/blob/master/format/Layout.md
>>> >>>>>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs
>>> >>>>>>
>>> >>>>>> On Thu, Apr 28, 2016 at 3:08 PM, Wes McKinney <[email protected]>
>>> >>>>>> wrote:
>>> >>>>>> > Bumping this conversation.
>>> >>>>>> >
>>> >>>>>> > I'm +0 on making VARBINARY and String (identical to VARBINARY but
>>> >>>>>> > with a UTF8 guarantee) primitive types in the spec. Let me know
>>> >>>>>> > what others think.
>>> >>>>>> >
>>> >>>>>> > Thanks
>>> >>>>>> >
>>> >>>>>> > On Fri, Apr 22, 2016 at 6:30 PM, Wes McKinney <[email protected]>
>>> >>>>>> > wrote:
>>> >>>>>> >> On Fri, Apr 22, 2016 at 6:06 PM, Jacques Nadeau <[email protected]> wrote:
>>> >>>>>> >>> On Fri, Apr 22, 2016 at 2:42 PM, Wes McKinney <[email protected]> wrote:
>>> >>>>>> >>>
>>> >>>>>> >>>> On Fri, Apr 22, 2016 at 4:56 PM, Micah Kornfield <[email protected]> wrote:
>>> >>>>>> >>>> > I like the current scheme of making String (UTF8) a primitive type
>>> >>>>>> >>>> > in regards to RPC but not modeling it as a special Array type. I
>>> >>>>>> >>>> > think the key is formally describing how logical types map to
>>> >>>>>> >>>> > physical types, either in the Flatbuffer schema or in a separate
>>> >>>>>> >>>> > document.
>>> >>>>>> >>>> >
>>> >>>>>> >>>> > I think there are two use-cases here:
>>> >>>>>> >>>> > 1.  Reconstructing Arrays off the wire.
>>> >>>>>> >>>> > 2.  Writing algorithms/builders to deal with specific logical types
>>> >>>>>> >>>> > built on Arrays.
>>> >>>>>> >>>> >
>>> >>>>>> >>>> > For case 1, I think it is simpler to not special case string
>>> >>>>>> >>>> > types as primitives.  Understanding that a logical String type
>>> >>>>>> >>>> > maps to a List<Utf8> should be sufficient and allows us to re-use
>>> >>>>>> >>>> > the serialization code for ListArrays for these types.
>>> >>>>>> >>>> >
>>> >>>>>> >>>>
>>> >>>>>> >>>> It is simpler for the IPC serde code-path. I'll let Jacques
>>> >>>>>> >>>> comment
>>> >>>>>> >>>> but one downside of having strings as a nested type is that
>>> >>>>>> >>>> there are
>>> >>>>>> >>>> certain code paths (for example: Parquet-related) which deal
>>> >>>>>> >>>> with the
>>> >>>>>> >>>> flat table case. To make a Parquet analogy, there is the
>>> >>>>>> >>>> special
>>> >>>>>> >>>> BYTE_ARRAY primitive type, even though you could technically
>>> >>>>>> >>>> represent
>>> >>>>>> >>>> variable-length binary data using a repeated field and using
>>> >>>>>> >>>> repetition/definition levels (but the encoding/decoding
>>> >>>>>> >>>> overhead for
>>> >>>>>> >>>> this in Parquet is much more significant than Arrow). There
>>> >>>>>> >>>> may be
>>> >>>>>> >>>> other reasons.
>>> >>>>>> >>>>
>>> >>>>>> >>>
>>> >>>>>> >>> I'm a bit confused about what everyone means. I didn't actually
>>> >>>>>> >>> realize that this [1] had been merged yet but I'm generally on
>>> >>>>>> >>> board with how it is constructed.
>>> >>>>>> >>>
>>> >>>>>> >>> With regards to the c++ implementation of the items at [1],
>>> >>>>>> >>> abstracting shared physical representations out seems fine to me
>>> >>>>>> >>> but I don't think we should necessitate effective 3NF for [1].
>>> >>>>>> >>>
>>> >>>>>> >>> One of the key points that I'm focused on in the Java space is
>>> >>>>>> >>> that I'd like to move to an always nullable pattern. This is a
>>> >>>>>> >>> vast simplification from a code generation, casting and
>>> >>>>>> >>> complexity perspective and is a nominal cost when using column
>>> >>>>>> >>> execution. If binary and varchar are primitive types, then
>>> >>>>>> >>> there is no weird special casing of avoiding the nullability
>>> >>>>>> >>> bitmap in the case of variable width items (for the offsets).
>>> >>>>>> >>> But that is an implementation detail of the Java library.
>>> >>>>>> >>>
>>> >>>>>> >>> So in general, I like the scheme at [1] for the concepts that we
>>> >>>>>> >>> all are talking about (as opposed to eliminating lines 67 & 68)
>>> >>>>>> >>>
>>> >>>>>> >>> [1]
>>> >>>>>> >>> https://github.com/apache/arrow/blob/master/format/Message.fbs
>>> >>>>>> >>>
>>> >>>>>> >>
>>> >>>>>> >> Well, the issue is the mapping of metadata onto memory layout,
>>> >>>>>> >> for IPC purposes at least. You can use the List code path for
>>> >>>>>> >> arbitrary List
>>> >>>>>> >> types as well as strings and binary. It sounds like either way
>>> >>>>>> >> on the
>>> >>>>>> >> Java side you're going to collapse UTF8 / BINARY into a
>>> >>>>>> >> primitive so
>>> >>>>>> >> that you don't have to manage a separate never-used bitmap for
>>> >>>>>> >> the
>>> >>>>>> >> string/binary data. It seems useful enough to me to have a
>>> >>>>>> >> primitive
>>> >>>>>> >> variable-length binary/UTF8 type but I do not feel strongly
>>> >>>>>> >> about it.
>>> >>>>>> >>
>>> >>>>>> >>>
>>> >>>>>> >>>
>>> >>>>>> >>>> > For case 2, it would be nice to utilize the type system of the
>>> >>>>>> >>>> > host programming language to express the semantics of a function
>>> >>>>>> >>>> > call (e.g. ParseString(StringArray strings) vs
>>> >>>>>> >>>> > ParseString(ListArray strings)), but I think this can be
>>> >>>>>> >>>> > implemented without requiring a new primitive type in the spec.
>>> >>>>>> >>>> >
>>> >>>>>> >>>> > The more interesting thing to me is if we should have a new
>>> >>>>>> >>>> > primitive type for fixed length lists (e.g. the logical type
>>> >>>>>> >>>> > CHAR). The offsets array isn't necessary in this case for random
>>> >>>>>> >>>> > access.
>>> >>>>>> >>>> >
>>> >>>>>> >>>> > Also, the way the VARCHAR types are currently described (based on
>>> >>>>>> >>>> > a comment in the C++ code,
>>> >>>>>> >>>> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L63)
>>> >>>>>> >>>> > as null terminated UTF8 is problematic.  I believe null bytes are
>>> >>>>>> >>>> > valid UTF8 characters.
>>> >>>>>> >>>>
>>> >>>>>> >>>>
>>> >>>>>> >>>> >
>>> >>>>>> >>>>
>>> >>>>>> >>>> Good point, sorry about that. We probably would need to
>>> >>>>>> >>>> length-prefix
>>> >>>>>> >>>> the values, then.
>>> >>>>>> >>>>
>>> >>>>>> >>>
>>> >>>>>> >>>
>>> >>>>>> >>> Is this an input/output interface? Arrow structures should all
>>> >>>>>> >>> be 4 byte offset based and be neither length prefixed nor null
>>> >>>>>> >>> terminated.
>>> >>>>>> >>
>>> >>>>>> >> This was a question around the VARCHAR(k) type (which in many
>>> >>>>>> >> databases is distinct from a TEXT type in which any value can
>>> >>>>>> >> be
>>> >>>>>> >> arbitrary length). So if you have a VARCHAR(50), you guarantee
>>> >>>>>> >> that no
>>> >>>>>> >> value exceeds 50 characters. In Arrow I suppose this is just
>>> >>>>>> >> metadata
>>> >>>>>> >> because you have the offsets encoding length (pardon the jet
>>> >>>>>> >> lag).
>>> >>>>>> >> Micah -- I think we can nix the `VarcharType` in the C++ code,
>>> >>>>>> >> leftovers from my earliest draft implementation.
>>> >>>>>> >>
>>> >>>>>> >> - Wes
>>> >>>>>>
>>
>>
>
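
For reference, the string representation debated in the thread above (4-byte
offsets into one contiguous values buffer, with no length prefix and no null
terminator) can be sketched as follows. This is a hedged illustration, not the
Arrow C++ or Java code: encode_utf8_column, get_string and byte_lengths are
hypothetical helper names, offsets are assumed to be little-endian signed
32-bit integers, and the validity bitmap for null strings is left out to keep
it short.

```python
import struct


def encode_utf8_column(strings):
    """Return (offsets buffer, values buffer) for a list of str.

    offsets has len(strings) + 1 entries; string i occupies
    values[offsets[i]:offsets[i + 1]].
    """
    offsets = [0]
    values = bytearray()
    for s in strings:
        values += s.encode("utf-8")
        offsets.append(len(values))
    return struct.pack("<%di" % len(offsets), *offsets), bytes(values)


def get_string(offsets_buf, values_buf, i):
    """Random access to element i using two adjacent offsets."""
    start, end = struct.unpack_from("<2i", offsets_buf, 4 * i)
    return values_buf[start:end].decode("utf-8")


def byte_lengths(offsets_buf, n):
    """Lengths in bytes (not characters), derived from the offsets alone."""
    offs = struct.unpack("<%di" % (n + 1), offsets_buf)
    return [b - a for a, b in zip(offs, offs[1:])]


# "a\x00b" contains a NUL byte, which is valid UTF-8 and would be cut short
# by a null-terminated encoding; the offsets layout is unaffected.
strings = ["hello", "", "a\x00b", "wörld"]
offsets_buf, values_buf = encode_utf8_column(strings)

decoded = [get_string(offsets_buf, values_buf, i) for i in range(len(strings))]
lengths = byte_lengths(offsets_buf, len(strings))
```

Since every length is recoverable from the offsets, a bound such as
VARCHAR(50) only needs to live in metadata; note that the lengths above are
byte counts, so a character-count limit would need the values decoded first.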
