"hello world" makes sense as a good place to start for general IPC integration.
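For concreteness, here is a minimal sketch of what such a "hello world" could look like for a nullable 4-byte integer column, with a plain Python function standing in for the python (or c++) side. The buffer layout (a validity bitmap with least-significant-bit ordering plus little-endian 4-byte values) and all function names (pack_nullable_int32, unpack_nullable_int32, add_one) are assumptions for illustration, not the API of any Arrow library. The caller preallocates the output buffers, so the kernel itself allocates nothing.

```python
import struct


def pack_nullable_int32(values):
    """Encode a list of int/None as (validity_bitmap, data) bytes.

    Assumed layout: bit i of the bitmap (least-significant bit first) is 1
    when slot i is non-null; data holds little-endian 4-byte integers.
    """
    bitmap = bytearray((len(values) + 7) // 8)
    data = bytearray(4 * len(values))
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)
            struct.pack_into("<i", data, 4 * i, v)
    return bytes(bitmap), bytes(data)


def unpack_nullable_int32(bitmap, data, length):
    """Decode (validity_bitmap, data) back into a list of int/None."""
    out = []
    for i in range(length):
        if (bitmap[i // 8] >> (i % 8)) & 1:
            out.append(struct.unpack_from("<i", data, 4 * i)[0])
        else:
            out.append(None)
    return out


def add_one(in_bitmap, in_data, length, out_bitmap, out_data):
    """Hypothetical single-batch kernel: one batch in, one batch out.

    Writes col1 + 1 into buffers preallocated by the caller. Overflow of
    the 4-byte range is not handled in this sketch.
    """
    for i in range(length):
        if (in_bitmap[i // 8] >> (i % 8)) & 1:
            (v,) = struct.unpack_from("<i", in_data, 4 * i)
            struct.pack_into("<i", out_data, 4 * i, v + 1)
            out_bitmap[i // 8] |= 1 << (i % 8)


# "java" side: build col1 and preallocate the output location for newcol.
col1 = [1, None, 3, 4]
bitmap, data = pack_nullable_int32(col1)
out_bitmap = bytearray(len(bitmap))
out_data = bytearray(len(data))

# "python (or c++)" side: fill the preallocated buffers.
add_one(bitmap, data, len(col1), out_bitmap, out_data)

# Back on the "java" side: read newcol.
newcol = unpack_nullable_int32(out_bitmap, out_data, len(col1))
```

Because the output buffers are owned by the caller throughout, no reference counting crosses the language boundary in this sketch.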
I thought there was still some disconnect on how strings were going to be
represented. That was the basis for my suggestion above. But the integer
use-case bypasses these concerns for now.

On Wed, May 25, 2016 at 2:09 PM, Jacques Nadeau <jacq...@apache.org> wrote:
> By usecase, I really meant "hello world"
>
> On Wed, May 25, 2016 at 2:09 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>>
>> Let's start by creating a simple usecase. For example, I would start with
>> a nullable 4 byte integer, maybe, and use the example of
>> java > (col1) > python (or c++) > (newcol) > java. That is what I'd call
>> a single batch algorithm (e.g. one batch of values in, one out).
>>
>> A simple way to sidestep the memory management/reference counting issues
>> initially is for java to preallocate the output location for newcol for the
>> python (or c++) code.
>>
>> On Wed, May 25, 2016 at 1:25 PM, Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>
>>> Just to follow up on this. I got distracted on a few other items on
>>> the C++ implementation side, but my next task is to get String types
>>> working for the C++ IPC unit test. Once I send a PR for that, it
>>> might help clarify the concerns on both sides and we can hammer out
>>> the details from there.
>>>
>>> Sound reasonable?
>>>
>>> -Micah
>>>
>>> On Fri, May 13, 2016 at 10:33 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>>> > Nudging this issue. We need to sketch out a plan to get IPC
>>> > integration tests working between the Java and C++ implementations --
>>> > what's the most expedient way we can work toward making that happen?
>>> >
>>> > On Sun, May 1, 2016 at 1:02 AM, Micah Kornfield <emkornfi...@gmail.com> wrote:
>>> >> s/spark/slack/g
>>> >>
>>> >> On Sun, May 1, 2016 at 12:58 AM, Micah Kornfield <emkornfi...@gmail.com> wrote:
>>> >>> I'm not exactly sure of my availability. If I am available on spark, I
>>> >>> can likely make the hangout.
>>> >>>
>>> >>> On Fri, Apr 29, 2016 at 4:40 PM, Wes McKinney <w...@cloudera.com> wrote:
>>> >>>> I was traveling today but I can do a hangout about this next week.
>>> >>>>
>>> >>>> On Thu, Apr 28, 2016 at 7:53 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>>> >>>>> Let's do a quick hangout on this. I'd like to better understand as
>>> >>>>> I'm not sure we're all talking about the same thing.
>>> >>>>>
>>> >>>>> On Thu, Apr 28, 2016 at 5:30 PM, Micah Kornfield <emkornfi...@gmail.com> wrote:
>>> >>>>>
>>> >>>>>> I'm -1 on making a new primitive type in the memory layout spec [1].
>>> >>>>>>
>>> >>>>>> +1 on clarifying [2], to indicate it is expected that the "Values
>>> >>>>>> array" for Utf8 and Binary types should never contain null elements.
>>> >>>>>>
>>> >>>>>> [1] https://github.com/apache/arrow/blob/master/format/Layout.md
>>> >>>>>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs
>>> >>>>>>
>>> >>>>>> On Thu, Apr 28, 2016 at 3:08 PM, Wes McKinney <w...@cloudera.com> wrote:
>>> >>>>>> > Bumping this conversation.
>>> >>>>>> >
>>> >>>>>> > I'm +0 on making VARBINARY and String (identical VARBINARY but
>>> >>>>>> > with a UTF8 guarantee) primitive types in the spec. Let me know
>>> >>>>>> > what others think.
>>> >>>>>> >
>>> >>>>>> > Thanks
>>> >>>>>> >
>>> >>>>>> > On Fri, Apr 22, 2016 at 6:30 PM, Wes McKinney <w...@cloudera.com> wrote:
>>> >>>>>> >> On Fri, Apr 22, 2016 at 6:06 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>>> >>>>>> >>> On Fri, Apr 22, 2016 at 2:42 PM, Wes McKinney <w...@cloudera.com> wrote:
>>> >>>>>> >>>
>>> >>>>>> >>>> On Fri, Apr 22, 2016 at 4:56 PM, Micah Kornfield <emkornfi...@gmail.com> wrote:
>>> >>>>>> >>>> > I like the current scheme of making String (UTF8) a primitive type in
>>> >>>>>> >>>> > regards to RPC but not modeling it as a special Array type. I think
>>> >>>>>> >>>> > the key is formally describing how logical types map to physical types
>>> >>>>>> >>>> > either in the Flatbuffer schema or in a separate document.
>>> >>>>>> >>>> >
>>> >>>>>> >>>> > I think there are two use-cases here:
>>> >>>>>> >>>> > 1. Reconstructing Arrays off the wire.
>>> >>>>>> >>>> > 2. Writing algorithms/builders to deal with specific logical types
>>> >>>>>> >>>> > built on Arrays.
>>> >>>>>> >>>> >
>>> >>>>>> >>>> > For case 1, I think it is simpler to not special case string types as
>>> >>>>>> >>>> > primitives. Understanding that a logical String type maps to a
>>> >>>>>> >>>> > List<Utf8> should be sufficient and allows us to re-use the
>>> >>>>>> >>>> > serialization code for ListArrays for these types.
>>> >>>>>> >>>> >
>>> >>>>>> >>>>
>>> >>>>>> >>>> It is simpler for the IPC serde code-path. I'll let Jacques comment
>>> >>>>>> >>>> but one downside of having strings as a nested type is that there are
>>> >>>>>> >>>> certain code paths (for example: Parquet-related) which deal with the
>>> >>>>>> >>>> flat table case.
>>> >>>>>> >>>> To make a Parquet analogy, there is the special
>>> >>>>>> >>>> BYTE_ARRAY primitive type, even though you could technically represent
>>> >>>>>> >>>> variable-length binary data using a repeated field and using
>>> >>>>>> >>>> repetition/definition levels (but the encoding/decoding overhead for
>>> >>>>>> >>>> this in Parquet is much more significant than Arrow). There may be
>>> >>>>>> >>>> other reasons.
>>> >>>>>> >>>>
>>> >>>>>> >>>
>>> >>>>>> >>> I'm a bit confused about what everyone means. I didn't actually realize
>>> >>>>>> >>> that this [1] had been merged yet but I'm generally on board with how
>>> >>>>>> >>> it is constructed.
>>> >>>>>> >>>
>>> >>>>>> >>> With regards to the c++ implementation of the items at [1], abstracting
>>> >>>>>> >>> shared physical representations out seems fine to me but I don't think
>>> >>>>>> >>> we should necessitate effective 3NF for [1].
>>> >>>>>> >>>
>>> >>>>>> >>> One of the key points that I'm focused on in the Java space is that I'd
>>> >>>>>> >>> like to move to an always nullable pattern. This is vastly simplifying
>>> >>>>>> >>> from a code generation, casting and complexity perspective and is a
>>> >>>>>> >>> nominal cost when using column execution. If binary and varchar are
>>> >>>>>> >>> primitive types as there there is no weird special casing of avoiding
>>> >>>>>> >>> the nullability bitmap in the case of variable width items (for the
>>> >>>>>> >>> offsets). But that is an implementation detail of the Java library.
>>> >>>>>> >>>
>>> >>>>>> >>> So in general, I like the scheme at [1] for the concepts that we all
>>> >>>>>> >>> are talking about (as opposed to eliminating lines 67 & 68)
>>> >>>>>> >>>
>>> >>>>>> >>> [1] https://github.com/apache/arrow/blob/master/format/Message.fbs
>>> >>>>>> >>>
>>> >>>>>> >>
>>> >>>>>> >> Well, the issue is the mapping of metadata onto memory layout, for IPC
>>> >>>>>> >> purposes at least. You can use the List code path for arbitrary List
>>> >>>>>> >> types as well as strings and binary. It sounds like either way on the
>>> >>>>>> >> Java side you're going to collapse UTF8 / BINARY into a primitive so
>>> >>>>>> >> that you don't have to manage a separate never-used bitmap for the
>>> >>>>>> >> string/binary data. It seems useful enough to me to have a primitive
>>> >>>>>> >> variable-length binary/UTF8 type but I do not feel strongly about it.
>>> >>>>>> >>
>>> >>>>>> >>>
>>> >>>>>> >>>
>>> >>>>>> >>>> > For case 2, it would be nice to utilize the type system of the host
>>> >>>>>> >>>> > programming language to express the semantics of a function call
>>> >>>>>> >>>> > (e.g. ParseString(StringArray strings) vs ParseString(ListArray
>>> >>>>>> >>>> > strings)), but I think this can be implemented without requiring a
>>> >>>>>> >>>> > new primitive type in the spec.
>>> >>>>>> >>>> >
>>> >>>>>> >>>> > The more interesting thing to me is if we should have a new primitive
>>> >>>>>> >>>> > type for fixed length lists (e.g. the logical type CHAR). The
>>> >>>>>> >>>> > offsets array isn't necessary in this case for random access.
>>> >>>>>> >>>> >
>>> >>>>>> >>>> > Also, the way the VARCHAR types (based on a comment in the C++ code,
>>> >>>>>> >>>> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L63)
>>> >>>>>> >>>> > are currently described as null terminated UTF8 is problematic. I
>>> >>>>>> >>>> > believe null bytes are valid UTF8 characters.
>>> >>>>>> >>>> >
>>> >>>>>> >>>>
>>> >>>>>> >>>> Good point, sorry about that. We probably would need to length-prefix
>>> >>>>>> >>>> the values, then.
>>> >>>>>> >>>>
>>> >>>>>> >>>
>>> >>>>>> >>> Is this an input/output interface? Arrow structures should all be 4
>>> >>>>>> >>> byte offset based and be neither length prefixed nor null terminated.
>>> >>>>>> >>
>>> >>>>>> >> This was a question around the VARCHAR(k) type (which in many
>>> >>>>>> >> databases is distinct from a TEXT type in which any value can be
>>> >>>>>> >> arbitrary length). So if you have a VARCHAR(50), you guarantee that no
>>> >>>>>> >> value exceeds 50 characters. In Arrow I suppose this is just metadata
>>> >>>>>> >> because you have the offsets encoding length (pardon the jet lag).
>>> >>>>>> >> Micah -- I think we can nix the `VarcharType` in the C++ code,
>>> >>>>>> >> leftovers from my earliest draft implementation.
>>> >>>>>> >>
>>> >>>>>> >> - Wes
>>> >>>>>> >>
>>
>
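
To illustrate the point made in the quoted thread that strings can be "4 byte offset based" and "neither length prefixed nor null terminated", here is a rough sketch of an offsets-plus-values encoding. It assumes little-endian 4-byte offsets and omits the validity bitmap entirely; the function names (encode_utf8_column, decode_utf8_column, fits_varchar) are hypothetical and are not taken from the Java or C++ code. It also shows why null termination would be a problem (a value may contain a zero byte) and how a VARCHAR(k) bound can be treated as a metadata check over data that is stored the same way regardless of k.

```python
import struct


def encode_utf8_column(strings):
    """Encode a list of str as (offsets_buf, values) bytes.

    offsets_buf holds len(strings) + 1 little-endian 4-byte integers;
    string i occupies values[offsets[i]:offsets[i + 1]].
    """
    values = bytearray()
    offsets = [0]
    for s in strings:
        values += s.encode("utf-8")
        offsets.append(len(values))
    return struct.pack("<%di" % len(offsets), *offsets), bytes(values)


def decode_utf8_column(offsets_buf, values):
    """Rebuild the list of str from the two buffers."""
    n = len(offsets_buf) // 4
    offsets = struct.unpack("<%di" % n, offsets_buf)
    return [
        values[offsets[i]:offsets[i + 1]].decode("utf-8")
        for i in range(n - 1)
    ]


def fits_varchar(offsets_buf, values, k):
    """Check a VARCHAR(k)-style bound (at most k characters per value).

    The bound is pure metadata here: the buffers are identical whether
    or not a limit is declared.
    """
    return all(len(s) <= k for s in decode_utf8_column(offsets_buf, values))


# The second value contains an embedded zero byte, which a
# null-terminated representation could not round-trip.
strings = ["hello", "a\x00b", "", "héllo"]
offsets_buf, values = encode_utf8_column(strings)
roundtrip = decode_utf8_column(offsets_buf, values)
```

Random access to value i needs only two offset reads, so no per-value length prefix has to be scanned.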