By usecase, I really meant "hello world"

On Wed, May 25, 2016 at 2:09 PM, Jacques Nadeau <jacq...@apache.org> wrote:
> Let's start by creating a simple usecase. For example, I would start with
> nullable 4 byte integer, maybe and use the example of
> java -> (col1) -> python (or c++) -> (newcol) -> java
> that is one what I'd call a single batch
> algorithm (e.g. one batch of values in, one out).
>
> A simple way to sidestep the memory management/reference counting issues
> initially is for java to preallocate the output location for newcol for the
> python (or c++) code.
>
> On Wed, May 25, 2016 at 1:25 PM, Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Just to follow-up on this. I got distracted on a few other items on
>> the C++ implementation side, but my next task is to get String types
>> working for the C++ IPC unit test. Once I send a PR for that, it
>> might help clarify the concerns on both sides and we can hammer out
>> the details from there.
>>
>> Sound reasonable?
>>
>> -Micah
>>
>> On Fri, May 13, 2016 at 10:33 AM, Wes McKinney <wesmck...@gmail.com>
>> wrote:
>> > Nudging this issue. We need to sketch out a plan to get IPC
>> > integration tests working between the Java and C++ implementations --
>> > what's the most expedient way we can work toward making that happen?
>> >
>> > On Sun, May 1, 2016 at 1:02 AM, Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >> s/spark/slack/g
>> >>
>> >> On Sun, May 1, 2016 at 12:58 AM, Micah Kornfield <emkornfi...@gmail.com> wrote:
>> >>> I'm not exactly sure of my availability if I am available on spark, I
>> >>> can likely make the hangout.
>> >>>
>> >>> On Fri, Apr 29, 2016 at 4:40 PM, Wes McKinney <w...@cloudera.com> wrote:
>> >>>> I was traveling today but I can do a hangout about this next week.
>> >>>>
>> >>>> On Thu, Apr 28, 2016 at 7:53 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>> >>>>> Let's do a quick hangout on this. I'd like to better understand as I'm not
>> >>>>> sure we're all talking about the same thing.
>> >>>>>
>> >>>>> On Thu, Apr 28, 2016 at 5:30 PM, Micah Kornfield <emkornfi...@gmail.com>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> I'm -1 on making a new primitive type in the memory layout spec [1].
>> >>>>>>
>> >>>>>> +1 on clarifying [2], to indicate it is expected that the "Values
>> >>>>>> array" for Utf8 and Binary types should never contain null elements.
>> >>>>>>
>> >>>>>> [1] https://github.com/apache/arrow/blob/master/format/Layout.md
>> >>>>>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs
>> >>>>>>
>> >>>>>> On Thu, Apr 28, 2016 at 3:08 PM, Wes McKinney <w...@cloudera.com> wrote:
>> >>>>>> > Bumping this conversation.
>> >>>>>> >
>> >>>>>> > I'm +0 on making VARBINARY and String (identical to VARBINARY but with a
>> >>>>>> > UTF8 guarantee) primitive types in the spec. Let me know what others
>> >>>>>> > think.
>> >>>>>> >
>> >>>>>> > Thanks
>> >>>>>> >
>> >>>>>> > On Fri, Apr 22, 2016 at 6:30 PM, Wes McKinney <w...@cloudera.com> wrote:
>> >>>>>> >> On Fri, Apr 22, 2016 at 6:06 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>> >>>>>> >>> On Fri, Apr 22, 2016 at 2:42 PM, Wes McKinney <w...@cloudera.com> wrote:
>> >>>>>> >>>
>> >>>>>> >>>> On Fri, Apr 22, 2016 at 4:56 PM, Micah Kornfield <emkornfi...@gmail.com>
>> >>>>>> >>>> wrote:
>> >>>>>> >>>> > I like the current scheme of making String (UTF8) a primitive type in
>> >>>>>> >>>> > regards to RPC but not modeling it as a special Array type. I think
>> >>>>>> >>>> > the key is formally describing how logical types map to physical types
>> >>>>>> >>>> > either in the Flatbuffer schema or in a separate document.
>> >>>>>> >>>> >
>> >>>>>> >>>> > I think there are two use-cases here:
>> >>>>>> >>>> > 1. Reconstructing Arrays off the wire.
>> >>>>>> >>>> > 2. Writing algorithms/builders to deal with specific logical types
>> >>>>>> >>>> > built on Arrays.
>> >>>>>> >>>> >
>> >>>>>> >>>> > For case 1, I think it is simpler to not special case string types as
>> >>>>>> >>>> > primitives. Understanding that a logical String type maps to a
>> >>>>>> >>>> > List<Utf8> should be sufficient and allows us to re-use the
>> >>>>>> >>>> > serialization code for ListArrays for these types.
>> >>>>>> >>>> >
>> >>>>>> >>>>
>> >>>>>> >>>> It is simpler for the IPC serde code-path. I'll let Jacques comment
>> >>>>>> >>>> but one downside of having strings as a nested type is that there are
>> >>>>>> >>>> certain code paths (for example: Parquet-related) which deal with the
>> >>>>>> >>>> flat table case. To make a Parquet analogy, there is the special
>> >>>>>> >>>> BYTE_ARRAY primitive type, even though you could technically represent
>> >>>>>> >>>> variable-length binary data using a repeated field and using
>> >>>>>> >>>> repetition/definition levels (but the encoding/decoding overhead for
>> >>>>>> >>>> this in Parquet is much more significant than Arrow). There may be
>> >>>>>> >>>> other reasons.
>> >>>>>> >>>>
>> >>>>>> >>>
>> >>>>>> >>> I'm a bit confused about what everyone means. I didn't actually realize
>> >>>>>> >>> that this [1] had been merged yet but I'm generally on board with how it is
>> >>>>>> >>> constructed.
>> >>>>>> >>>
>> >>>>>> >>> With regards to the c++ implementation of the items at [1], abstracting
>> >>>>>> >>> shared physical representations out seems fine to me but I don't think we
>> >>>>>> >>> should necessitate effective 3NF for [1].
>> >>>>>> >>>
>> >>>>>> >>> One of the key points that I'm focused on in the Java space is that I'd
>> >>>>>> >>> like to move to an always nullable pattern. This is vastly simplifying from
>> >>>>>> >>> a code generation, casting and complexity perspective and is a nominal cost
>> >>>>>> >>> when using column execution.
>> >>>>>> >>> If binary and varchar are primitive types then
>> >>>>>> >>> there is no weird special casing of avoiding the nullability bitmap
>> >>>>>> >>> in the case of variable width items (for the offsets). But that is an
>> >>>>>> >>> implementation detail of the Java library.
>> >>>>>> >>>
>> >>>>>> >>> So in general, I like the scheme at [1] for the concepts that we all are
>> >>>>>> >>> talking about (as opposed to eliminating lines 67 & 68)
>> >>>>>> >>>
>> >>>>>> >>> [1] https://github.com/apache/arrow/blob/master/format/Message.fbs
>> >>>>>> >>>
>> >>>>>> >>
>> >>>>>> >> Well, the issue is the mapping of metadata onto memory layout for IPC
>> >>>>>> >> purposes, at least. You can use the List code path for arbitrary List
>> >>>>>> >> types as well as strings and binary. It sounds like either way on the
>> >>>>>> >> Java side you're going to collapse UTF8 / BINARY into a primitive so
>> >>>>>> >> that you don't have to manage a separate never-used bitmap for the
>> >>>>>> >> string/binary data. It seems useful enough to me to have a primitive
>> >>>>>> >> variable-length binary/UTF8 type but I do not feel strongly about it.
>> >>>>>> >>
>> >>>>>> >>>
>> >>>>>> >>>
>> >>>>>> >>>> > For case 2, it would be nice to utilize the type system of the host
>> >>>>>> >>>> > programming language to express the semantics of a function call (e.g.
>> >>>>>> >>>> > ParseString(StringArray strings) vs ParseString(ListArray strings)),
>> >>>>>> >>>> > but I think this can be implemented without requiring a new primitive
>> >>>>>> >>>> > type in the spec.
>> >>>>>> >>>> >
>> >>>>>> >>>> > The more interesting thing to me is if we should have a new primitive
>> >>>>>> >>>> > type for fixed length lists (e.g. the logical type CHAR). The
>> >>>>>> >>>> > offsets array isn't necessary in this case for random access.
>> >>>>>> >>>> >
>> >>>>>> >>>> > Also, the way the VARCHAR types (based on a comment in the C++
>> >>>>>> >>>> > (https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L63))
>> >>>>>> >>>> > are currently described as a null terminated UTF8 is problematic. I
>> >>>>>> >>>> > believe null bytes are valid UTF8 characters.
>> >>>>>> >>>>
>> >>>>>> >>>>
>> >>>>>> >>>> >
>> >>>>>> >>>>
>> >>>>>> >>>> Good point, sorry about that. We probably would need to length-prefix
>> >>>>>> >>>> the values, then.
>> >>>>>> >>>>
>> >>>>>> >>>
>> >>>>>> >>>
>> >>>>>> >>> Is this an input/output interface? Arrow structures should all be 4 byte
>> >>>>>> >>> offset based and be neither length prefixed nor null terminated.
>> >>>>>> >>
>> >>>>>> >> This was a question around the VARCHAR(k) type (which in many
>> >>>>>> >> databases is distinct from a TEXT type in which any value can be
>> >>>>>> >> arbitrary length). So if you have a VARCHAR(50), you guarantee that no
>> >>>>>> >> value exceeds 50 characters. In Arrow I suppose this is just metadata
>> >>>>>> >> because you have the offsets encoding length (pardon the jet lag).
>> >>>>>> >> Micah -- I think we can nix the `VarcharType` in the C++ code,
>> >>>>>> >> leftovers from my earliest draft implementation.
>> >>>>>> >>
>> >>>>>> >> - Wes
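
For readers following the layout discussion above, here is a minimal sketch of a nullable variable-length UTF8 column that is 4 byte offset based and neither length prefixed nor null terminated. It is plain Python, not the Arrow Java or C++ API. The helper names `encode_utf8_column` and `get_value` are hypothetical, and the little-endian offsets and least-significant-bit validity numbering are assumptions made for the illustration. It also shows why null termination would not work: a NUL byte can legitimately appear inside a UTF8 value.

```python
import struct


def encode_utf8_column(values):
    """Pack a list of str-or-None into (validity bitmap, int32 offsets, data bytes).

    Hypothetical helper for illustration only; not part of any Arrow library.
    """
    validity = bytearray((len(values) + 7) // 8)
    offsets = [0]
    data = bytearray()
    for i, value in enumerate(values):
        if value is not None:
            # Assumption: bit i of the bitmap is set when slot i is non-null,
            # using least-significant-bit numbering within each byte.
            validity[i // 8] |= 1 << (i % 8)
            data += value.encode("utf-8")
        # A null slot contributes no bytes, so it repeats the previous offset.
        offsets.append(len(data))
    # len(values) + 1 signed 4 byte offsets (little-endian assumed here).
    offsets_buf = struct.pack("<%di" % len(offsets), *offsets)
    return bytes(validity), offsets_buf, bytes(data)


def get_value(validity, offsets_buf, data, i):
    """Random access to slot i using only the bitmap and two adjacent offsets."""
    if not (validity[i // 8] >> (i % 8)) & 1:
        return None
    start, end = struct.unpack_from("<2i", offsets_buf, 4 * i)
    return data[start:end].decode("utf-8")


# "a\x00b" contains an embedded NUL, which is a valid UTF8 code point.
validity, offsets_buf, data = encode_utf8_column(["joe", None, "a\x00b", ""])
third = get_value(validity, offsets_buf, data, 2)
```

The length of each value is implied by the difference of two adjacent offsets, which is why a bound such as VARCHAR(50) can live purely in metadata.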