Just to follow-up on this. I got distracted on a few other items on the C++ implementation side, but my next task is to get a String types working for the C++ IPC unit test. Once I send a PR for that, it might help clarify the concerns on both sides and we can hammer out the details from there.
Sound reasonable? -Micah On Fri, May 13, 2016 at 10:33 AM, Wes McKinney <wesmck...@gmail.com> wrote: > Nudging this issue. We need to sketch out a plan to get IPC > integration tests working between the Java and C++ implementations -- > what's the most expedient way we can work toward making that happen? > > On Sun, May 1, 2016 at 1:02 AM, Micah Kornfield <emkornfi...@gmail.com> wrote: >> s/spark/slack/g >> >> On Sun, May 1, 2016 at 12:58 AM, Micah Kornfield <emkornfi...@gmail.com> >> wrote: >>> I'm not exactly sure of my availability if I am available on spark, I >>> can likely make the hangout. >>> >>> On Fri, Apr 29, 2016 at 4:40 PM, Wes McKinney <w...@cloudera.com> wrote: >>>> I was traveling today but I can do a hangout about this next week. >>>> >>>> On Thu, Apr 28, 2016 at 7:53 PM, Jacques Nadeau <jacq...@apache.org> wrote: >>>>> Let's do a quick hangout on this. I'd like to better understand as I'm not >>>>> sure we're all talking about the same thing. >>>>> >>>>> On Thu, Apr 28, 2016 at 5:30 PM, Micah Kornfield <emkornfi...@gmail.com> >>>>> wrote: >>>>> >>>>>> I'm -1 on making a new primitive type in the memory layout spec [1]. >>>>>> >>>>>> +1 on clarifying [2], to indicate it is expected that the "Values >>>>>> array" for Utf8 and Binary types should never contain null elements. >>>>>> >>>>>> [1] https://github.com/apache/arrow/blob/master/format/Layout.md >>>>>> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs >>>>>> >>>>>> On Thu, Apr 28, 2016 at 3:08 PM, Wes McKinney <w...@cloudera.com> wrote: >>>>>> > Bumping this conversation. >>>>>> > >>>>>> > I'm +0 on making VARBINARY and String (identical VARBINARY but with a >>>>>> > UTF8 guarantee) primitive types in the spec. Let me know what others >>>>>> > think. >>>>>> > >>>>>> > Thanks >>>>>> > >>>>>> > On Fri, Apr 22, 2016 at 6:30 PM, Wes McKinney <w...@cloudera.com> >>>>>> > wrote: >>>>>> >> On Fri, Apr 22, 2016 at 6:06 PM, Jacques Nadeau <jacq...@apache.org> >>>>>> wrote: >>>>>> >>> On Fri, Apr 22, 2016 at 2:42 PM, Wes McKinney <w...@cloudera.com> >>>>>> wrote: >>>>>> >>> >>>>>> >>>> On Fri, Apr 22, 2016 at 4:56 PM, Micah Kornfield < >>>>>> emkornfi...@gmail.com> >>>>>> >>>> wrote: >>>>>> >>>> > I like the current scheme of making String (UTF8) a primitive type >>>>>> in >>>>>> >>>> > regards to RPC but not modeling it as a special Array type. I >>>>>> >>>> > think >>>>>> >>>> > the key is formally describing how logical types map to physical >>>>>> types >>>>>> >>>> > either is the Flatbuffer schema or in a separate document. >>>>>> >>>> > >>>>>> >>>> > I think there are two use-cases here: >>>>>> >>>> > 1. Reconstructing Array's off the wire. >>>>>> >>>> > 2. Writing algorithms/builders to deal with specific logical >>>>>> >>>> > types >>>>>> >>>> > built on Arrays. >>>>>> >>>> > >>>>>> >>>> > For case 1, I think it is simpler to not special case string types >>>>>> as >>>>>> >>>> > primitives. Understanding that a logical String type maps to a >>>>>> >>>> > List<Utf8> should be sufficient and allows us to re-use the >>>>>> >>>> > serialization code for ListArrays for these types. >>>>>> >>>> > >>>>>> >>>> >>>>>> >>>> It is simpler for the IPC serde code-path. I'll let Jacques comment >>>>>> >>>> but one downside of having strings as a nested type is that there >>>>>> >>>> are >>>>>> >>>> certain code paths (for example: Parquet-related) which deal with >>>>>> >>>> the >>>>>> >>>> flat table case. To make a Parquet analogy, there is the special >>>>>> >>>> BYTE_ARRAY primitive type, even though you could technically >>>>>> >>>> represent >>>>>> >>>> variable-length binary data using a repeated field and using >>>>>> >>>> repetition/definition levels (but the encoding/decoding overhead for >>>>>> >>>> this in Parquet is much more significant than Arrow). There may be >>>>>> >>>> other reasons. >>>>>> >>>> >>>>>> >>> >>>>>> >>> I'm a bit confused about what everyone means. I didn't actually >>>>>> >>> realize >>>>>> >>> that this [1] had been merged yet but I'm generally on board with how >>>>>> it is >>>>>> >>> constructed. >>>>>> >>> >>>>>> >>> With regards to the c++ implementation of the items at [1], >>>>>> >>> abstracting >>>>>> >>> shared physical representations out seems fine to me but I don't >>>>>> >>> think >>>>>> we >>>>>> >>> should necessitate effective 3NF for [1]. >>>>>> >>> >>>>>> >>> One of the key points that I'm focused on in the Java space is that >>>>>> >>> I'd >>>>>> >>> like to move to an always nullable pattern. This is vastly >>>>>> >>> simplifying >>>>>> from >>>>>> >>> a code generation, casting and complexity perspective and is a >>>>>> >>> nominal >>>>>> cost >>>>>> >>> when using column execution. If binary and varchar are primitive >>>>>> >>> types >>>>>> as >>>>>> >>> there there is no weird special casing of avoiding the nullability >>>>>> bitmap >>>>>> >>> in the case of variable width items (for the offsets). But that is an >>>>>> >>> implementation detail of the Java library. >>>>>> >>> >>>>>> >>> So in general, I like the scheme at [1] for the concepts that we all >>>>>> are >>>>>> >>> talking about (as opposed to eliminating lines 67 & 68) >>>>>> >>> >>>>>> >>> [1] https://github.com/apache/arrow/blob/master/format/Message.fbs >>>>>> >>> >>>>>> >> >>>>>> >> Well, the issue is that mapping of metadata onto memory layout for IPC >>>>>> >> purposes, at least. You can use the List code path for arbitrary List >>>>>> >> types as well as strings and binary. It sounds like either way on the >>>>>> >> Java side you're going to collapse UTF8 / BINARY into a primitive so >>>>>> >> that you don't have to manage a separate never-used bitmap for the >>>>>> >> string/binary data. It seems useful enough to me to have a primitive >>>>>> >> variable-length binary/UTF8 type but I do not feel strongly about it. >>>>>> >> >>>>>> >>> >>>>>> >>> >>>>>> >>>> > For case 2, it would be nice to utilize the type system of the >>>>>> >>>> > host >>>>>> >>>> > programming language to express the semantics of a function call >>>>>> (e.g. >>>>>> >>>> > ParseString(StringArray strings) vs ParseString(ListArray >>>>>> >>>> > strings), >>>>>> >>>> > but I think this can be implemented without requiring a new >>>>>> primitive >>>>>> >>>> > type in the spec. >>>>>> >>>> > >>>>>> >>>> > The more interesting thing to me is if we should have a new >>>>>> primitive >>>>>> >>>> > type for fixed length lists (e.g. the logical type CHAR). The >>>>>> >>>> > offsets array isn't necessary in this case for random access. >>>>>> >>>> > >>>>>> >>>> > Also, the way the VARCHAR types (based on a comment in the C++ >>>>>> >>>> > ( >>>>>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L63) >>>>>> >>>> > are currently described as a null terminated UTF8 is problematic. >>>>>> >>>> > I >>>>>> >>>> > believe null bytes are valid UTF8 characters. >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> > >>>>>> >>>> >>>>>> >>>> Good point, sorry about that. We probably would need to >>>>>> >>>> length-prefix >>>>>> >>>> the values, then. >>>>>> >>>> >>>>>> >>> >>>>>> >>> >>>>>> >>> Is this an input/output interface? Arrow structures should all be 4 >>>>>> byte >>>>>> >>> offset based and be neither length prefixed nor null terminated. >>>>>> >> >>>>>> >> This was a question around the VARCHAR(k) type (which in many >>>>>> >> databases is distinct from a TEXT type in which any value can be >>>>>> >> arbitrary length). So if you have a VARCHAR(50), you guarantee that no >>>>>> >> value exceeds 50 characters. In Arrow I suppose this is just metadata >>>>>> >> because you have the offsets encoding length (pardon the jet lag). >>>>>> >> Micah -- I think we can nix the `VarcharType` in the C++ code, >>>>>> >> leftovers from my earliest draft implementation. >>>>>> >> >>>>>> >> - Wes >>>>>>