Bumping this conversation. I'm +0 on making VARBINARY and String (identical to VARBINARY but with a UTF8 guarantee) primitive types in the spec. Let me know what others think.
Thanks

On Fri, Apr 22, 2016 at 6:30 PM, Wes McKinney <w...@cloudera.com> wrote:
> On Fri, Apr 22, 2016 at 6:06 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>> On Fri, Apr 22, 2016 at 2:42 PM, Wes McKinney <w...@cloudera.com> wrote:
>>
>>> On Fri, Apr 22, 2016 at 4:56 PM, Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>> > I like the current scheme of making String (UTF8) a primitive type in
>>> > regards to RPC but not modeling it as a special Array type. I think
>>> > the key is formally describing how logical types map to physical types,
>>> > either in the Flatbuffer schema or in a separate document.
>>> >
>>> > I think there are two use cases here:
>>> > 1. Reconstructing Arrays off the wire.
>>> > 2. Writing algorithms/builders to deal with specific logical types
>>> > built on Arrays.
>>> >
>>> > For case 1, I think it is simpler to not special-case string types as
>>> > primitives. Understanding that a logical String type maps to a
>>> > List<Utf8> should be sufficient and allows us to re-use the
>>> > serialization code for ListArrays for these types.
>>> >
>>>
>>> It is simpler for the IPC serde code path. I'll let Jacques comment,
>>> but one downside of having strings as a nested type is that there are
>>> certain code paths (for example: Parquet-related) which deal with the
>>> flat table case. To make a Parquet analogy, there is the special
>>> BYTE_ARRAY primitive type, even though you could technically represent
>>> variable-length binary data using a repeated field and
>>> repetition/definition levels (but the encoding/decoding overhead for
>>> this in Parquet is much more significant than in Arrow). There may be
>>> other reasons.
>>>
>>
>> I'm a bit confused about what everyone means. I didn't actually realize
>> that this [1] had been merged yet, but I'm generally on board with how it
>> is constructed.
>>
>> With regards to the C++ implementation of the items at [1], abstracting
>> shared physical representations out seems fine to me, but I don't think we
>> should necessitate effective 3NF for [1].
>>
>> One of the key points that I'm focused on in the Java space is that I'd
>> like to move to an always-nullable pattern. This is vastly simplifying from
>> a code generation, casting and complexity perspective, and is a nominal
>> cost when using columnar execution. If binary and varchar are primitive
>> types, then there is no weird special casing of avoiding the nullability
>> bitmap in the case of variable-width items (for the offsets). But that is
>> an implementation detail of the Java library.
>>
>> So in general, I like the scheme at [1] for the concepts that we all are
>> talking about (as opposed to eliminating lines 67 & 68).
>>
>> [1] https://github.com/apache/arrow/blob/master/format/Message.fbs
>>
>
> Well, the issue is the mapping of metadata onto memory layout for IPC
> purposes, at least. You can use the List code path for arbitrary List
> types as well as strings and binary. It sounds like either way on the
> Java side you're going to collapse UTF8 / BINARY into a primitive so
> that you don't have to manage a separate never-used bitmap for the
> string/binary data. It seems useful enough to me to have a primitive
> variable-length binary/UTF8 type, but I do not feel strongly about it.
>
>>
>>
>>> > For case 2, it would be nice to utilize the type system of the host
>>> > programming language to express the semantics of a function call (e.g.
>>> > ParseString(StringArray strings) vs ParseString(ListArray strings)),
>>> > but I think this can be implemented without requiring a new primitive
>>> > type in the spec.
>>> >
>>> > The more interesting thing to me is whether we should have a new
>>> > primitive type for fixed-length lists (e.g. the logical type CHAR). The
>>> > offsets array isn't necessary in this case for random access.
>>> >
>>> > Also, the way the VARCHAR types (based on a comment in the C++ code:
>>> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L63)
>>> > are currently described as null-terminated UTF8 is problematic. I
>>> > believe null bytes are valid UTF8 characters.
>>>
>>> >
>>>
>>> Good point, sorry about that. We probably would need to length-prefix
>>> the values, then.
>>>
>>
>>
>> Is this an input/output interface? Arrow structures should all be 4-byte
>> offset based and be neither length-prefixed nor null-terminated.
>
> This was a question around the VARCHAR(k) type (which in many
> databases is distinct from a TEXT type, in which any value can be
> arbitrary length). So if you have a VARCHAR(50), you guarantee that no
> value exceeds 50 characters. In Arrow I suppose this is just metadata,
> because you have the offsets encoding length (pardon the jet lag).
> Micah -- I think we can nix the `VarcharType` in the C++ code,
> leftovers from my earliest draft implementation.
>
> - Wes
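For reference, here is a minimal C++ sketch of the offsets-plus-values layout the thread is describing for a variable-length binary/UTF8 column. The `StringColumn` name and its members are illustrative only, not Arrow's actual classes or API.

    #include <cstdint>
    #include <string>
    #include <vector>

    // Illustrative only -- not Arrow's real types. A variable-length
    // binary/UTF8 column as discussed above: a concatenated values byte
    // buffer plus (n + 1) four-byte offsets, so value i occupies bytes
    // [offsets[i], offsets[i + 1]). There are no length prefixes and no
    // null terminators, so embedded 0x00 bytes in a value are fine.
    struct StringColumn {
      std::vector<uint8_t> validity;  // one entry per slot here for brevity;
                                      // a real bitmap would pack bits
      std::vector<int32_t> offsets;   // length = num_values + 1
      std::vector<uint8_t> values;    // concatenated UTF8 bytes

      std::string Get(int32_t i) const {
        return std::string(values.begin() + offsets[i],
                           values.begin() + offsets[i + 1]);
      }
    };

    // Physically this is the same memory as List<UInt8>; treating it as a
    // primitive mostly changes the metadata, not the layout.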
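And a companion sketch of the fixed-width case raised for CHAR-like types, again with hypothetical names: when every element has the same width, the offsets buffer is redundant because value i always lives at bytes [i * width, (i + 1) * width).

    #include <cstdint>
    #include <string>
    #include <vector>

    // Illustrative fixed-width column, e.g. a logical CHAR(k): no offsets
    // buffer is needed for random access.
    struct FixedCharColumn {
      int32_t width;                // the k in CHAR(k)
      std::vector<uint8_t> values;  // num_values * width bytes

      std::string Get(int32_t i) const {
        return std::string(values.begin() + int64_t(i) * width,
                           values.begin() + int64_t(i + 1) * width);
      }
    };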