On Fri, Apr 22, 2016 at 2:42 PM, Wes McKinney <w...@cloudera.com> wrote:
> On Fri, Apr 22, 2016 at 4:56 PM, Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> > I like the current scheme of making String (UTF8) a primitive type in
> > regards to RPC but not modeling it as a special Array type. I think
> > the key is formally describing how logical types map to physical types
> > either in the Flatbuffer schema or in a separate document.
> >
> > I think there are two use-cases here:
> > 1. Reconstructing Arrays off the wire.
> > 2. Writing algorithms/builders to deal with specific logical types
> > built on Arrays.
> >
> > For case 1, I think it is simpler to not special case string types as
> > primitives. Understanding that a logical String type maps to a
> > List<Utf8> should be sufficient and allows us to re-use the
> > serialization code for ListArrays for these types.
> >
>
> It is simpler for the IPC serde code-path. I'll let Jacques comment
> but one downside of having strings as a nested type is that there are
> certain code paths (for example: Parquet-related) which deal with the
> flat table case. To make a Parquet analogy, there is the special
> BYTE_ARRAY primitive type, even though you could technically represent
> variable-length binary data using a repeated field and using
> repetition/definition levels (but the encoding/decoding overhead for
> this in Parquet is much more significant than Arrow). There may be
> other reasons.
>

I'm a bit confused about what everyone means. I didn't actually realize
that this [1] had been merged yet but I'm generally on board with how it
is constructed. With regards to the C++ implementation of the items at
[1], abstracting shared physical representations out seems fine to me but
I don't think we should necessitate effective 3NF for [1].

One of the key points that I'm focused on in the Java space is that I'd
like to move to an always nullable pattern. This is vastly simplifying
from a code generation, casting and complexity perspective and is a
nominal cost when using column execution.
If binary and varchar are primitive types, then there is no weird
special casing of avoiding the nullability bitmap in the case of variable
width items (for the offsets). But that is an implementation detail of the
Java library.

So in general, I like the scheme at [1] for the concepts that we are all
talking about (as opposed to eliminating lines 67 & 68).

[1] https://github.com/apache/arrow/blob/master/format/Message.fbs

> > For case 2, it would be nice to utilize the type system of the host
> > programming language to express the semantics of a function call (e.g.
> > ParseString(StringArray strings) vs ParseString(ListArray strings)),
> > but I think this can be implemented without requiring a new primitive
> > type in the spec.
> >
> > The more interesting thing to me is if we should have a new primitive
> > type for fixed length lists (e.g. the logical type CHAR). The
> > offsets array isn't necessary in this case for random access.
> >
> > Also, the way the VARCHAR types (based on a comment in the C++,
> > https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L63)
> > are currently described as a null terminated UTF8 is problematic. I
> > believe null bytes are valid UTF8 characters.
> >
> >
>
> Good point, sorry about that. We probably would need to length-prefix
> the values, then.
>

Is this an input/output interface? Arrow structures should all be 4-byte
offset based and be neither length prefixed nor null terminated.
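
To make the offset-based point concrete, here is a minimal sketch in plain
Python. It is not the Arrow C++ or Java API; the names `build_string_array`
and `get` are hypothetical, and the LSB-first bit order of the validity
bitmap is an assumption made only for this illustration. It packs strings
into a validity bitmap, a buffer of 4-byte offsets, and a contiguous UTF-8
data buffer, with no length prefix and no terminator.

```python
import struct


def build_string_array(values):
    """Pack a list of str-or-None into (validity, offsets, data) buffers.

    Hypothetical helper for illustration only, not an Arrow API.
    """
    # One validity bit per slot; LSB-first bit order is an assumption here.
    validity = bytearray((len(values) + 7) // 8)
    offsets = [0]
    data = bytearray()
    for i, v in enumerate(values):
        if v is not None:
            validity[i // 8] |= 1 << (i % 8)
            data += v.encode("utf-8")
        # A null slot repeats the previous offset, so its length is zero.
        offsets.append(len(data))
    # len(values) + 1 offsets, each a little-endian signed 32-bit integer.
    offsets_buf = struct.pack("<%di" % len(offsets), *offsets)
    return bytes(validity), offsets_buf, bytes(data)


def get(buffers, i):
    """Random access to slot i using only the bitmap and two offsets."""
    validity, offsets_buf, data = buffers
    if not (validity[i // 8] >> (i % 8)) & 1:
        return None
    start, end = struct.unpack_from("<2i", offsets_buf, 4 * i)
    return data[start:end].decode("utf-8")


# The third value contains an embedded NUL byte, which is valid UTF-8.
buffers = build_string_array(["ab", None, "c\x00d", ""])
print(get(buffers, 2) == "c\x00d")
```

Because slot boundaries come from the offsets alone, the embedded NUL
round-trips unchanged, and a null slot and an empty string are told apart
only by the bitmap. For a fixed-length type such as CHAR(n), slot i would
start at byte i * n, which is why the offsets buffer would not be needed
for random access in that case.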