On Fri, Apr 22, 2016 at 4:56 PM, Micah Kornfield <emkornfi...@gmail.com> wrote:
> I like the current scheme of making String (UTF8) a primitive type in
> regards to RPC but not modeling it as a special Array type. I think
> the key is formally describing how logical types map to physical types
> either in the Flatbuffer schema or in a separate document.
>
> I think there are two use-cases here:
> 1. Reconstructing Arrays off the wire.
> 2. Writing algorithms/builders to deal with specific logical types
> built on Arrays.
>
> For case 1, I think it is simpler to not special case string types as
> primitives. Understanding that a logical String type maps to a
> List<Utf8> should be sufficient and allows us to re-use the
> serialization code for ListArrays for these types.
>
It is simpler for the IPC serde code-path. I'll let Jacques comment, but
one downside of having strings as a nested type is that there are
certain code paths (for example, Parquet-related ones) which deal with
the flat table case. To make a Parquet analogy, there is the special
BYTE_ARRAY primitive type, even though you could technically represent
variable-length binary data using a repeated field and
repetition/definition levels (but the encoding/decoding overhead for
this in Parquet is much more significant than in Arrow). There may be
other reasons.

> For case 2, it would be nice to utilize the type system of the host
> programming language to express the semantics of a function call (e.g.
> ParseString(StringArray strings) vs ParseString(ListArray strings)),
> but I think this can be implemented without requiring a new primitive
> type in the spec.
>
> The more interesting thing to me is if we should have a new primitive
> type for fixed length lists (e.g. the logical type CHAR). The
> offsets array isn't necessary in this case for random access.
>
> Also, the way the VARCHAR types (based on a comment in the C++ code,
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L63)
> are currently described as a null terminated UTF8 is problematic. I
> believe null bytes are valid UTF8 characters.
>

Good point, sorry about that. We probably would need to length-prefix
the values, then.
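
To make the null-byte concern concrete, here is a minimal Python sketch.
It is not Arrow code: the function names and the 4-byte little-endian
length prefix are assumptions chosen only for illustration. It shows that
a string containing U+0000 is valid UTF-8, that a naive null-terminated
decoder splits it into two values, and that length-prefixing round-trips
it unchanged.

```python
import struct


def encode_length_prefixed(values):
    """Hypothetical encoding: 4-byte little-endian length, then UTF-8 bytes."""
    out = bytearray()
    for s in values:
        data = s.encode("utf-8")
        out += struct.pack("<I", len(data))
        out += data
    return bytes(out)


def decode_length_prefixed(buf):
    """Inverse of encode_length_prefixed."""
    values = []
    pos = 0
    while pos < len(buf):
        (n,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        values.append(buf[pos:pos + n].decode("utf-8"))
        pos += n
    return values


def decode_null_terminated(buf):
    """Naive decoder that treats every NUL byte as a terminator."""
    # The final split element is the empty tail after the last terminator.
    return [chunk.decode("utf-8") for chunk in buf.split(b"\x00")[:-1]]


# The first value contains an embedded U+0000, which encodes to a single
# 0x00 byte and is valid UTF-8.
values = ["ab\x00cd", "xyz"]

null_terminated = b"".join(s.encode("utf-8") + b"\x00" for s in values)

# Null termination cannot tell the embedded NUL from a terminator,
# so two values come back as three.
print(decode_null_terminated(null_terminated))

# Length-prefixing preserves the embedded NUL.
print(decode_length_prefixed(encode_length_prefixed(values)) == values)
```

An offsets array, as used for List types, avoids the same ambiguity
because value boundaries are stored outside the data bytes.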