There are three distinct issues here:

1) Physical memory representation
2) Metadata
3) Implementation details

On these:

1) I think no one will argue that String/Binary have the same memory
representation as List<uint8 [not-null]>, and that, regardless of the
implementation, you can perform a zero-copy cast without copying or
duplicating buffers, only changing the array container metadata.

2) I'm +1 on String/Binary being logically first-class primitive types,
with the intent that they are not considered logically nested types (but
you can perform the cast described in #1 if you want to get nested data
without copying).

3) The C++ code sharing / duplication issue feels slightly orthogonal to
the above two items, which are about user semantics and metadata.
Effectively what would change is that
std::dynamic_pointer_cast<ListArray>(string_data) would no longer be
valid, as in the class hierarchy we would have

- Primitive
  - Integer
  - Floating
  - String
  - ...
- List
- Struct
- Union

rather than the present

- List
  - String (with the type metadata always set to List<uint8 [not-null]>)

From a coding point of view, I should think we would eventually want
explicit casts that do not presume a certain C++ inheritance hierarchy,
since relying on the hierarchy might cause downstream code brittleness.
Hard to predict this precisely at this moment.

- Wes

On Wed, Jul 13, 2016 at 10:28 PM, Micah Kornfield <emkornfi...@gmail.com> wrote:
> Today String and Binary types are represented in memory as list<byte> [1]
> and we use logical types to distinguish between a list of bytes and string
> type [2].
>
> The question of whether this is sufficient or if we should make a first
> class string/binary type has come up tangentially on a few threads and we
> should try to come to a conclusion if we want to add it as part of a
> spec. I think the current proposal is that the String type would consist
> of a null-bitmap buffer, an offset buffer and a buffer containing bytes (for
> strings the bytes would be UTF-8 encoded strings).
> The main difference
> with the list representation is, individual bytes cannot be marked as null
> because there isn't a nested Array.
>
> To quote Jacques for the pros of this approach:
>
> My main argument is that the most basic types most people need come in
> this order from my experience:
>
> Int
> String
> Float
> Decimal
> Binary
>
> Note that I'm not focused on width here, just generally "what people use".
> So I think a string comes second in priority and ease of
> use/approachability necessitate this as a first class concept. This is
> beyond the fact that it has specialized rules that are separate from a
> List<Byte>.
>
> The main argument for not doing this is it adds additional types that need
> to be implemented and can lead to some amount of redundant code. For
> instance, in the current C++ implementation we are able to have a String
> Array that extends a List Type and re-use already defined equality methods
> [3].
>
> What do people think?
>
> Thanks,
> Micah
>
> [1] https://github.com/apache/arrow/blob/master/format/Layout.md
> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs
> [3] https://github.com/apache/arrow/blob/master/cpp/src/arrow/types/string.h#L68
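
To make the zero-copy cast from point 1 and the "explicit cast" idea from
point 3 concrete, here is a minimal sketch. It is not the Arrow C++ API:
StringArray, ByteListArray, AsByteList and MakeExample are hypothetical,
simplified stand-ins, and the least-significant-bit-first bitmap order is
an assumption of the sketch. It only shows that a string array made of a
null bitmap, an offsets buffer and a bytes buffer can be viewed as
List<uint8 [not-null]> by sharing the same buffers, without the two
classes being related by inheritance.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, simplified types for illustration; not the Arrow classes.
using Buffer = std::vector<uint8_t>;
using Offsets = std::vector<int32_t>;

// First-class string array: null bitmap + offsets + UTF-8 bytes.
struct StringArray {
  int64_t length;
  std::shared_ptr<const Buffer> null_bitmap;  // 1 bit per slot, set = valid
  std::shared_ptr<const Offsets> offsets;     // length + 1 entries
  std::shared_ptr<const Buffer> bytes;        // concatenated string data

  bool IsNull(int64_t i) const {
    return ((*null_bitmap)[i / 8] & (1u << (i % 8))) == 0;
  }

  std::string Value(int64_t i) const {
    assert(i >= 0 && i < length);
    const int32_t start = (*offsets)[i];
    const int32_t stop = (*offsets)[i + 1];
    return std::string(bytes->begin() + start, bytes->begin() + stop);
  }
};

// Nested view: List<uint8 [not-null]>. The child values have no null
// bitmap of their own, so individual bytes cannot be null.
struct ByteListArray {
  int64_t length;
  std::shared_ptr<const Buffer> null_bitmap;
  std::shared_ptr<const Offsets> offsets;
  std::shared_ptr<const Buffer> values;  // child uint8 values

  int32_t ValueLength(int64_t i) const {
    return (*offsets)[i + 1] - (*offsets)[i];
  }
};

// Explicit cast that does not rely on an inheritance relationship:
// only the container changes, every buffer is shared rather than copied.
ByteListArray AsByteList(const StringArray& s) {
  return ByteListArray{s.length, s.null_bitmap, s.offsets, s.bytes};
}

// Builds the three-slot array ["ab", null, "cde"].
StringArray MakeExample() {
  auto bitmap = std::make_shared<Buffer>(Buffer{0x05});  // bits 0 and 2 set
  auto offsets = std::make_shared<Offsets>(Offsets{0, 2, 2, 5});
  auto bytes = std::make_shared<Buffer>(Buffer{'a', 'b', 'c', 'd', 'e'});
  return StringArray{3, bitmap, offsets, bytes};
}
```

With this shape, downstream code that wants nested data calls AsByteList
explicitly instead of using std::dynamic_pointer_cast, so it keeps working
whether or not String sits under List in the class hierarchy.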