I'm +1 on what Wes said. (and +1 on what I said... jk :)
I'll actually be offline most of next week but would like to continue to be part of the discussion, so I'll do my best to check in. Let's hold off on making any formal decisions until next week, if that is okay.

-j

On Fri, Jul 15, 2016 at 8:19 AM, Wes McKinney <wesmck...@gmail.com> wrote:

> There's 3 distinct issues here:
>
> 1) Physical memory representation
> 2) Metadata
> 3) Implementation details
>
> On these
>
> 1) I think no one will argue that String/Binary have the same memory
> representation as List<uint8 [not-null]>, and regardless of the
> implementation that you can perform a zero-copy cast without copying
> or duplicating buffers, only changing the array container metadata.
>
> 2) I'm +1 on String/Binary being logically first-class primitive
> types, with the intent that they are not considered logically nested
> types (but you can perform the cast described in #1 if you want to get
> nested data without copying).
>
> 3) The C++ code sharing / duplication issue feels slightly orthogonal
> to the above two items, which are about user semantics and metadata.
> Effectively what would change is that
> std::dynamic_pointer_cast<ListArray>(string_data) would no longer be
> valid, as in the class hierarchy, we would have
>
> - Primitive
>   - Integer
>   - Floating
>   - String
>   - ...
> - List
> - Struct
> - Union
>
> rather than the present
>
> - List
>   - String (with the type metadata always set to List<uint8 [not-null]>)
>
> From a coding point of view, I should think we would eventually want
> explicit casts that do not presume a certain C++ inheritance
> hierarchy, which might cause downstream code brittleness. Hard to
> predict this precisely at this moment.
>
> - Wes
>
> On Wed, Jul 13, 2016 at 10:28 PM, Micah Kornfield <emkornfi...@gmail.com> wrote:
> > Today String and Binary types are represented in memory as list<byte> [1]
> > and we use logical types to distinguish between a list of bytes and a
> > string type [2].
> >
> > The question of whether this is sufficient or if we should make a first
> > class string/binary type has come up tangentially on a few threads and we
> > should try to come to a conclusion if we want to add it as part of a
> > spec. I think the current proposal is that the String type would consist
> > of a null-bitmap buffer, an offset buffer and a buffer containing bytes
> > (for strings the bytes would be UTF-8 encoded strings). The main difference
> > with the list representation is that individual bytes cannot be marked as
> > null because there isn't a nested Array.
> >
> > To quote Jacques for the pros of this approach:
> >
> > My main argument is that the most basic types most people need come in
> > this order from my experience:
> >
> > Int
> > String
> > Float
> > Decimal
> > Binary
> >
> > Note that I'm not focused on width here, just generally "what people use".
> > So I think a string comes second in priority and ease of
> > use/approachability necessitate this as a first class concept. This is
> > beyond the fact that it has specialized rules that are separate from a
> > List<Byte>.
> >
> > The main argument for not doing this is it adds additional types that need
> > to be implemented and can lead to some amount of redundant code. For
> > instance, in the current C++ implementation we are able to have a String
> > Array that extends a List Type and re-use already defined equality methods
> > [3].
> >
> > What do people think?
> >
> > Thanks,
> > Micah
> >
> > [1] https://github.com/apache/arrow/blob/master/format/Layout.md
> > [2] https://github.com/apache/arrow/blob/master/format/Message.fbs
> > [3] https://github.com/apache/arrow/blob/master/cpp/src/arrow/types/string.h#L68
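
To make the layout under discussion concrete, here is a rough C++ sketch of the proposed String representation (null bitmap, offsets buffer, bytes buffer) and of the zero-copy reinterpretation as List<uint8 [not-null]> from Wes's point 1. Every name in it (StringArraySketch, ListOfUInt8Sketch, ViewAsList, MakeExample) is hypothetical and is not the Arrow C++ API. The bitmap convention used (least-significant bit first, set bit means valid) is an assumption made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical illustration types; NOT the Arrow C++ API.
using ByteBuffer = std::vector<uint8_t>;
using OffsetBuffer = std::vector<int32_t>;

// First-class String array: null bitmap + offsets + bytes, with no nested
// child array, so individual bytes cannot be marked null.
struct StringArraySketch {
  int32_t length;
  std::shared_ptr<ByteBuffer> null_bitmap;  // assumed: bit i set => value i is valid
  std::shared_ptr<OffsetBuffer> offsets;    // length + 1 entries
  std::shared_ptr<ByteBuffer> bytes;        // concatenated UTF-8 bytes

  bool IsValid(int32_t i) const {
    assert(i >= 0 && i < length);
    return (((*null_bitmap)[i / 8] >> (i % 8)) & 1) != 0;
  }

  // Copies value i out as a std::string (empty for a zero-length or null slot).
  std::string Value(int32_t i) const {
    assert(i >= 0 && i < length);
    const int32_t begin = (*offsets)[i];
    const int32_t end = (*offsets)[i + 1];
    return std::string(bytes->begin() + begin, bytes->begin() + end);
  }
};

// The same buffers seen as List<uint8 [not-null]>: only the container differs.
struct ListOfUInt8Sketch {
  int32_t length;
  std::shared_ptr<ByteBuffer> null_bitmap;
  std::shared_ptr<OffsetBuffer> offsets;
  std::shared_ptr<ByteBuffer> values;  // child uint8 values, never null
};

// "Zero-copy cast": shares the buffers via shared_ptr, copies no data.
ListOfUInt8Sketch ViewAsList(const StringArraySketch& s) {
  return ListOfUInt8Sketch{s.length, s.null_bitmap, s.offsets, s.bytes};
}

// Builds the three-element array ["ab", null, "cde"].
StringArraySketch MakeExample() {
  const std::string text = "abcde";
  StringArraySketch s;
  s.length = 3;
  s.null_bitmap = std::make_shared<ByteBuffer>(ByteBuffer{0x05});  // bits 0 and 2 set
  s.offsets = std::make_shared<OffsetBuffer>(OffsetBuffer{0, 2, 2, 5});
  s.bytes = std::make_shared<ByteBuffer>(text.begin(), text.end());
  return s;
}
```

With this sketch, ViewAsList(MakeExample()) hands back a list view whose offsets and values pointers are the very same buffers held by the string array, which is the "only changing the array container metadata" behaviour described above. It does not depend on any C++ inheritance relationship between the two types.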