Today, String and Binary types are represented in memory as list<byte> [1],
and we use logical types to distinguish a list of bytes from a string
[2].

The question of whether this is sufficient, or whether we should add a
first-class string/binary type, has come up tangentially on a few threads,
and we should try to reach a conclusion on whether to add it to the spec.
I think the current proposal is that the String type would consist of a
null-bitmap buffer, an offset buffer, and a buffer containing bytes (for
strings, the bytes would be UTF-8 encoded).  The main difference from the
list representation is that individual bytes cannot be marked as null,
because there is no nested Array.
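
To make the proposed layout concrete, here is a rough sketch in Python. It
is my own illustration, not taken from the spec: the class and function
names are hypothetical, and the least-significant-bit-first bitmap order is
an assumption. It only shows how the three buffers fit together.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StringArraySketch:
    """Hypothetical flat string layout: null bitmap, offsets, bytes."""

    validity: bytes     # one bit per slot; bit set means "not null" (LSB first, assumed)
    offsets: List[int]  # length + 1 entries pointing into `data`
    data: bytes         # concatenated UTF-8 bytes of all non-null values

    def __len__(self) -> int:
        return len(self.offsets) - 1

    def __getitem__(self, i: int) -> Optional[str]:
        # Nullness is tracked per slot only; there is no per-byte bitmap.
        if not (self.validity[i // 8] >> (i % 8)) & 1:
            return None
        return self.data[self.offsets[i]:self.offsets[i + 1]].decode("utf-8")


def build(values: List[Optional[str]]) -> StringArraySketch:
    """Pack Python strings (or None) into the three buffers."""
    validity = bytearray((len(values) + 7) // 8)
    offsets = [0]
    data = bytearray()
    for i, v in enumerate(values):
        if v is not None:
            validity[i // 8] |= 1 << (i % 8)
            data += v.encode("utf-8")
        # A null slot repeats the previous offset, so it occupies no bytes.
        offsets.append(len(data))
    return StringArraySketch(bytes(validity), offsets, bytes(data))


arr = build(["arrow", None, "", "héllo"])
```

Note that a null slot and an empty string both take zero bytes in the data
buffer; only the bitmap tells them apart.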

To quote Jacques for the pros of this approach:

 My main argument is that the most basic types most people need come in
this order from my experience:

Int
String
Float
Decimal
Binary

Note that I'm not focused on width here, just generally "what people use".
So I think a string comes second in priority and ease of
use/approachability necessitate this as a first class concept. This is
beyond the fact that it has specialized rules that are separate from a
List<Byte>.



The main argument for not doing this is that it adds types that need to be
implemented and can lead to some amount of redundant code.  For instance,
in the current C++ implementation we are able to have a String Array that
extends a List Type and re-uses the equality methods already defined there
[3].
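
The re-use argument can be sketched roughly as follows. This is a Python
illustration only, not the C++ code in [3]: the class names are
hypothetical, nulls are ignored for brevity, and equality assumes both
arrays start at offset 0.

```python
from typing import List


class ListOfBytes:
    """Hypothetical list<byte> array: offsets plus a flat child of bytes."""

    def __init__(self, offsets: List[int], values: bytes):
        self.offsets = offsets
        self.values = values

    def __len__(self) -> int:
        return len(self.offsets) - 1

    def equals(self, other: "ListOfBytes") -> bool:
        # Defined once here; subclasses inherit it unchanged.
        return (
            type(self) is type(other)
            and self.offsets == other.offsets
            and self.values == other.values
        )


class StringOverList(ListOfBytes):
    """Hypothetical string array that only adds UTF-8 handling."""

    @classmethod
    def from_strings(cls, strings: List[str]) -> "StringOverList":
        offsets, data = [0], bytearray()
        for s in strings:
            data += s.encode("utf-8")
            offsets.append(len(data))
        return cls(offsets, bytes(data))

    def get(self, i: int) -> str:
        return self.values[self.offsets[i]:self.offsets[i + 1]].decode("utf-8")


a = StringOverList.from_strings(["ab", "c"])
b = StringOverList.from_strings(["ab", "c"])
```

With a separate first-class type, something like `equals` would have to be
written again for the new layout rather than inherited.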

What do people think?

Thanks,
Micah

[1] https://github.com/apache/arrow/blob/master/format/Layout.md
[2] https://github.com/apache/arrow/blob/master/format/Message.fbs
[3] https://github.com/apache/arrow/blob/master/cpp/src/arrow/types/string.h#L68
