I recently created https://issues.apache.org/jira/browse/ARROW-1207
and wanted to discuss on the mailing list to hear opinions about how
to proceed.

Some systems, such as Spark [1], Presto [2], and Drill, have a Map<K, V>
composite type. Such maps are sometimes stored in Parquet as a repeated
struct, or in Arrow terms as List<item: Struct<key: K, value: V>>.

While we can represent in-memory map data as List<Struct<K, V>>, it
may be useful to add a new logical type to the set of supported
logical types [3]. The idea is that the memory format of Map<K, V>
and List<Struct<K, V>> is identical, so this is strictly a logical
construct, similar to how date/time values have the same in-memory
format as the corresponding integer types (int32/int64).
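
To make the layout concrete, here is a rough pure-Python sketch of how a
column of maps could be flattened into the repeated-struct form: one
offsets buffer plus flat child arrays for keys and values. This is only
an illustration under my own assumptions. encode_maps is a hypothetical
helper, not an Arrow API, and it ignores nulls entirely.

```python
def encode_maps(rows):
    """Flatten a list of dicts into (offsets, keys, values).

    Hypothetical helper for illustration only; nulls are not handled.
    Row i owns the entries keys[offsets[i]:offsets[i + 1]] and the
    matching slice of values.
    """
    offsets = [0]
    keys = []
    values = []
    for row in rows:
        for k, v in row.items():
            keys.append(k)
            values.append(v)
        # Each row closes by recording where its entries end.
        offsets.append(len(keys))
    return offsets, keys, values


offsets, keys, values = encode_maps([{"a": 1, "b": 2}, {}, {"c": 3}])
print(offsets)  # [0, 2, 2, 3]
print(keys)     # ['a', 'b', 'c']
print(values)   # [1, 2, 3]
```

The same three buffers would describe either List<Struct<K, V>> or the
proposed Map<K, V>; only the type recorded in the schema would differ.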

Arrow implementations that do not provide a first-class Map
container could process the data as though it were a repeated
struct. It would be helpful to us in C++ to have an arrow::MapArray
container because we could convert between this type and other data
structures like Python dictionaries. It would also be helpful to
faithfully transport the MAP logical type from Parquet [4].
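
As a sketch of the conversion in the other direction, the snippet below
rebuilds Python dictionaries from a repeated-struct layout given as an
offsets list and flat key and value lists. It is pure Python and purely
illustrative: decode_maps is a hypothetical name, not part of any Arrow
library, and nulls are not considered.

```python
def decode_maps(offsets, keys, values):
    """Rebuild one dict per row from (offsets, keys, values).

    Hypothetical helper for illustration only; nulls are not handled.
    If a row repeats a key, the last value wins, because that is how
    dict() treats duplicates.
    """
    rows = []
    # Consecutive offset pairs delimit each row's slice of entries.
    for start, end in zip(offsets, offsets[1:]):
        rows.append(dict(zip(keys[start:end], values[start:end])))
    return rows


print(decode_maps([0, 2, 2, 3], ["a", "b", "c"], [1, 2, 3]))
# [{'a': 1, 'b': 2}, {}, {'c': 3}]
```

Note that a plain repeated struct does not by itself rule out duplicate
keys within a row, so a map container would have to pick a policy for
them; this sketch simply keeps the last value.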

Let me know what others think. One question I have is whether the
repeated struct in-memory representation makes sense as the canonical
map representation.

Thanks
Wes

[1]: https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/types/MapType.html
[2]: https://prestodb.io/docs/current/functions/map.html
[3]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L157
[4]: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L63
