I created a PR to assist with further discussion: https://github.com/apache/arrow/pull/876
On Wed, Jul 19, 2017 at 4:14 PM, Julian Hyde <jh...@apache.org> wrote: > I see. It took me a while to understand, but it all made sense when I > realized that we are not looking at one Map instance but multiple > rows, each with a Map instance, and the constituents parts of those > Maps are stored end-to-end. > > Julian > > > On Wed, Jul 19, 2017 at 11:42 AM, Wes McKinney <wesmck...@gmail.com> wrote: >> The only structural difference between >> >> List<Struct<K, V>> >> >> and >> >> Struct<List<K>, List<V>> >> >> is that in the latter case, the "key" value and the "value" value have >> different offset vectors and thus can have different lengths. >> >> So in the first case we have buffer structure: >> >> - list null bitmap (map value is null / not null) >> - list offsets >> - key buffers (flattened) >> - value buffers >> >> in the second case we have 1 additional buffer >> >> - struct null bitmap (map value is null / not null) >> - key offsets >> - key buffers >> - value offsets >> - value buffers >> >> We can do a zero-copy transformation from the former to the latter by >> rearranging the buffers and reusing the list offsets buffer. In both >> cases the keys and values are all contiguous. >> >> I agree that sorting is very useful; the metadata for Map should have >> a field indicating whether or not the keys are sorted within each map >> value >> >> - Wes >> >> On Wed, Jul 19, 2017 at 1:37 PM, Julian Hyde <jh...@apache.org> wrote: >>> List<Struct<K, V>> isn’t the only physical representation that makes sense. >>> Because it doesn’t take advantage of the fact that (a) keys can be >>> re-ordered, (b) keys are unique. >>> >>> So, another viable physical representation would be Struct<List<K>, >>> List<V>>, with the keys sorted. If keys are constant width and in >>> contiguous memory then binary search is very fast. >>> >>> I am not claiming that this physical representation is better than yours. >>> But the fact that there is a more than one means it’s not a no-brainer. >>> >>> Julian >>> >>> >>>> On Jul 18, 2017, at 12:10 PM, Wes McKinney <wesmck...@gmail.com> wrote: >>>> >>>> I recently created https://issues.apache.org/jira/browse/ARROW-1207 >>>> and wanted to discuss on the mailing list to hear opinions about how >>>> to proceed. >>>> >>>> Some systems, like Spark [1], Presto [2], or Drill have a Map<K, V> >>>> composite type. These are sometimes stored in Parquet as a repeated >>>> struct, or in Arrow types List<item: Struct<key: K, value: V>>. >>>> >>>> While we can represent in-memory map data as List<Struct<K, V>>, it >>>> may be useful to add a new logical type to the set of supported >>>> logical types [3]. The idea is that the memory format between Map<K, >>>> V> and List<Struct<K, V>> is identical, so this is strictly a logical >>>> construct, similar to date/time values having the same in-memory >>>> format as the corresponding integer types (int32/int64) >>>> >>>> For Arrow implementation that do not provide a first class Map >>>> container, they could process the data as though it were a repeated >>>> struct. It would be helpful to us in C++ to have an arrow::MapArray >>>> container because we could convert to / from this type and other data >>>> structures like Python dictionaries. It would also be helpful to >>>> faithfully transport the MAP logical type from Parquet [4] >>>> >>>> Let me know what others think. One question I have is whether the >>>> repeated struct in-memory representation makes sense as the canonical >>>> map representation. >>>> >>>> Thanks >>>> Wes >>>> >>>> [1]: >>>> https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/types/MapType.html >>>> [2]: https://prestodb.io/docs/current/functions/map.html >>>> [3]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L157 >>>> [4]: >>>> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L63 >>>