I see. It took me a while to understand, but it all made sense when I realized that we are not looking at one Map instance but multiple rows, each with a Map instance, and the constituents parts of those Maps are stored end-to-end.
Julian On Wed, Jul 19, 2017 at 11:42 AM, Wes McKinney <wesmck...@gmail.com> wrote: > The only structural difference between > > List<Struct<K, V>> > > and > > Struct<List<K>, List<V>> > > is that in the latter case, the "key" value and the "value" value have > different offset vectors and thus can have different lengths. > > So in the first case we have buffer structure: > > - list null bitmap (map value is null / not null) > - list offsets > - key buffers (flattened) > - value buffers > > in the second case we have 1 additional buffer > > - struct null bitmap (map value is null / not null) > - key offsets > - key buffers > - value offsets > - value buffers > > We can do a zero-copy transformation from the former to the latter by > rearranging the buffers and reusing the list offsets buffer. In both > cases the keys and values are all contiguous. > > I agree that sorting is very useful; the metadata for Map should have > a field indicating whether or not the keys are sorted within each map > value > > - Wes > > On Wed, Jul 19, 2017 at 1:37 PM, Julian Hyde <jh...@apache.org> wrote: >> List<Struct<K, V>> isn’t the only physical representation that makes sense. >> Because it doesn’t take advantage of the fact that (a) keys can be >> re-ordered, (b) keys are unique. >> >> So, another viable physical representation would be Struct<List<K>, >> List<V>>, with the keys sorted. If keys are constant width and in contiguous >> memory then binary search is very fast. >> >> I am not claiming that this physical representation is better than yours. >> But the fact that there is a more than one means it’s not a no-brainer. >> >> Julian >> >> >>> On Jul 18, 2017, at 12:10 PM, Wes McKinney <wesmck...@gmail.com> wrote: >>> >>> I recently created https://issues.apache.org/jira/browse/ARROW-1207 >>> and wanted to discuss on the mailing list to hear opinions about how >>> to proceed. >>> >>> Some systems, like Spark [1], Presto [2], or Drill have a Map<K, V> >>> composite type. These are sometimes stored in Parquet as a repeated >>> struct, or in Arrow types List<item: Struct<key: K, value: V>>. >>> >>> While we can represent in-memory map data as List<Struct<K, V>>, it >>> may be useful to add a new logical type to the set of supported >>> logical types [3]. The idea is that the memory format between Map<K, >>> V> and List<Struct<K, V>> is identical, so this is strictly a logical >>> construct, similar to date/time values having the same in-memory >>> format as the corresponding integer types (int32/int64) >>> >>> For Arrow implementation that do not provide a first class Map >>> container, they could process the data as though it were a repeated >>> struct. It would be helpful to us in C++ to have an arrow::MapArray >>> container because we could convert to / from this type and other data >>> structures like Python dictionaries. It would also be helpful to >>> faithfully transport the MAP logical type from Parquet [4] >>> >>> Let me know what others think. One question I have is whether the >>> repeated struct in-memory representation makes sense as the canonical >>> map representation. >>> >>> Thanks >>> Wes >>> >>> [1]: >>> https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/types/MapType.html >>> [2]: https://prestodb.io/docs/current/functions/map.html >>> [3]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L157 >>> [4]: >>> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L63 >>