I see. It took me a while to understand, but it all made sense when I
realized that we are not looking at one Map instance but multiple
rows, each with a Map instance, and the constituents parts of those
Maps are stored end-to-end.

Julian


On Wed, Jul 19, 2017 at 11:42 AM, Wes McKinney <wesmck...@gmail.com> wrote:
> The only structural difference between
>
> List<Struct<K, V>>
>
> and
>
> Struct<List<K>, List<V>>
>
> is that in the latter case, the "key" value and the "value" value have
> different offset vectors and thus can have different lengths.
>
> So in the first case we have buffer structure:
>
> - list null bitmap (map value is null / not null)
> - list offsets
> - key buffers (flattened)
> - value buffers
>
> in the second case we have 1 additional buffer
>
> - struct null bitmap (map value is null / not null)
> - key offsets
> - key buffers
> - value offsets
> - value buffers
>
> We can do a zero-copy transformation from the former to the latter by
> rearranging the buffers and reusing the list offsets buffer. In both
> cases the keys and values are all contiguous.
>
> I agree that sorting is very useful; the metadata for Map should have
> a field indicating whether or not the keys are sorted within each map
> value
>
> - Wes
>
> On Wed, Jul 19, 2017 at 1:37 PM, Julian Hyde <jh...@apache.org> wrote:
>> List<Struct<K, V>> isn’t the only physical representation that makes sense. 
>> Because it doesn’t take advantage of the fact that (a) keys can be 
>> re-ordered, (b) keys are unique.
>>
>> So, another viable physical representation would be Struct<List<K>, 
>> List<V>>, with the keys sorted. If keys are constant width and in contiguous 
>> memory then binary search is very fast.
>>
>> I am not claiming that this physical representation is better than yours. 
>> But the fact that there is a more than one means it’s not a no-brainer.
>>
>> Julian
>>
>>
>>> On Jul 18, 2017, at 12:10 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>
>>> I recently created https://issues.apache.org/jira/browse/ARROW-1207
>>> and wanted to discuss on the mailing list to hear opinions about how
>>> to proceed.
>>>
>>> Some systems, like Spark [1], Presto [2], or Drill have a Map<K, V>
>>> composite type. These are sometimes stored in Parquet as a repeated
>>> struct, or in Arrow types List<item: Struct<key: K, value: V>>.
>>>
>>> While we can represent in-memory map data as List<Struct<K, V>>, it
>>> may be useful to add a new logical type to the set of supported
>>> logical types [3]. The idea is that the memory format between Map<K,
>>> V> and List<Struct<K, V>> is identical, so this is strictly a logical
>>> construct, similar to date/time values having the same in-memory
>>> format as the corresponding integer types (int32/int64)
>>>
>>> For Arrow implementation that do not provide a first class Map
>>> container, they could process the data as though it were a repeated
>>> struct. It would be helpful to us in C++ to have an arrow::MapArray
>>> container because we could convert to / from this type and other data
>>> structures like Python dictionaries. It would also be helpful to
>>> faithfully transport the MAP logical type from Parquet [4]
>>>
>>> Let me know what others think. One question I have is whether the
>>> repeated struct in-memory representation makes sense as the canonical
>>> map representation.
>>>
>>> Thanks
>>> Wes
>>>
>>> [1]: 
>>> https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/types/MapType.html
>>> [2]: https://prestodb.io/docs/current/functions/map.html
>>> [3]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L157
>>> [4]: 
>>> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L63
>>

Reply via email to