I created a PR to assist with further discussion:
https://github.com/apache/arrow/pull/876

On Wed, Jul 19, 2017 at 4:14 PM, Julian Hyde <jh...@apache.org> wrote:
> I see. It took me a while to understand, but it all made sense when I
> realized that we are not looking at one Map instance but multiple
> rows, each with a Map instance, and the constituents parts of those
> Maps are stored end-to-end.
>
> Julian
>
>
> On Wed, Jul 19, 2017 at 11:42 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>> The only structural difference between
>>
>> List<Struct<K, V>>
>>
>> and
>>
>> Struct<List<K>, List<V>>
>>
>> is that in the latter case, the "key" value and the "value" value have
>> different offset vectors and thus can have different lengths.
>>
>> So in the first case we have buffer structure:
>>
>> - list null bitmap (map value is null / not null)
>> - list offsets
>> - key buffers (flattened)
>> - value buffers
>>
>> in the second case we have 1 additional buffer
>>
>> - struct null bitmap (map value is null / not null)
>> - key offsets
>> - key buffers
>> - value offsets
>> - value buffers
>>
>> We can do a zero-copy transformation from the former to the latter by
>> rearranging the buffers and reusing the list offsets buffer. In both
>> cases the keys and values are all contiguous.
>>
>> I agree that sorting is very useful; the metadata for Map should have
>> a field indicating whether or not the keys are sorted within each map
>> value
>>
>> - Wes
>>
>> On Wed, Jul 19, 2017 at 1:37 PM, Julian Hyde <jh...@apache.org> wrote:
>>> List<Struct<K, V>> isn’t the only physical representation that makes sense. 
>>> Because it doesn’t take advantage of the fact that (a) keys can be 
>>> re-ordered, (b) keys are unique.
>>>
>>> So, another viable physical representation would be Struct<List<K>, 
>>> List<V>>, with the keys sorted. If keys are constant width and in 
>>> contiguous memory then binary search is very fast.
>>>
>>> I am not claiming that this physical representation is better than yours. 
>>> But the fact that there is a more than one means it’s not a no-brainer.
>>>
>>> Julian
>>>
>>>
>>>> On Jul 18, 2017, at 12:10 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>>
>>>> I recently created https://issues.apache.org/jira/browse/ARROW-1207
>>>> and wanted to discuss on the mailing list to hear opinions about how
>>>> to proceed.
>>>>
>>>> Some systems, like Spark [1], Presto [2], or Drill have a Map<K, V>
>>>> composite type. These are sometimes stored in Parquet as a repeated
>>>> struct, or in Arrow types List<item: Struct<key: K, value: V>>.
>>>>
>>>> While we can represent in-memory map data as List<Struct<K, V>>, it
>>>> may be useful to add a new logical type to the set of supported
>>>> logical types [3]. The idea is that the memory format between Map<K,
>>>> V> and List<Struct<K, V>> is identical, so this is strictly a logical
>>>> construct, similar to date/time values having the same in-memory
>>>> format as the corresponding integer types (int32/int64)
>>>>
>>>> For Arrow implementation that do not provide a first class Map
>>>> container, they could process the data as though it were a repeated
>>>> struct. It would be helpful to us in C++ to have an arrow::MapArray
>>>> container because we could convert to / from this type and other data
>>>> structures like Python dictionaries. It would also be helpful to
>>>> faithfully transport the MAP logical type from Parquet [4]
>>>>
>>>> Let me know what others think. One question I have is whether the
>>>> repeated struct in-memory representation makes sense as the canonical
>>>> map representation.
>>>>
>>>> Thanks
>>>> Wes
>>>>
>>>> [1]: 
>>>> https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/types/MapType.html
>>>> [2]: https://prestodb.io/docs/current/functions/map.html
>>>> [3]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L157
>>>> [4]: 
>>>> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L63
>>>

Reply via email to