Re: Adding a Map logical type to the Arrow metadata

Wes McKinney Wed, 19 Jul 2017 11:44:44 -0700

The only structural difference between

List<Struct<K, V>>


and

Struct<List<K>, List<V>>

is that in the latter case, the "key" value and the "value" value have
different offset vectors and thus can have different lengths.

So in the first case we have buffer structure:

- list null bitmap (map value is null / not null)
- list offsets
- key buffers (flattened)
- value buffers

in the second case we have 1 additional buffer

- struct null bitmap (map value is null / not null)
- key offsets
- key buffers
- value offsets
- value buffers

We can do a zero-copy transformation from the former to the latter by
rearranging the buffers and reusing the list offsets buffer. In both
cases the keys and values are all contiguous.

I agree that sorting is very useful; the metadata for Map should have
a field indicating whether or not the keys are sorted within each map
value

- Wes

On Wed, Jul 19, 2017 at 1:37 PM, Julian Hyde <[email protected]> wrote:
> List<Struct<K, V>> isn’t the only physical representation that makes sense. 
> Because it doesn’t take advantage of the fact that (a) keys can be 
> re-ordered, (b) keys are unique.
>
> So, another viable physical representation would be Struct<List<K>, List<V>>, 
> with the keys sorted. If keys are constant width and in contiguous memory 
> then binary search is very fast.
>
> I am not claiming that this physical representation is better than yours. But 
> the fact that there is a more than one means it’s not a no-brainer.
>
> Julian
>
>
>> On Jul 18, 2017, at 12:10 PM, Wes McKinney <[email protected]> wrote:
>>
>> I recently created https://issues.apache.org/jira/browse/ARROW-1207
>> and wanted to discuss on the mailing list to hear opinions about how
>> to proceed.
>>
>> Some systems, like Spark [1], Presto [2], or Drill have a Map<K, V>
>> composite type. These are sometimes stored in Parquet as a repeated
>> struct, or in Arrow types List<item: Struct<key: K, value: V>>.
>>
>> While we can represent in-memory map data as List<Struct<K, V>>, it
>> may be useful to add a new logical type to the set of supported
>> logical types [3]. The idea is that the memory format between Map<K,
>> V> and List<Struct<K, V>> is identical, so this is strictly a logical
>> construct, similar to date/time values having the same in-memory
>> format as the corresponding integer types (int32/int64)
>>
>> For Arrow implementation that do not provide a first class Map
>> container, they could process the data as though it were a repeated
>> struct. It would be helpful to us in C++ to have an arrow::MapArray
>> container because we could convert to / from this type and other data
>> structures like Python dictionaries. It would also be helpful to
>> faithfully transport the MAP logical type from Parquet [4]
>>
>> Let me know what others think. One question I have is whether the
>> repeated struct in-memory representation makes sense as the canonical
>> map representation.
>>
>> Thanks
>> Wes
>>
>> [1]: 
>> https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/types/MapType.html
>> [2]: https://prestodb.io/docs/current/functions/map.html
>> [3]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L157
>> [4]: 
>> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L63
>

Re: Adding a Map logical type to the Arrow metadata

Reply via email to