[ 
https://issues.apache.org/jira/browse/ARROW-38?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192523#comment-15192523
 ] 

Dan Robinson commented on ARROW-38:
-----------------------------------

I think a breadth-first traversal of each slot might be simplest. You can 
recursively compute a unique "record" for each top-level slot by concatenating 
the range of values corresponding to that slot with the record of that slot for 
each of its child arrays. 

For each level, you could define a "record array."

* For a primitive array, the record array is simply the values array.
* For a list, the record array is the array of lengths of its slots, i.e. the 
first order difference of its offset array.  (This could be precomputed to take 
advantage of SIMD and make the algorithms more parallelizable, or calculated 
while traversing so it only takes constant space.)
* For a sparse union, the record array is the array of types.
* A dense union seems somewhat more complex; I'm sure there's a better way, but 
at worst we could just convert it to a sparse union before calculating hashes.
* For a struct, there is no record array (the contents of the record are 
entirely determined by the contents of its child arrays).
* For a nullable array, the record array is the null bitmask. (If all arrays 
are considered nullable, rather than nullable arrays being treated as a nested 
"wrapper" around other types, this logic can be included in all array types.)

The record arrays for each level of the example array shown at 
https://github.com/apache/arrow/blob/master/format/diagrams/layout-list-of-list.png
 would therefore be:

2 | 3
2 | 3 | 3 | 1 | 2
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10

Then, for each of the two top-level slots in the array, do a breadth-first 
traversal of the record arrays, over the ranges corresponding to that slot. (We 
could also precalculate the "top-level offsets" for each array by successively 
slicing offset arrays.)

The record of each slot in the example list would be:

2 | 2 | 3 | 0 | 1 | 2 | 3 | 4 | 
3 | 3 | 1 | 2 | 5 | 6 | 7 | 8 | 9 | 10

These records would not actually need to be created in memory; hashes could be 
calculated sequentially, and comparisons could use one memcmp for each level.

(Note, as I alluded to on the e-mail list, that the logic for sparse unions and 
nullable arrays depends on "empty" slots in child arrays being initialized to 
0.)

> C++: Algorithms for using nested types in a hash table context
> --------------------------------------------------------------
>
>                 Key: ARROW-38
>                 URL: https://issues.apache.org/jira/browse/ARROW-38
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>
> Computing hash values (and performing equality comparisons) for top-level 
> slots in nested-type data (for example, computing DISTINCT on a 
> {{List<List<Int32>>}}, related: ARROW-32) can be fairly complex. 
> Additionally, value slots at any level of the type tree can be null. 
> We should explore various algorithms for their performance and memory use in 
> practical settings. For example, one can compute a contiguous "record" / byte 
> array resulting from a depth-first traversal of a single value slot for the 
> purposes of computing a hash value or comparing with another slot. If anyone 
> has other ideas from past experiences I would be keen to learn more.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to