Am I correct that under the current draft spec, the values in meaningless slots (such as slots corresponding to a 1 in an associated null bitmask, or unused slots in a sparse union) are undefined?
If so, might it be worth requiring that they be zeroed out? This would add some time overhead to array building (for sparse unions, memory could be zeroed when it is allocated; for nullable arrays, it may be more efficient to zero slots as nulls are added), but it would remove most of the complexity from hash calculations and equality comparisons. (See ARROW-32 and ARROW-38.) For example, comparison of nullable primitive arrays could be done with 2 calls to memcmp: one over the null bitmasks and one over the value buffers. As a bonus, it would speed up operations for which null and 0 are equivalent or near-equivalent (such as sum or sort on unsigned integers). Other than the overhead, one potential downside is that it would become impossible to add a null_bitmask over an existing array without copying all of the memory. Perhaps that case would best be handled by a separately defined masking vector, which would not require that all masked values be set to 0 (and would thus require more complex hashing algorithms).
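To make the memcmp idea concrete, here is a minimal sketch, not Arrow code. The `NullableInt32Array` type, its `Set`/`SetNull` methods, and the `Equals` function are hypothetical names made up for illustration. It assumes a set bit means null (as in the question above), that null slots are zeroed when they are marked null, and that the padding bits in the last bitmask byte stay zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical nullable int32 array (illustration only, not the Arrow API).
// Assumption: bit i of null_bitmask set to 1 means slot i is null.
struct NullableInt32Array {
  std::vector<int32_t> values;
  std::vector<uint8_t> null_bitmask;

  // Both buffers start zeroed, so padding bits in the last byte are 0.
  explicit NullableInt32Array(std::size_t n)
      : values(n, 0), null_bitmask((n + 7) / 8, 0) {}

  void Set(std::size_t i, int32_t v) {
    assert(i < values.size());
    values[i] = v;
    null_bitmask[i / 8] &= static_cast<uint8_t>(~(1u << (i % 8)));
  }

  void SetNull(std::size_t i) {
    assert(i < values.size());
    values[i] = 0;  // the proposed rule: zero the meaningless slot
    null_bitmask[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
  }
};

// Equality with two memcmp calls: one over the bitmasks, one over the values.
// This is only valid because null slots and padding bits are guaranteed zero.
bool Equals(const NullableInt32Array& a, const NullableInt32Array& b) {
  if (a.values.size() != b.values.size()) return false;
  const std::size_t n = a.values.size();
  if (n == 0) return true;  // avoid passing possibly-null data() to memcmp
  return std::memcmp(a.null_bitmask.data(), b.null_bitmask.data(),
                     a.null_bitmask.size()) == 0 &&
         std::memcmp(a.values.data(), b.values.data(),
                     n * sizeof(int32_t)) == 0;
}
```

Without the zeroing rule, `Equals` would have to walk the slots and skip the ones marked null, since two logically equal arrays could hold different garbage under their nulls. Note that this sketch only covers integer values; bytewise comparison of floating-point buffers would treat 0.0 and -0.0 as different, so that case would need separate consideration.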