Am I correct that under the current draft spec, the values in meaningless
slots (such as slots corresponding to a 1 in an associated null bitmask, or
unused slots in a sparse union) are undefined?

If so, might it be worth considering requiring that they be zeroed out?
This would add some time overhead to array building (for sparse unions,
memory could be zeroed out when it is allocated; for nullable arrays, it
may be more efficient to do so when nulls are added), but would remove most
of the complexity from hash calculations and equality comparisons. (See
ARROW-32 and ARROW-38.)  For example, comparison of nullable primitive
arrays could be done with 2 calls to memcmp.  As a bonus, it would speed up
operations for which null and 0 are equivalent or near-equivalent (such as
sum or sort on unsigned integers).
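
To make the two-memcmp idea concrete, here is a rough sketch in C. The
struct and function names are hypothetical and the layout is only an
assumption for illustration (one bit per slot, 1 = null, values stored as
int32_t); it is not taken from the draft spec. It is only correct if null
slots and any padding bits in the last bitmask byte are guaranteed to be
zero, which is exactly the requirement being proposed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout for illustration only; not the draft spec's layout. */
typedef struct {
    size_t length;            /* number of slots */
    const uint8_t *null_bits; /* 1 bit per slot, 1 = null; padding bits are 0 */
    const int32_t *values;    /* values[i] is 0 wherever slot i is null */
} NullableInt32Array;

/* Equality with two memcmp calls. This relies on the zeroing guarantee:
 * without it, two logically equal arrays could differ in the bytes that
 * sit under null slots. Assumes both buffers are non-NULL. */
bool nullable_int32_equal(const NullableInt32Array *a,
                          const NullableInt32Array *b) {
    if (a->length != b->length) {
        return false;
    }
    size_t bitmap_bytes = (a->length + 7) / 8;
    return memcmp(a->null_bits, b->null_bits, bitmap_bytes) == 0 &&
           memcmp(a->values, b->values, a->length * sizeof(int32_t)) == 0;
}
```

The same guarantee would let a hash function run over the two raw buffers
without inspecting individual slots.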

Other than the overhead, one potential downside is that it would make it
impossible to just add a null_bitmask over an existing array without
copying all of the memory. Perhaps this would best be done with a
separately defined masking vector, which would not require that all masked
values be set to 0 (and would thus require more complex hashing algorithms).
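
For contrast, a sketch of what equality might look like for such a masking
vector, where the values under masked slots are arbitrary. Again, the names
and the layout (one bit per slot, least significant bit first, 1 = masked)
are assumptions for illustration only. The comparison has to go slot by
slot and skip masked values, and a hash function would need the same
per-slot check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if bit i of the mask is set (LSB-first). */
static bool mask_bit(const uint8_t *mask, size_t i) {
    return ((mask[i / 8] >> (i % 8)) & 1u) != 0;
}

/* Equality when masked slots may hold arbitrary bytes: the values under
 * a masked slot must be ignored, so a plain memcmp over the value buffer
 * is not sufficient. */
bool masked_int32_equal(const int32_t *a_values, const uint8_t *a_mask,
                        const int32_t *b_values, const uint8_t *b_mask,
                        size_t length) {
    for (size_t i = 0; i < length; i++) {
        bool a_masked = mask_bit(a_mask, i);
        bool b_masked = mask_bit(b_mask, i);
        if (a_masked != b_masked) {
            return false; /* one slot is masked, the other is not */
        }
        if (!a_masked && a_values[i] != b_values[i]) {
            return false; /* both slots are live and the values differ */
        }
    }
    return true;
}
```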
