hey Dan,

You bring up a good point. It's unclear whether this would make
hashing much less complex (if you ignore the null bits when computing
the hash function, then it probably would, especially on repeated
fields. We'd need to do some experiments to see if there are cases
where skipping the null bits would yield bad hashes. I'm not sure)

For the memory comparison question, I'd like to hear from the Drill
team on their experience with ValueVectors and how they handle the
"null" space in arrays. In the case of list / repeated values, you
*do* need to initialize the null offset slot to the previous value so
that offset computations always work. For example, the list<int32>
array

[[0, 1, 2], null, null, [4, 5, 6]]

would need to have

nulls [LSB right-to-left ordering]: 0 0 0 0 0 1 1 0
offsets: 0 3 3 3 6
values 0 1 2 4 5 6

- Wes

On Mon, Mar 7, 2016 at 4:08 PM, Daniel Robinson
<danrobinson...@gmail.com> wrote:
> Am I correct that under the current draft spec, the values in meaningless
> slots (such as slots corresponding to a 1 in an associated null bitmask, or
> unused slots in a sparse union) are undefined?
>
> If so, might it be worth considering requiring that they be zeroed out?
> This would add some time overhead to array building (for sparse unions,
> memory could be zeroed out when it is allocated; for nullable arrays, it
> may be more efficient to do so when nulls are added), but would remove most
> of the complexity from hash calculations and equality comparisons. (See
> ARROW-32 and ARROW-38.)  For example, comparison of nullable primitive
> arrays could be done with 2 calls to memcmp.  As a bonus, it would speed up
> operations for which null and 0 are equivalent or near-equivalent (such as
> sum, sort on unsigned integers, ).
>
> Other than the overhead, one potential downside is that it would make it
> impossible to just add a null_bitmask over an existing array without
> copying all of the memory. Perhaps this would best be done with a
> separately defined masking vector, which would not require that all masked
> values be set to 0 (and would thus require more complex hashing algorithms).

Reply via email to